From 5fe690a5946442f7f2d08448341dd621367f80bc Mon Sep 17 00:00:00 2001 From: Matthew Cassell Date: Mon, 22 Jul 2024 17:13:16 +0000 Subject: [PATCH 001/416] mm: add node_reclaim successes to VM event counters /proc/vmstat currently shows the number of node_reclaim() failures when vm.zone_reclaim_mode is set appropriately. It would be convenient to have the number of successes right next to zone_reclaim_failed (similar to compaction and migration). While just a trivial addition to the vmstat file, it was helpful during benchmarking to not have to probe node_reclaim() to observe the success/failure ratio. Link: https://lkml.kernel.org/r/20240722171316.7517-1-mcassell411@gmail.com Signed-off-by: Matthew Cassell Cc: Domenico Cerasuolo Cc: "Huang, Ying" Cc: Li Zhijian Cc: Yosry Ahmed Signed-off-by: Andrew Morton --- include/linux/vm_event_item.h | 1 + mm/vmscan.c | 4 +++- mm/vmstat.c | 1 + 3 files changed, 5 insertions(+), 1 deletion(-) diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 747943bc8cc2..6fff8ed3b1fd 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -50,6 +50,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, PGSTEAL_ANON, PGSTEAL_FILE, #ifdef CONFIG_NUMA + PGSCAN_ZONE_RECLAIM_SUCCESS, PGSCAN_ZONE_RECLAIM_FAILED, #endif PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL, diff --git a/mm/vmscan.c b/mm/vmscan.c index cfa839284b92..c9682c8a1d75 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -7539,7 +7539,9 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order) ret = __node_reclaim(pgdat, gfp_mask, order); clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags); - if (!ret) + if (ret) + count_vm_event(PGSCAN_ZONE_RECLAIM_SUCCESS); + else count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED); return ret; diff --git a/mm/vmstat.c b/mm/vmstat.c index e875f2a4915f..1ac2a569b6cf 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1314,6 +1314,7 @@ const char * const vmstat_text[] = { "pgsteal_file", #ifdef CONFIG_NUMA + "zone_reclaim_success", "zone_reclaim_failed", #endif "pginodesteal", From 3ddc2fefe6f34ec870fcb22f4cd656e53db079f2 Mon Sep 17 00:00:00 2001 From: Danilo Krummrich Date: Mon, 22 Jul 2024 18:29:23 +0200 Subject: [PATCH 002/416] mm: vmalloc: implement vrealloc() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "Align kvrealloc() with krealloc()", v2. Besides the obvious (and desired) difference between krealloc() and kvrealloc(), there is some inconsistency in their function signatures and behavior: - krealloc() frees the memory when the requested size is zero, whereas kvrealloc() simply returns a pointer to the existing allocation. - krealloc() behaves like kmalloc() if a NULL pointer is passed, whereas kvrealloc() does not accept a NULL pointer at all and, if passed, would fault instead. - krealloc() is self-contained, whereas kvrealloc() relies on the caller to provide the size of the previous allocation. Inconsistent behavior throughout allocation APIs is error prone, hence make kvrealloc() behave like krealloc(), which seems superior in all mentioned aspects. In order to be able to get rid of kvrealloc()'s oldsize parameter, introduce vrealloc() and make use of it in kvrealloc(). Making use of vrealloc() in kvrealloc() also provides opportunities to grow (and shrink) allocations more efficiently.
For instance, vrealloc() can be optimized to allocate and map additional pages to grow the allocation or unmap and free unused pages to shrink the allocation. Besides the above, those functions are required by Rust's allocator abstractions [1] (rework based on this series in [2]). With `Vec` or `KVec` respectively, potentially growing (and shrinking) data structures are rather common. [1] https://lore.kernel.org/lkml/20240704170738.3621-1-dakr@redhat.com/ [2] https://git.kernel.org/pub/scm/linux/kernel/git/dakr/linux.git/log/?h=rust/mm This patch (of 2): Implement vrealloc() analogous to krealloc(). Currently, kvrealloc() requires the caller to pass the size of the previous memory allocation, which, instead, should be self-contained. We attempt to fix this in a subsequent patch which, in order to do so, requires vrealloc(). Besides that, we need realloc() functions for kernel allocators in Rust too. With `Vec` or `KVec` respectively, potentially growing (and shrinking) data structures are rather common. [dakr@kernel.org: fix missing nommu implementation] Link: https://lkml.kernel.org/r/20240725141227.13954-1-dakr@kernel.org [dakr@kernel.org: document concurrency restrictions] Link: https://lkml.kernel.org/r/20240725125442.4957-1-dakr@kernel.org [dakr@kernel.org: consider spare memory for __GFP_ZERO] Link: https://lkml.kernel.org/r/20240730185049.6244-3-dakr@kernel.org [dakr@kernel.org: properly document __GFP_ZERO behavior] Link: https://lkml.kernel.org/r/20240730185049.6244-4-dakr@kernel.org Link: https://lkml.kernel.org/r/20240722163111.4766-1-dakr@kernel.org Link: https://lkml.kernel.org/r/20240722163111.4766-2-dakr@kernel.org Signed-off-by: Danilo Krummrich Acked-by: Michal Hocko Acked-by: Vlastimil Babka Cc: Chandan Babu R Cc: Christian König Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim Cc: Kees Cook Cc: Marc Zyngier Cc: Michael Ellerman Cc: Miguel Ojeda Cc: Oliver Upton Cc: Pekka Enberg Cc: Roman Gushchin Cc: Uladzislau Rezki Cc: Wedson Almeida Filho Signed-off-by: Andrew Morton --- include/linux/vmalloc.h | 4 +++ mm/nommu.c | 5 +++ mm/vmalloc.c | 70 +++++++++++++++++++++++++++++++++++++++++ 3 files changed, 79 insertions(+) diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index e4a631ec430b..ad2ce7a6ab7a 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -189,6 +189,10 @@ extern void *__vcalloc_noprof(size_t n, size_t size, gfp_t flags) __alloc_size(1 extern void *vcalloc_noprof(size_t n, size_t size) __alloc_size(1, 2); #define vcalloc(...) alloc_hooks(vcalloc_noprof(__VA_ARGS__)) +void * __must_check vrealloc_noprof(const void *p, size_t size, gfp_t flags) + __realloc_size(2); +#define vrealloc(...)
alloc_hooks(vrealloc_noprof(__VA_ARGS__)) + extern void vfree(const void *addr); extern void vfree_atomic(const void *addr); diff --git a/mm/nommu.c b/mm/nommu.c index 7296e775e04e..40cac1348b40 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -126,6 +126,11 @@ void *__vmalloc_noprof(unsigned long size, gfp_t gfp_mask) } EXPORT_SYMBOL(__vmalloc_noprof); +void *vrealloc_noprof(const void *p, size_t size, gfp_t flags) +{ + return krealloc_noprof(p, size, (flags | __GFP_COMP) & ~__GFP_HIGHMEM); +} + void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align, unsigned long start, unsigned long end, gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags, int node, diff --git a/mm/vmalloc.c b/mm/vmalloc.c index af2de36549d6..884e21fed28f 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -4030,6 +4030,76 @@ void *vzalloc_node_noprof(unsigned long size, int node) } EXPORT_SYMBOL(vzalloc_node_noprof); +/** + * vrealloc - reallocate virtually contiguous memory; contents remain unchanged + * @p: object to reallocate memory for + * @size: the size to reallocate + * @flags: the flags for the page level allocator + * + * If @p is %NULL, vrealloc() behaves exactly like vmalloc(). If @size is 0 and + * @p is not a %NULL pointer, the object pointed to is freed. + * + * If __GFP_ZERO logic is requested, callers must ensure that, starting with the + * initial memory allocation, every subsequent call to this API for the same + * memory allocation is flagged with __GFP_ZERO. Otherwise, it is possible that + * __GFP_ZERO is not fully honored by this API. + * + * In any case, the contents of the object pointed to are preserved up to the + * lesser of the new and old sizes. + * + * This function must not be called concurrently with itself or vfree() for the + * same memory allocation. + * + * Return: pointer to the allocated memory; %NULL if @size is zero or in case of + * failure + */ +void *vrealloc_noprof(const void *p, size_t size, gfp_t flags) +{ + size_t old_size = 0; + void *n; + + if (!size) { + vfree(p); + return NULL; + } + + if (p) { + struct vm_struct *vm; + + vm = find_vm_area(p); + if (unlikely(!vm)) { + WARN(1, "Trying to vrealloc() nonexistent vm area (%p)\n", p); + return NULL; + } + + old_size = get_vm_area_size(vm); + } + + /* + * TODO: Shrink the vm_area, i.e. unmap and free unused pages. What + * would be a good heuristic for when to shrink the vm_area? + */ + if (size <= old_size) { + /* Zero out spare memory. */ + if (want_init_on_alloc(flags)) + memset((void *)p + size, 0, old_size - size); + + return (void *)p; + } + + /* TODO: Grow the vm_area, i.e. allocate and map additional pages. */ + n = __vmalloc_noprof(size, flags); + if (!n) + return NULL; + + if (p) { + memcpy(n, p, old_size); + vfree(p); + } + + return n; +} + #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32) #define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL) #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA) From 590b9d576caec6b4c46bba49ed36223a399c3fc5 Mon Sep 17 00:00:00 2001 From: Danilo Krummrich Date: Mon, 22 Jul 2024 18:29:24 +0200 Subject: [PATCH 003/416] mm: kvmalloc: align kvrealloc() with krealloc() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Besides the obvious (and desired) difference between krealloc() and kvrealloc(), there is some inconsistency in their function signatures and behavior: - krealloc() frees the memory when the requested size is zero, whereas kvrealloc() simply returns a pointer to the existing allocation. 
- krealloc() behaves like kmalloc() if a NULL pointer is passed, whereas kvrealloc() does not accept a NULL pointer at all and, if passed, would fault instead. - krealloc() is self-contained, whereas kvrealloc() relies on the caller to provide the size of the previous allocation. Inconsistent behavior throughout allocation APIs is error prone, hence make kvrealloc() behave like krealloc(), which seems superior in all mentioned aspects. Besides that, implementing kvrealloc() by making use of krealloc() and vrealloc() provides opportunities to grow (and shrink) allocations more efficiently. For instance, vrealloc() can be optimized to allocate and map additional pages to grow the allocation or unmap and free unused pages to shrink the allocation. [dakr@kernel.org: document concurrency restrictions] Link: https://lkml.kernel.org/r/20240725125442.4957-1-dakr@kernel.org [dakr@kernel.org: disable KASAN when switching to vmalloc] Link: https://lkml.kernel.org/r/20240730185049.6244-2-dakr@kernel.org [dakr@kernel.org: properly document __GFP_ZERO behavior] Link: https://lkml.kernel.org/r/20240730185049.6244-5-dakr@kernel.org Link: https://lkml.kernel.org/r/20240722163111.4766-3-dakr@kernel.org Signed-off-by: Danilo Krummrich Acked-by: Michal Hocko Acked-by: Vlastimil Babka Cc: Chandan Babu R Cc: Christian König Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Rientjes Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim Cc: Kees Cook Cc: Marc Zyngier Cc: Michael Ellerman Cc: Miguel Ojeda Cc: Oliver Upton Cc: Pekka Enberg Cc: Roman Gushchin Cc: Uladzislau Rezki Cc: Wedson Almeida Filho Signed-off-by: Andrew Morton --- arch/arm64/kvm/nested.c | 1 - arch/powerpc/platforms/pseries/papr-vpd.c | 5 +- drivers/gpu/drm/drm_exec.c | 3 +- fs/xfs/xfs_log_recover.c | 2 +- include/linux/slab.h | 4 +- kernel/resource.c | 3 +- lib/fortify_kunit.c | 3 +- mm/util.c | 100 +++++++++++++++------- 8 files changed, 77 insertions(+), 44 deletions(-) diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c index bab27f9d8cc6..8632c05e7997 100644 --- a/arch/arm64/kvm/nested.c +++ b/arch/arm64/kvm/nested.c @@ -62,7 +62,6 @@ int kvm_vcpu_init_nested(struct kvm_vcpu *vcpu) */ num_mmus = atomic_read(&kvm->online_vcpus) * S2_MMU_PER_VCPU; tmp = kvrealloc(kvm->arch.nested_mmus, - size_mul(sizeof(*kvm->arch.nested_mmus), kvm->arch.nested_mmus_size), size_mul(sizeof(*kvm->arch.nested_mmus), num_mmus), GFP_KERNEL_ACCOUNT | __GFP_ZERO); if (!tmp) diff --git a/arch/powerpc/platforms/pseries/papr-vpd.c b/arch/powerpc/platforms/pseries/papr-vpd.c index c29e85db5f35..1574176e3ffc 100644 --- a/arch/powerpc/platforms/pseries/papr-vpd.c +++ b/arch/powerpc/platforms/pseries/papr-vpd.c @@ -156,10 +156,7 @@ static int vpd_blob_extend(struct vpd_blob *blob, const char *data, size_t len) const char *old_ptr = blob->data; char *new_ptr; - new_ptr = old_ptr ?
- kvrealloc(old_ptr, old_len, new_len, GFP_KERNEL_ACCOUNT) : - kvmalloc(len, GFP_KERNEL_ACCOUNT); - + new_ptr = kvrealloc(old_ptr, new_len, GFP_KERNEL_ACCOUNT); if (!new_ptr) return -ENOMEM; diff --git a/drivers/gpu/drm/drm_exec.c b/drivers/gpu/drm/drm_exec.c index 2da094bdf8a4..18e366cc4993 100644 --- a/drivers/gpu/drm/drm_exec.c +++ b/drivers/gpu/drm/drm_exec.c @@ -145,8 +145,7 @@ static int drm_exec_obj_locked(struct drm_exec *exec, size_t size = exec->max_objects * sizeof(void *); void *tmp; - tmp = kvrealloc(exec->objects, size, size + PAGE_SIZE, - GFP_KERNEL); + tmp = kvrealloc(exec->objects, size + PAGE_SIZE, GFP_KERNEL); if (!tmp) return -ENOMEM; diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 4423dd344239..1997981827fb 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -2128,7 +2128,7 @@ xlog_recover_add_to_cont_trans( old_ptr = item->ri_buf[item->ri_cnt-1].i_addr; old_len = item->ri_buf[item->ri_cnt-1].i_len; - ptr = kvrealloc(old_ptr, old_len, len + old_len, GFP_KERNEL); + ptr = kvrealloc(old_ptr, len + old_len, GFP_KERNEL); if (!ptr) return -ENOMEM; memcpy(&ptr[old_len], dp, len); diff --git a/include/linux/slab.h b/include/linux/slab.h index eb2bf4629157..c9cb42203183 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -841,8 +841,8 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node) #define kvcalloc_node(...) alloc_hooks(kvcalloc_node_noprof(__VA_ARGS__)) #define kvcalloc(...) alloc_hooks(kvcalloc_noprof(__VA_ARGS__)) -extern void *kvrealloc_noprof(const void *p, size_t oldsize, size_t newsize, gfp_t flags) - __realloc_size(3); +void *kvrealloc_noprof(const void *p, size_t size, gfp_t flags) + __realloc_size(2); #define kvrealloc(...) alloc_hooks(kvrealloc_noprof(__VA_ARGS__)) extern void kvfree(const void *addr); diff --git a/kernel/resource.c b/kernel/resource.c index 14777afb0a99..9f747bb7cd03 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -450,8 +450,7 @@ int walk_system_ram_res_rev(u64 start, u64 end, void *arg, /* re-alloc */ struct resource *rams_new; - rams_new = kvrealloc(rams, rams_size * sizeof(struct resource), - (rams_size + 16) * sizeof(struct resource), + rams_new = kvrealloc(rams, (rams_size + 16) * sizeof(struct resource), GFP_KERNEL); if (!rams_new) goto out; diff --git a/lib/fortify_kunit.c b/lib/fortify_kunit.c index f9ad60a9c7bd..ecb638d4cde1 100644 --- a/lib/fortify_kunit.c +++ b/lib/fortify_kunit.c @@ -306,8 +306,7 @@ DEFINE_ALLOC_SIZE_TEST_PAIR(vmalloc) orig = kvmalloc(prev_size, gfp); \ KUNIT_EXPECT_TRUE(test, orig != NULL); \ checker(((expected_pages) * PAGE_SIZE) * 2, \ - kvrealloc(orig, prev_size, \ - ((alloc_pages) * PAGE_SIZE) * 2, gfp), \ + kvrealloc(orig, ((alloc_pages) * PAGE_SIZE) * 2, gfp), \ kvfree(p)); \ } while (0) DEFINE_ALLOC_SIZE_TEST_PAIR(kvmalloc) diff --git a/mm/util.c b/mm/util.c index bd283e2132e0..ac01925a4179 100644 --- a/mm/util.c +++ b/mm/util.c @@ -608,6 +608,28 @@ unsigned long vm_mmap(struct file *file, unsigned long addr, } EXPORT_SYMBOL(vm_mmap); +static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size) +{ + /* + * We want to attempt a large physically contiguous block first because + * it is less likely to fragment multiple larger blocks and therefore + * contribute to a long term fragmentation less than vmalloc fallback. + * However make sure that larger requests are not too disruptive - no + * OOM killer and no allocation failure warnings as we have a fallback. 
+ */ + if (size > PAGE_SIZE) { + flags |= __GFP_NOWARN; + + if (!(flags & __GFP_RETRY_MAYFAIL)) + flags |= __GFP_NORETRY; + + /* nofail semantic is implemented by the vmalloc fallback */ + flags &= ~__GFP_NOFAIL; + } + + return flags; +} + /** * __kvmalloc_node - attempt to allocate physically contiguous memory, but upon * failure, fall back to non-contiguous (vmalloc) allocation. @@ -627,32 +649,15 @@ EXPORT_SYMBOL(vm_mmap); */ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node) { - gfp_t kmalloc_flags = flags; void *ret; - /* - * We want to attempt a large physically contiguous block first because - * it is less likely to fragment multiple larger blocks and therefore - * contribute to a long term fragmentation less than vmalloc fallback. - * However make sure that larger requests are not too disruptive - no - * OOM killer and no allocation failure warnings as we have a fallback. - */ - if (size > PAGE_SIZE) { - kmalloc_flags |= __GFP_NOWARN; - - if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL)) - kmalloc_flags |= __GFP_NORETRY; - - /* nofail semantic is implemented by the vmalloc fallback */ - kmalloc_flags &= ~__GFP_NOFAIL; - } - - ret = __kmalloc_node_noprof(PASS_BUCKET_PARAMS(size, b), kmalloc_flags, node); - /* * It doesn't really make sense to fallback to vmalloc for sub page * requests */ + ret = __kmalloc_node_noprof(PASS_BUCKET_PARAMS(size, b), + kmalloc_gfp_adjust(flags, size), + node); if (ret || size <= PAGE_SIZE) return ret; @@ -715,18 +720,53 @@ void kvfree_sensitive(const void *addr, size_t len) } EXPORT_SYMBOL(kvfree_sensitive); -void *kvrealloc_noprof(const void *p, size_t oldsize, size_t newsize, gfp_t flags) +/** + * kvrealloc - reallocate memory; contents remain unchanged + * @p: object to reallocate memory for + * @size: the size to reallocate + * @flags: the flags for the page level allocator + * + * If @p is %NULL, kvrealloc() behaves exactly like kvmalloc(). If @size is 0 + * and @p is not a %NULL pointer, the object pointed to is freed. + * + * If __GFP_ZERO logic is requested, callers must ensure that, starting with the + * initial memory allocation, every subsequent call to this API for the same + * memory allocation is flagged with __GFP_ZERO. Otherwise, it is possible that + * __GFP_ZERO is not fully honored by this API. + * + * In any case, the contents of the object pointed to are preserved up to the + * lesser of the new and old sizes. + * + * This function must not be called concurrently with itself or kvfree() for the + * same memory allocation. + * + * Return: pointer to the allocated memory or %NULL in case of error + */ +void *kvrealloc_noprof(const void *p, size_t size, gfp_t flags) { - void *newp; + void *n; - if (oldsize >= newsize) - return (void *)p; - newp = kvmalloc_noprof(newsize, flags); - if (!newp) - return NULL; - memcpy(newp, p, oldsize); - kvfree(p); - return newp; + if (is_vmalloc_addr(p)) + return vrealloc_noprof(p, size, flags); + + n = krealloc_noprof(p, size, kmalloc_gfp_adjust(flags, size)); + if (!n) { + /* We failed to krealloc(), fall back to kvmalloc(). */ + n = kvmalloc_noprof(size, flags); + if (!n) + return NULL; + + if (p) { + /* We already know that `p` is not a vmalloc address. 
*/ + kasan_disable_current(); + memcpy(n, kasan_reset_tag(p), ksize(p)); + kasan_enable_current(); + + kfree(p); + } + } + + return n; } EXPORT_SYMBOL(kvrealloc_noprof); From 0bedf001e359368badbdefe722162ed4dd33e296 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Mon, 22 Jul 2024 13:43:17 +0800 Subject: [PATCH 004/416] mm: shmem: simplify the suitable huge orders validation for tmpfs Patch series "Some cleanups for shmem", v3. This series does some cleanups to reuse code, rename functions and simplify logic to make code more clear. No functional changes are expected. This patch (of 3): Move the suitable huge orders validation into shmem_suitable_orders() for tmpfs, which can reuse some code to simplify the logic. In addition, we don't have special handling for the error code -E2BIG when checking for conflicts with PMD sized THP in the pagecache for tmpfs, instead, it will just fallback to order-0 allocations like this patch does, so this simplification will not add functional changes. Link: https://lkml.kernel.org/r/cover.1721626645.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/965985dd6d322929d78a0beee0dafa1c2a1b81e2.1721626645.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Reviewed-by: Ryan Roberts Acked-by: David Hildenbrand Cc: Barry Song <21cnbao@gmail.com> Cc: Hugh Dickins Cc: Lance Yang Cc: Matthew Wilcox (Oracle) Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/shmem.c | 39 +++++++++++++++------------------------ 1 file changed, 15 insertions(+), 24 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index 5a77acf6ac6a..7889b499d33f 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1680,20 +1680,30 @@ static unsigned long shmem_suitable_orders(struct inode *inode, struct vm_fault struct address_space *mapping, pgoff_t index, unsigned long orders) { - struct vm_area_struct *vma = vmf->vma; + struct vm_area_struct *vma = vmf ? vmf->vma : NULL; pgoff_t aligned_index; unsigned long pages; int order; - orders = thp_vma_suitable_orders(vma, vmf->address, orders); - if (!orders) - return 0; + if (vma) { + orders = thp_vma_suitable_orders(vma, vmf->address, orders); + if (!orders) + return 0; + } /* Find the highest order that can add into the page cache */ order = highest_order(orders); while (orders) { pages = 1UL << order; aligned_index = round_down(index, pages); + /* + * Check for conflict before waiting on a huge allocation. + * Conflict might be that a huge page has just been allocated + * and added to page cache by a racing thread, or that there + * is already at least one small page in the huge extent. + * Be careful to retry when appropriate, but not forever! + * Elsewhere -EEXIST would be the right code, but not here. + */ if (!xa_find(&mapping->i_pages, &aligned_index, aligned_index + pages - 1, XA_PRESENT)) break; @@ -1731,7 +1741,6 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf, { struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); - struct vm_area_struct *vma = vmf ? 
vmf->vma : NULL; unsigned long suitable_orders = 0; struct folio *folio = NULL; long pages; @@ -1741,26 +1750,8 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf, orders = 0; if (orders > 0) { - if (vma && vma_is_anon_shmem(vma)) { - suitable_orders = shmem_suitable_orders(inode, vmf, + suitable_orders = shmem_suitable_orders(inode, vmf, mapping, index, orders); - } else if (orders & BIT(HPAGE_PMD_ORDER)) { - pages = HPAGE_PMD_NR; - suitable_orders = BIT(HPAGE_PMD_ORDER); - index = round_down(index, HPAGE_PMD_NR); - - /* - * Check for conflict before waiting on a huge allocation. - * Conflict might be that a huge page has just been allocated - * and added to page cache by a racing thread, or that there - * is already at least one small page in the huge extent. - * Be careful to retry when appropriate, but not forever! - * Elsewhere -EEXIST would be the right code, but not here. - */ - if (xa_find(&mapping->i_pages, &index, - index + HPAGE_PMD_NR - 1, XA_PRESENT)) - return ERR_PTR(-E2BIG); - } order = highest_order(suitable_orders); while (suitable_orders) { From d58a2a581f132529eefac5377676011562b631b8 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Mon, 22 Jul 2024 13:43:18 +0800 Subject: [PATCH 005/416] mm: shmem: rename shmem_is_huge() to shmem_huge_global_enabled() shmem_is_huge() is now used to check if the top-level huge page is enabled, thus rename it to reflect its usage. Link: https://lkml.kernel.org/r/da53296e0ab6359aa083561d9dc01e4223d60fbe.1721626645.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Reviewed-by: Ryan Roberts Acked-by: David Hildenbrand Cc: Barry Song <21cnbao@gmail.com> Cc: Hugh Dickins Cc: Lance Yang Cc: Matthew Wilcox (Oracle) Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/shmem_fs.h | 9 +++++---- mm/huge_memory.c | 5 +++-- mm/shmem.c | 15 ++++++++------- 3 files changed, 16 insertions(+), 13 deletions(-) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 1d06b1e5408a..405ee8d3589a 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -111,14 +111,15 @@ extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end); int shmem_unuse(unsigned int type); #ifdef CONFIG_TRANSPARENT_HUGEPAGE -extern bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force, - struct mm_struct *mm, unsigned long vm_flags); +extern bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index, bool shmem_huge_force, + struct mm_struct *mm, unsigned long vm_flags); unsigned long shmem_allowable_huge_orders(struct inode *inode, struct vm_area_struct *vma, pgoff_t index, bool global_huge); #else -static __always_inline bool shmem_is_huge(struct inode *inode, pgoff_t index, bool shmem_huge_force, - struct mm_struct *mm, unsigned long vm_flags) +static __always_inline bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index, + bool shmem_huge_force, struct mm_struct *mm, + unsigned long vm_flags) { return false; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 67c86a5d64a6..3af9366b1c4c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -160,8 +160,9 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, * own flags. 
*/ if (!in_pf && shmem_file(vma->vm_file)) { - bool global_huge = shmem_is_huge(file_inode(vma->vm_file), vma->vm_pgoff, - !enforce_sysfs, vma->vm_mm, vm_flags); + bool global_huge = shmem_huge_global_enabled(file_inode(vma->vm_file), + vma->vm_pgoff, !enforce_sysfs, + vma->vm_mm, vm_flags); if (!vma_is_anon_shmem(vma)) return global_huge ? orders : 0; diff --git a/mm/shmem.c b/mm/shmem.c index 7889b499d33f..2a86b0d9f516 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -548,9 +548,9 @@ static bool shmem_confirm_swap(struct address_space *mapping, static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER; -static bool __shmem_is_huge(struct inode *inode, pgoff_t index, - bool shmem_huge_force, struct mm_struct *mm, - unsigned long vm_flags) +static bool __shmem_huge_global_enabled(struct inode *inode, pgoff_t index, + bool shmem_huge_force, struct mm_struct *mm, + unsigned long vm_flags) { loff_t i_size; @@ -581,14 +581,15 @@ static bool __shmem_is_huge(struct inode *inode, pgoff_t index, } } -bool shmem_is_huge(struct inode *inode, pgoff_t index, +bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index, bool shmem_huge_force, struct mm_struct *mm, unsigned long vm_flags) { if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER) return false; - return __shmem_is_huge(inode, index, shmem_huge_force, mm, vm_flags); + return __shmem_huge_global_enabled(inode, index, shmem_huge_force, + mm, vm_flags); } #if defined(CONFIG_SYSFS) @@ -1156,7 +1157,7 @@ static int shmem_getattr(struct mnt_idmap *idmap, STATX_ATTR_NODUMP); generic_fillattr(idmap, request_mask, inode, stat); - if (shmem_is_huge(inode, 0, false, NULL, 0)) + if (shmem_huge_global_enabled(inode, 0, false, NULL, 0)) stat->blksize = HPAGE_PMD_SIZE; if (request_mask & STATX_BTIME) { @@ -2149,7 +2150,7 @@ repeat: return 0; } - huge = shmem_is_huge(inode, index, false, fault_mm, + huge = shmem_huge_global_enabled(inode, index, false, fault_mm, vma ? vma->vm_flags : 0); /* Find hugepage orders that are allowed for anonymous shmem. */ if (vma && vma_is_anon_shmem(vma)) From 6beeab870e70b2d4f49baf6c6be9da1b61c169f8 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Mon, 22 Jul 2024 13:43:19 +0800 Subject: [PATCH 006/416] mm: shmem: move shmem_huge_global_enabled() into shmem_allowable_huge_orders() Move shmem_huge_global_enabled() into shmem_allowable_huge_orders(), so that shmem_allowable_huge_orders() can also help to find the allowable huge orders for tmpfs. Moreover the shmem_huge_global_enabled() can become static. While we are at it, passing the vma instead of mm for shmem_huge_global_enabled() makes code cleaner. No functional changes. 
Link: https://lkml.kernel.org/r/8e825146bb29ee1a1c7bd64d2968ff3e19be7815.1721626645.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Reviewed-by: Ryan Roberts Acked-by: David Hildenbrand Cc: Barry Song <21cnbao@gmail.com> Cc: Hugh Dickins Cc: Lance Yang Cc: Matthew Wilcox (Oracle) Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/shmem_fs.h | 12 ++-------- mm/huge_memory.c | 12 +++------- mm/shmem.c | 47 +++++++++++++++++++++++++--------------- 3 files changed, 35 insertions(+), 36 deletions(-) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 405ee8d3589a..1564d7d3ca61 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -111,21 +111,13 @@ extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end); int shmem_unuse(unsigned int type); #ifdef CONFIG_TRANSPARENT_HUGEPAGE -extern bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index, bool shmem_huge_force, - struct mm_struct *mm, unsigned long vm_flags); unsigned long shmem_allowable_huge_orders(struct inode *inode, struct vm_area_struct *vma, pgoff_t index, - bool global_huge); + bool shmem_huge_force); #else -static __always_inline bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index, - bool shmem_huge_force, struct mm_struct *mm, - unsigned long vm_flags) -{ - return false; -} static inline unsigned long shmem_allowable_huge_orders(struct inode *inode, struct vm_area_struct *vma, pgoff_t index, - bool global_huge) + bool shmem_huge_force) { return 0; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 3af9366b1c4c..1ec32c02b19c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -159,16 +159,10 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, * Must be done before hugepage flags check since shmem has its * own flags. */ - if (!in_pf && shmem_file(vma->vm_file)) { - bool global_huge = shmem_huge_global_enabled(file_inode(vma->vm_file), - vma->vm_pgoff, !enforce_sysfs, - vma->vm_mm, vm_flags); - - if (!vma_is_anon_shmem(vma)) - return global_huge ? orders : 0; + if (!in_pf && shmem_file(vma->vm_file)) return shmem_allowable_huge_orders(file_inode(vma->vm_file), - vma, vma->vm_pgoff, global_huge); - } + vma, vma->vm_pgoff, + !enforce_sysfs); if (!vma_is_anonymous(vma)) { /* diff --git a/mm/shmem.c b/mm/shmem.c index 2a86b0d9f516..4a5254bfd610 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -549,9 +549,10 @@ static bool shmem_confirm_swap(struct address_space *mapping, static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER; static bool __shmem_huge_global_enabled(struct inode *inode, pgoff_t index, - bool shmem_huge_force, struct mm_struct *mm, + bool shmem_huge_force, struct vm_area_struct *vma, unsigned long vm_flags) { + struct mm_struct *mm = vma ? 
vma->vm_mm : NULL; loff_t i_size; if (!S_ISREG(inode->i_mode)) @@ -581,15 +582,15 @@ static bool __shmem_huge_global_enabled(struct inode *inode, pgoff_t index, } } -bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index, - bool shmem_huge_force, struct mm_struct *mm, +static bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index, + bool shmem_huge_force, struct vm_area_struct *vma, unsigned long vm_flags) { if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER) return false; return __shmem_huge_global_enabled(inode, index, shmem_huge_force, - mm, vm_flags); + vma, vm_flags); } #if defined(CONFIG_SYSFS) @@ -772,6 +773,13 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo, { return 0; } + +static bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index, + bool shmem_huge_force, struct vm_area_struct *vma, + unsigned long vm_flags) +{ + return false; +} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ /* @@ -1625,22 +1633,33 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp) #ifdef CONFIG_TRANSPARENT_HUGEPAGE unsigned long shmem_allowable_huge_orders(struct inode *inode, struct vm_area_struct *vma, pgoff_t index, - bool global_huge) + bool shmem_huge_force) { unsigned long mask = READ_ONCE(huge_shmem_orders_always); unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size); - unsigned long vm_flags = vma->vm_flags; + unsigned long vm_flags = vma ? vma->vm_flags : 0; + bool global_huge; loff_t i_size; int order; - if ((vm_flags & VM_NOHUGEPAGE) || - test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) + if (vma && ((vm_flags & VM_NOHUGEPAGE) || + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))) return 0; /* If the hardware/firmware marked hugepage support disabled. */ if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED)) return 0; + global_huge = shmem_huge_global_enabled(inode, index, shmem_huge_force, + vma, vm_flags); + if (!vma || !vma_is_anon_shmem(vma)) { + /* + * For tmpfs, we now only support PMD sized THP if huge page + * is enabled, otherwise fallback to order 0. + */ + return global_huge ? BIT(HPAGE_PMD_ORDER) : 0; + } + /* * Following the 'deny' semantics of the top level, force the huge * option off from all mounts. @@ -2077,7 +2096,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, struct mm_struct *fault_mm; struct folio *folio; int error; - bool alloced, huge; + bool alloced; unsigned long orders = 0; if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping))) @@ -2150,14 +2169,8 @@ repeat: return 0; } - huge = shmem_huge_global_enabled(inode, index, false, fault_mm, - vma ? vma->vm_flags : 0); - /* Find hugepage orders that are allowed for anonymous shmem. */ - if (vma && vma_is_anon_shmem(vma)) - orders = shmem_allowable_huge_orders(inode, vma, index, huge); - else if (huge) - orders = BIT(HPAGE_PMD_ORDER); - + /* Find hugepage orders that are allowed for anonymous shmem and tmpfs. 
*/ + orders = shmem_allowable_huge_orders(inode, vma, index, false); if (orders > 0) { gfp_t huge_gfp; From fcb4824b264058f9ee1ac2aa6b7155468978fde6 Mon Sep 17 00:00:00 2001 From: Valdis Kletnieks Date: Sat, 13 Jul 2024 02:59:50 -0400 Subject: [PATCH 007/416] mm: fix typo in Kconfig Fix typo in Kconfig help Link: https://lkml.kernel.org/r/78656.1720853990@turing-police Fixes: e93d4166b40a ("mm: memcg: put cgroup v1-specific code under a config option") Signed-off-by: Valdis Kletnieks Acked-by: Roman Gushchin Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Muchun Song Signed-off-by: Andrew Morton --- init/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/init/Kconfig b/init/Kconfig index 5783a0b87517..f34371530e68 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -986,7 +986,7 @@ config MEMCG_V1 going to shrink due to deprecation process. New deployments with v1 controller are highly discouraged. - San N is unsure. + Say N if unsure. config BLK_CGROUP bool "IO controller" From 9eace7e8e60c3ac821c417bd3f3aa5949e58a57a Mon Sep 17 00:00:00 2001 From: Carlos Maiolino Date: Wed, 17 Jul 2024 08:37:27 +0200 Subject: [PATCH 008/416] shmem_quota: build the object file conditionally to the config option Initially I added shmem-quota to obj-y, move it to the correct place and remove the unneeded full file #ifdef Link: https://lkml.kernel.org/r/20240717063737.910840-1-cem@kernel.org Signed-off-by: Carlos Maiolino Suggested-by: Aristeu Rozanski Reviewed-by: Jan Kara Cc: Christian Brauner Cc: Hugh Dickins Signed-off-by: Andrew Morton --- mm/Makefile | 3 ++- mm/shmem_quota.c | 3 --- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/mm/Makefile b/mm/Makefile index d2915f8c9dc0..b11fefbc13bc 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -53,7 +53,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ readahead.o swap.o truncate.o vmscan.o shrinker.o \ shmem.o util.o mmzone.o vmstat.o backing-dev.o \ mm_init.o percpu.o slab_common.o \ - compaction.o show_mem.o shmem_quota.o\ + compaction.o show_mem.o \ interval_tree.o list_lru.o workingset.o \ debug.o gup.o mmap_lock.o $(mmu-y) @@ -141,3 +141,4 @@ obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o obj-$(CONFIG_EXECMEM) += execmem.o +obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o diff --git a/mm/shmem_quota.c b/mm/shmem_quota.c index ce514e700d2f..d1e32ac01407 100644 --- a/mm/shmem_quota.c +++ b/mm/shmem_quota.c @@ -34,8 +34,6 @@ #include #include -#ifdef CONFIG_TMPFS_QUOTA - /* * The following constants define the amount of time given a user * before the soft limits are treated as hard limits (usually resulting @@ -351,4 +349,3 @@ const struct dquot_operations shmem_quota_operations = { .mark_dirty = shmem_mark_dquot_dirty, .get_next_id = shmem_get_next_id, }; -#endif /* CONFIG_TMPFS_QUOTA */ From c2a967f6ab0ec896648c0497d3dc15d8f136b148 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Thu, 18 Jul 2024 22:25:03 -0600 Subject: [PATCH 009/416] mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO hugetlb_vmemmap_optimize_folio() and hugetlb_vmemmap_restore_folio() are wrappers meant to be called regardless of whether HVO is enabled. Therefore, they should not call synchronize_rcu(). Otherwise, it regresses use cases not enabling HVO. So move synchronize_rcu() to __hugetlb_vmemmap_optimize_folio() and __hugetlb_vmemmap_restore_folio(), and call it once for each batch of folios when HVO is enabled. 
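For reference, the once-per-batch behaviour is a small flag-clearing pattern; condensed from the mm/hugetlb_vmemmap.c hunks below, the batched restore loop reads roughly:

        unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;

        list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
                if (folio_test_hugetlb_vmemmap_optimized(folio)) {
                        ret = __hugetlb_vmemmap_restore_folio(h, folio, flags);
                        /* only the first folio of the batch pays for synchronize_rcu() */
                        flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
                        ...
                }
        }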
Link: https://lkml.kernel.org/r/20240719042503.2752316-1-yuzhao@google.com Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers") Signed-off-by: Yu Zhao Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202407091001.1250ad4a-oliver.sang@intel.com Reported-by: Janosch Frank Tested-by: Marc Hartmayer Acked-by: Muchun Song Signed-off-by: Andrew Morton --- mm/hugetlb_vmemmap.c | 40 ++++++++++++++++++++-------------------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index 0c3f56b3578e..57b7f591eee8 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -43,6 +43,8 @@ struct vmemmap_remap_walk { #define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0) /* Skip the TLB flush when we remap the PTE */ #define VMEMMAP_REMAP_NO_TLB_FLUSH BIT(1) +/* synchronize_rcu() to avoid writes from page_ref_add_unless() */ +#define VMEMMAP_SYNCHRONIZE_RCU BIT(2) unsigned long flags; }; @@ -457,6 +459,9 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h, if (!folio_test_hugetlb_vmemmap_optimized(folio)) return 0; + if (flags & VMEMMAP_SYNCHRONIZE_RCU) + synchronize_rcu(); + vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h); vmemmap_reuse = vmemmap_start; vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE; @@ -489,10 +494,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h, */ int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio) { - /* avoid writes from page_ref_add_unless() while unfolding vmemmap */ - synchronize_rcu(); - - return __hugetlb_vmemmap_restore_folio(h, folio, 0); + return __hugetlb_vmemmap_restore_folio(h, folio, VMEMMAP_SYNCHRONIZE_RCU); } /** @@ -515,14 +517,14 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h, struct folio *folio, *t_folio; long restored = 0; long ret = 0; - - /* avoid writes from page_ref_add_unless() while unfolding vmemmap */ - synchronize_rcu(); + unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU; list_for_each_entry_safe(folio, t_folio, folio_list, lru) { if (folio_test_hugetlb_vmemmap_optimized(folio)) { - ret = __hugetlb_vmemmap_restore_folio(h, folio, - VMEMMAP_REMAP_NO_TLB_FLUSH); + ret = __hugetlb_vmemmap_restore_folio(h, folio, flags); + /* only need to synchronize_rcu() once for each batch */ + flags &= ~VMEMMAP_SYNCHRONIZE_RCU; + if (ret) break; restored++; @@ -570,6 +572,9 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h, return ret; static_branch_inc(&hugetlb_optimize_vmemmap_key); + + if (flags & VMEMMAP_SYNCHRONIZE_RCU) + synchronize_rcu(); /* * Very Subtle * If VMEMMAP_REMAP_NO_TLB_FLUSH is set, TLB flushing is not performed @@ -617,10 +622,7 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio) { LIST_HEAD(vmemmap_pages); - /* avoid writes from page_ref_add_unless() while folding vmemmap */ - synchronize_rcu(); - - __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, 0); + __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, VMEMMAP_SYNCHRONIZE_RCU); free_vmemmap_page_list(&vmemmap_pages); } @@ -647,6 +649,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l { struct folio *folio; LIST_HEAD(vmemmap_pages); + unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU; list_for_each_entry(folio, folio_list, lru) { int ret = hugetlb_vmemmap_split_folio(h, folio); @@ -663,14 +666,12 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head 
*folio_l flush_tlb_all(); - /* avoid writes from page_ref_add_unless() while folding vmemmap */ - synchronize_rcu(); - list_for_each_entry(folio, folio_list, lru) { int ret; - ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, - VMEMMAP_REMAP_NO_TLB_FLUSH); + ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags); + /* only need to synchronize_rcu() once for each batch */ + flags &= ~VMEMMAP_SYNCHRONIZE_RCU; /* * Pages to be freed may have been accumulated. If we @@ -684,8 +685,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l flush_tlb_all(); free_vmemmap_page_list(&vmemmap_pages); INIT_LIST_HEAD(&vmemmap_pages); - __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, - VMEMMAP_REMAP_NO_TLB_FLUSH); + __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags); } } From c39542732a3d84e851a0ab8ff3106ed3138174a0 Mon Sep 17 00:00:00 2001 From: Peng Hao Date: Tue, 23 Jul 2024 11:55:13 +0800 Subject: [PATCH 010/416] mm/damon/lru_sort: adjust local variable to dynamic allocation When KASAN is enabled and built with clang: mm/damon/lru_sort.c:199:12: error: stack frame size (2328) exceeds limit (2048) in 'damon_lru_sort_apply_parameters' [-Werror,-Wframe-larger-than] static int damon_lru_sort_apply_parameters(void) ^ 1 error generated. This is because damon_lru_sort_quota contains a large array, and assigning this variable to a local variable causes a large amount of stack space to be occupied. So adjust local variable to dynamic allocation. Link: https://lkml.kernel.org/r/20240723035513.20153-1-flyingpeng@tencent.com Signed-off-by: Peng Hao Reviewed-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/lru_sort.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c index 4af8fd4a390b..0b35bd5fb659 100644 --- a/mm/damon/lru_sort.c +++ b/mm/damon/lru_sort.c @@ -148,12 +148,17 @@ static struct damon_target *target; static struct damos *damon_lru_sort_new_scheme( struct damos_access_pattern *pattern, enum damos_action action) { - struct damos_quota quota = damon_lru_sort_quota; + struct damos *damos; + struct damos_quota *quota = kmemdup(&damon_lru_sort_quota, + sizeof(damon_lru_sort_quota), GFP_KERNEL); + + if (!quota) + return NULL; /* Use half of total quota for hot/cold pages sorting */ - quota.ms = quota.ms / 2; + quota->ms = quota->ms / 2; - return damon_new_scheme( + damos = damon_new_scheme( /* find the pattern, and */ pattern, /* (de)prioritize on LRU-lists */ @@ -161,10 +166,12 @@ static struct damos *damon_lru_sort_new_scheme( /* for each aggregation interval */ 0, /* under the quota. */ - "a, + quota, /* (De)activate this according to the watermarks. */ &damon_lru_sort_wmarks, NUMA_NO_NODE); + kfree(quota); + return damos; } /* Create a DAMON-based operation scheme for hot memory regions */ From 478729533edaa6700ba681b7a71e43715ffe6ccb Mon Sep 17 00:00:00 2001 From: Josef Bacik Date: Thu, 18 Jul 2024 17:26:06 -0400 Subject: [PATCH 011/416] mm: cleanup flags usage in faultin_page Patch series "mm: some small page fault cleanups". I was recently wreaking havoc in the page fault code and I noticed some things that could be cleaned up. We no longer modify the gup flags in faultin_page, so we can clean up how we pass the flags in and remove the extra variable in __get_user_pages. This patch (of 2): We're passing a pointer to the foll_flags for faultin_page, however we never modify the flags in this call. 
Change this to just take the flags value instead. Link: https://lkml.kernel.org/r/2df51a54c06bdf93e1cb09a19a9ef1df6557b59e.1721337845.git.josef@toxicpanda.com Signed-off-by: Josef Bacik Acked-by: David Hildenbrand Signed-off-by: Andrew Morton --- mm/gup.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 54d0dc3831fb..985d5141592f 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1153,19 +1153,19 @@ unmap: * to 0 and -EBUSY returned. */ static int faultin_page(struct vm_area_struct *vma, - unsigned long address, unsigned int *flags, bool unshare, + unsigned long address, unsigned int flags, bool unshare, int *locked) { unsigned int fault_flags = 0; vm_fault_t ret; - if (*flags & FOLL_NOFAULT) + if (flags & FOLL_NOFAULT) return -EFAULT; - if (*flags & FOLL_WRITE) + if (flags & FOLL_WRITE) fault_flags |= FAULT_FLAG_WRITE; - if (*flags & FOLL_REMOTE) + if (flags & FOLL_REMOTE) fault_flags |= FAULT_FLAG_REMOTE; - if (*flags & FOLL_UNLOCKABLE) { + if (flags & FOLL_UNLOCKABLE) { fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; /* * FAULT_FLAG_INTERRUPTIBLE is opt-in. GUP callers must set @@ -1173,12 +1173,12 @@ static int faultin_page(struct vm_area_struct *vma, * That's because some callers may not be prepared to * handle early exits caused by non-fatal signals. */ - if (*flags & FOLL_INTERRUPTIBLE) + if (flags & FOLL_INTERRUPTIBLE) fault_flags |= FAULT_FLAG_INTERRUPTIBLE; } - if (*flags & FOLL_NOWAIT) + if (flags & FOLL_NOWAIT) fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT; - if (*flags & FOLL_TRIED) { + if (flags & FOLL_TRIED) { /* * Note: FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED * can co-exist @@ -1212,7 +1212,7 @@ static int faultin_page(struct vm_area_struct *vma, } if (ret & VM_FAULT_ERROR) { - int err = vm_fault_to_errno(ret, *flags); + int err = vm_fault_to_errno(ret, flags); if (err) return err; @@ -1490,7 +1490,7 @@ retry: page = follow_page_mask(vma, start, foll_flags, &ctx); if (!page || PTR_ERR(page) == -EMLINK) { - ret = faultin_page(vma, start, &foll_flags, + ret = faultin_page(vma, start, foll_flags, PTR_ERR(page) == -EMLINK, locked); switch (ret) { case 0: From dc21e70079ffcc4b8f14603cdd120713f0400dd3 Mon Sep 17 00:00:00 2001 From: Josef Bacik Date: Thu, 18 Jul 2024 17:26:07 -0400 Subject: [PATCH 012/416] mm: remove foll_flags in __get_user_pages Now that we're not passing around a pointer to the flags, there's no reason to have an extra variable for the gup_flags, simply pass the gup_flags directly everywhere. Link: https://lkml.kernel.org/r/1e79b84bd30287cc9847f2aeb002374e6e60a10f.1721337845.git.josef@toxicpanda.com Signed-off-by: Josef Bacik Acked-by: David Hildenbrand Signed-off-by: Andrew Morton --- mm/gup.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 985d5141592f..120740cf5a34 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1437,7 +1437,6 @@ static long __get_user_pages(struct mm_struct *mm, do { struct page *page; - unsigned int foll_flags = gup_flags; unsigned int page_increm; /* first iteration or cross vma bound */ @@ -1488,9 +1487,9 @@ retry: } cond_resched(); - page = follow_page_mask(vma, start, foll_flags, &ctx); + page = follow_page_mask(vma, start, gup_flags, &ctx); if (!page || PTR_ERR(page) == -EMLINK) { - ret = faultin_page(vma, start, foll_flags, + ret = faultin_page(vma, start, gup_flags, PTR_ERR(page) == -EMLINK, locked); switch (ret) { case 0: @@ -1547,13 +1546,12 @@ next_page: * large folio, this should never fail. 
*/ if (try_grab_folio(folio, page_increm - 1, - foll_flags)) { + gup_flags)) { /* * Release the 1st page ref if the * folio is problematic, fail hard. */ - gup_put_folio(folio, 1, - foll_flags); + gup_put_folio(folio, 1, gup_flags); ret = -EFAULT; goto out; } From 4fd568faf6e7e21f206cd66dd4fca9a72e7d922c Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 18 Jul 2024 17:18:21 +0800 Subject: [PATCH 013/416] mm: kmem: remove mem_cgroup_from_obj() There is no user of mem_cgroup_from_obj(), remove it. Link: https://lkml.kernel.org/r/20240718091821.44740-1-songmuchun@bytedance.com Signed-off-by: Muchun Song Acked-by: Shakeel Butt Acked-by: Roman Gushchin Cc: Johannes Weiner Cc: Michal Hocko Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 6 ------ mm/memcontrol.c | 32 +------------------------------- 2 files changed, 1 insertion(+), 37 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0e5bf25d324f..af7da7bd00af 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1717,7 +1717,6 @@ static inline int memcg_kmem_id(struct mem_cgroup *memcg) return memcg ? memcg->kmemcg_id : -1; } -struct mem_cgroup *mem_cgroup_from_obj(void *p); struct mem_cgroup *mem_cgroup_from_slab_obj(void *p); static inline void count_objcg_event(struct obj_cgroup *objcg, @@ -1780,11 +1779,6 @@ static inline int memcg_kmem_id(struct mem_cgroup *memcg) return -1; } -static inline struct mem_cgroup *mem_cgroup_from_obj(void *p) -{ - return NULL; -} - static inline struct mem_cgroup *mem_cgroup_from_slab_obj(void *p) { return NULL; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f29157288b7d..f0297e7c3d12 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2446,37 +2446,7 @@ struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p) /* * Returns a pointer to the memory cgroup to which the kernel object is charged. - * - * A passed kernel object can be a slab object, vmalloc object or a generic - * kernel page, so different mechanisms for getting the memory cgroup pointer - * should be used. - * - * In certain cases (e.g. kernel stacks or large kmallocs with SLUB) the caller - * can not know for sure how the kernel object is implemented. - * mem_cgroup_from_obj() can be safely used in such cases. - * - * The caller must ensure the memcg lifetime, e.g. by taking rcu_read_lock(), - * cgroup_mutex, etc. - */ -struct mem_cgroup *mem_cgroup_from_obj(void *p) -{ - struct folio *folio; - - if (mem_cgroup_disabled()) - return NULL; - - if (unlikely(is_vmalloc_addr(p))) - folio = page_folio(vmalloc_to_page(p)); - else - folio = virt_to_folio(p); - - return mem_cgroup_from_obj_folio(folio, p); -} - -/* - * Returns a pointer to the memory cgroup to which the kernel object is charged. - * Similar to mem_cgroup_from_obj(), but faster and not suitable for objects, - * allocated using vmalloc(). + * It is not suitable for objects allocated using vmalloc(). * * A passed kernel object must be a slab object or a generic kernel page. * From d2539ed7ee3b042e4503c304603d0eaa50c9c476 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Wed, 24 Jul 2024 14:00:56 +1200 Subject: [PATCH 014/416] mm: extend 'usage' parameter so that cluster_swap_free_nr() can be reused Extend a usage parameter so that cluster_swap_free_nr() can be reused by both swapcache_clear() and swap_free(). __swap_entry_free() is quite similar but more tricky as it requires the return value of __swap_entry_free_locked() which cluster_swap_free_nr() doesn't support. 
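For reference, condensed from the mm/swapfile.c hunks below, the two callers end up differing only in the usage value they hand to the shared helper:

        /* swap_free_nr(): drop one swap map reference per entry */
        cluster_swap_free_nr(sis, offset, nr, 1);

        /* swapcache_clear(): drop the SWAP_HAS_CACHE reference instead */
        cluster_swap_free_nr(si, offset, 1, SWAP_HAS_CACHE);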
Link: https://lkml.kernel.org/r/20240724020056.65838-1-21cnbao@gmail.com Signed-off-by: Barry Song Reviewed-by: Ryan Roberts Acked-by: Chris Li Cc: "Huang, Ying" Cc: Kairui Song Cc: David Hildenbrand Cc: Chuanhua Han Signed-off-by: Andrew Morton --- mm/swapfile.c | 15 +++++---------- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 38bdc439651a..5f73a8553371 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1344,7 +1344,8 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) } static void cluster_swap_free_nr(struct swap_info_struct *sis, - unsigned long offset, int nr_pages) + unsigned long offset, int nr_pages, + unsigned char usage) { struct swap_cluster_info *ci; DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 }; @@ -1354,7 +1355,7 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis, while (nr_pages) { nr = min(BITS_PER_LONG, nr_pages); for (i = 0; i < nr; i++) { - if (!__swap_entry_free_locked(sis, offset + i, 1)) + if (!__swap_entry_free_locked(sis, offset + i, usage)) bitmap_set(to_free, i, 1); } if (!bitmap_empty(to_free, BITS_PER_LONG)) { @@ -1388,7 +1389,7 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) while (nr_pages) { nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); - cluster_swap_free_nr(sis, offset, nr); + cluster_swap_free_nr(sis, offset, nr, 1); offset += nr; nr_pages -= nr; } @@ -3472,15 +3473,9 @@ int swapcache_prepare(swp_entry_t entry) void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) { - struct swap_cluster_info *ci; unsigned long offset = swp_offset(entry); - unsigned char usage; - ci = lock_cluster_or_swap_info(si, offset); - usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE); - unlock_cluster_or_swap_info(si, ci); - if (!usage) - free_swap_slot(entry); + cluster_swap_free_nr(si, offset, 1, SWAP_HAS_CACHE); } struct swap_info_struct *swp_swap_info(swp_entry_t entry) From 3eb2091c653480e02e9c2a114c4725ec654bca63 Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Wed, 24 Jul 2024 09:01:13 -0400 Subject: [PATCH 015/416] memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page() Patch series "Various memory tiering fixes", v3. This patch (of 3): last_cpupid is only available when memory tiering is off or the folio is in toptier node. Complete the check to read last_cpupid when it is available. Before the fix, the default last_cpupid will be used even if memory tiering mode is turned off at runtime instead of the actual value. This can prevent task_numa_fault() from getting right numa fault stats, but should not cause any crash. User might see performance changes after the fix. 
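For reference, the completed check in do_huge_pmd_numa_page() (see the mm/huge_memory.c hunk below) reads the stored value only when it really is a cpupid, i.e. when memory tiering is disabled or the folio sits in a top-tier node; otherwise the field holds an access timestamp and the default value is kept:

        if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
            node_is_toptier(nid))
                last_cpupid = folio_last_cpupid(folio);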
Link: https://lkml.kernel.org/r/20240724130115.793641-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20240724130115.793641-2-ziy@nvidia.com Fixes: 33024536bafd ("memory tiering: hot page selection with hint page fault latency") Signed-off-by: Zi Yan Reported-by: David Hildenbrand Closes: https://lore.kernel.org/linux-mm/9af34a6b-ca56-4a64-8aa6-ade65f109288@redhat.com/ Reviewed-by: "Huang, Ying" Reviewed-by: Baolin Wang Acked-by: David Hildenbrand Reviewed-by: Kefeng Wang Cc: Lorenzo Stoakes Signed-off-by: Andrew Morton --- mm/huge_memory.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1ec32c02b19c..632a92edec4e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1707,7 +1707,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) * For memory tiering mode, cpupid of slow memory page is used * to record page access time. So use default value. */ - if (node_is_toptier(nid)) + if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) || + node_is_toptier(nid)) last_cpupid = folio_last_cpupid(folio); target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags); if (target_nid == NUMA_NO_NODE) From 2a28713a67fd28eedc13b32300df6b9c5a2381f5 Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Wed, 24 Jul 2024 09:01:14 -0400 Subject: [PATCH 016/416] memory tiering: introduce folio_use_access_time() check If memory tiering mode is on and a folio is not in the top tier memory, folio's cpupid field is repurposed to store page access time. Instead of an open coded check, use a function to encapsulate the check. Link: https://lkml.kernel.org/r/20240724130115.793641-3-ziy@nvidia.com Signed-off-by: Zi Yan Reviewed-by: "Huang, Ying" Acked-by: David Hildenbrand Reviewed-by: Kefeng Wang Cc: Baolin Wang Cc: Lorenzo Stoakes Signed-off-by: Andrew Morton --- include/linux/mm.h | 6 ++++++ kernel/sched/fair.c | 3 +-- mm/huge_memory.c | 6 ++---- mm/memory-tiers.c | 19 +++++++++++++++++++ mm/memory.c | 3 +-- mm/mprotect.c | 3 +-- 6 files changed, 30 insertions(+), 10 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 6549d0979b28..6c2078ca3391 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1745,6 +1745,8 @@ static inline void vma_set_access_pid_bit(struct vm_area_struct *vma) __set_bit(pid_bit, &vma->numab_state->pids_active[1]); } } + +bool folio_use_access_time(struct folio *folio); #else /* !CONFIG_NUMA_BALANCING */ static inline int folio_xchg_last_cpupid(struct folio *folio, int cpupid) { @@ -1798,6 +1800,10 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid) static inline void vma_set_access_pid_bit(struct vm_area_struct *vma) { } +static inline bool folio_use_access_time(struct folio *folio) +{ + return false; +} #endif /* CONFIG_NUMA_BALANCING */ #if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9057584ec06d..416e29b56cc4 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1840,8 +1840,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, * The pages in slow memory node should be migrated according * to hot/cold instead of private/shared. 
*/ - if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING && - !node_is_toptier(src_nid)) { + if (folio_use_access_time(folio)) { struct pglist_data *pgdat; unsigned long rate_limit; unsigned int latency, th, def_th; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 632a92edec4e..29e76d285e4b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1707,8 +1707,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) * For memory tiering mode, cpupid of slow memory page is used * to record page access time. So use default value. */ - if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) || - node_is_toptier(nid)) + if (!folio_use_access_time(folio)) last_cpupid = folio_last_cpupid(folio); target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags); if (target_nid == NUMA_NO_NODE) @@ -2058,8 +2057,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, toptier) goto unlock; - if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING && - !toptier) + if (folio_use_access_time(folio)) folio_xchg_access_time(folio, jiffies_to_msecs(jiffies)); } diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 4775b3a3dabe..2a642ea86cb2 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -6,6 +6,7 @@ #include #include #include +#include #include "internal.h" @@ -50,6 +51,24 @@ static const struct bus_type memory_tier_subsys = { .dev_name = "memory_tier", }; +#ifdef CONFIG_NUMA_BALANCING +/** + * folio_use_access_time - check if a folio reuses cpupid for page access time + * @folio: folio to check + * + * folio's _last_cpupid field is repurposed by memory tiering. In memory + * tiering mode, cpupid of slow memory folio (not toptier memory) is used to + * record page access time. + * + * Return: the folio _last_cpupid is used to record page access time + */ +bool folio_use_access_time(struct folio *folio) +{ + return (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && + !node_is_toptier(folio_nid(folio)); +} +#endif + #ifdef CONFIG_MIGRATION static int top_tier_adistance; /* diff --git a/mm/memory.c b/mm/memory.c index 3c01d68065be..a3d955098a19 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5337,8 +5337,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) * For memory tiering mode, cpupid of slow memory page is used * to record page access time. So use default value. */ - if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && - !node_is_toptier(nid)) + if (folio_use_access_time(folio)) last_cpupid = (-1 & LAST_CPUPID_MASK); else last_cpupid = folio_last_cpupid(folio); diff --git a/mm/mprotect.c b/mm/mprotect.c index 222ab434da54..37cf8d249405 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -161,8 +161,7 @@ static long change_pte_range(struct mmu_gather *tlb, if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) && toptier) continue; - if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING && - !toptier) + if (folio_use_access_time(folio)) folio_xchg_access_time(folio, jiffies_to_msecs(jiffies)); } From ac59a1f0146f46bad7d5f8d1b20756ece43122ec Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Wed, 24 Jul 2024 09:01:15 -0400 Subject: [PATCH 017/416] memory tiering: count PGPROMOTE_SUCCESS when mem tiering is enabled. memory tiering can be enabled/disabled at runtime and sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING is used to check it. In migrate_misplaced_folio(), the check is missing when PGPROMOTE_SUCCESS is incremented. Add the missing check. 
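With the check in place, pgpromote_success in /proc/vmstat should stay flat while NUMA_BALANCING_MEMORY_TIERING is disabled. A minimal userspace snippet for watching the counter, given here purely for illustration:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          char line[256];
          FILE *f = fopen("/proc/vmstat", "r");

          if (!f)
                  return 1;
          while (fgets(line, sizeof(line), f))
                  if (!strncmp(line, "pgpromote_success ", 18))
                          fputs(line, stdout);   /* e.g. "pgpromote_success 1234" */
          fclose(f);
          return 0;
  }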
Link: https://lkml.kernel.org/r/20240724130115.793641-4-ziy@nvidia.com Fixes: 33024536bafd ("memory tiering: hot page selection with hint page fault latency") Reported-by: Kefeng Wang Closes: https://lore.kernel.org/linux-mm/f4ae2c9c-fe40-4807-bdb2-64cf2d716c1a@huawei.com/ Signed-off-by: Zi Yan Acked-by: David Hildenbrand Reviewed-by: Kefeng Wang Cc: Baolin Wang Cc: "Huang, Ying" Cc: Lorenzo Stoakes Signed-off-by: Andrew Morton --- mm/migrate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index 923ea80ba744..8578a930cad1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2636,7 +2636,9 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma, putback_movable_pages(&migratepages); if (nr_succeeded) { count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded); - if (!node_is_toptier(folio_nid(folio)) && node_is_toptier(node)) + if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) + && !node_is_toptier(folio_nid(folio)) + && node_is_toptier(node)) mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded); } From b03484c2a7a260cafe1bb1cb65cbcf66f2c4b0eb Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Wed, 10 Jul 2024 20:13:13 -0600 Subject: [PATCH 018/416] mm/swap: reduce indentation level Patch series "mm/swap: remove boilerplate". This patch (of 5): Use folio_activate() as an example: Before this series ------------------ if (!folio_test_active(folio) && !folio_test_unevictable(folio)) { struct folio_batch *fbatch; folio_get(folio); if (!folio_test_clear_lru(folio)) { folio_put(folio); return; } local_lock(&cpu_fbatches.lock); fbatch = this_cpu_ptr(&cpu_fbatches.activate); folio_batch_add_and_move(fbatch, folio, folio_activate_fn); local_unlock(&cpu_fbatches.lock); } } After this series ----------------- void folio_activate(struct folio *folio) { if (folio_test_active(folio) || folio_test_unevictable(folio)) return; folio_batch_add_and_move(folio, lru_activate, true); } And this is applied to all 6 folio_batch handlers in mm/swap.c. bloat-o-meter ------------- add/remove: 12/13 grow/shrink: 3/2 up/down: 4653/-4721 (-68) ... Total: Before=28083019, After=28082951, chg -0.00% This patch (of 5): Reduce indentation level by returning directly when there is no cleanup needed, i.e., if (condition) { | if (condition) { do_this(); | do_this(); return; | return; } else { | } do_that(); | } | do_that(); and if (condition) { | if (!condition) do_this(); | return; do_that(); | } | do_this(); return; | do_that(); Presumably the old style became repetitive as the result of copy and paste. 
Link: https://lkml.kernel.org/r/20240711021317.596178-1-yuzhao@google.com Link: https://lkml.kernel.org/r/20240711021317.596178-2-yuzhao@google.com Signed-off-by: Yu Zhao Signed-off-by: Andrew Morton --- mm/swap.c | 205 ++++++++++++++++++++++++++++-------------------------- 1 file changed, 107 insertions(+), 98 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 9caf6b017cf0..952e4aac6eb1 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -117,7 +117,9 @@ void __folio_put(struct folio *folio) if (unlikely(folio_is_zone_device(folio))) { free_zone_device_folio(folio); return; - } else if (folio_test_hugetlb(folio)) { + } + + if (folio_test_hugetlb(folio)) { free_huge_folio(folio); return; } @@ -228,17 +230,19 @@ static void folio_batch_add_and_move(struct folio_batch *fbatch, if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) && !lru_cache_disabled()) return; + folio_batch_move_lru(fbatch, move_fn); } static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio) { - if (!folio_test_unevictable(folio)) { - lruvec_del_folio(lruvec, folio); - folio_clear_active(folio); - lruvec_add_folio_tail(lruvec, folio); - __count_vm_events(PGROTATED, folio_nr_pages(folio)); - } + if (folio_test_unevictable(folio)) + return; + + lruvec_del_folio(lruvec, folio); + folio_clear_active(folio); + lruvec_add_folio_tail(lruvec, folio); + __count_vm_events(PGROTATED, folio_nr_pages(folio)); } /* @@ -250,22 +254,23 @@ static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio) */ void folio_rotate_reclaimable(struct folio *folio) { - if (!folio_test_locked(folio) && !folio_test_dirty(folio) && - !folio_test_unevictable(folio)) { - struct folio_batch *fbatch; - unsigned long flags; + struct folio_batch *fbatch; + unsigned long flags; - folio_get(folio); - if (!folio_test_clear_lru(folio)) { - folio_put(folio); - return; - } + if (folio_test_locked(folio) || folio_test_dirty(folio) || + folio_test_unevictable(folio)) + return; - local_lock_irqsave(&lru_rotate.lock, flags); - fbatch = this_cpu_ptr(&lru_rotate.fbatch); - folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn); - local_unlock_irqrestore(&lru_rotate.lock, flags); + folio_get(folio); + if (!folio_test_clear_lru(folio)) { + folio_put(folio); + return; } + + local_lock_irqsave(&lru_rotate.lock, flags); + fbatch = this_cpu_ptr(&lru_rotate.fbatch); + folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn); + local_unlock_irqrestore(&lru_rotate.lock, flags); } void lru_note_cost(struct lruvec *lruvec, bool file, @@ -328,18 +333,19 @@ void lru_note_cost_refault(struct folio *folio) static void folio_activate_fn(struct lruvec *lruvec, struct folio *folio) { - if (!folio_test_active(folio) && !folio_test_unevictable(folio)) { - long nr_pages = folio_nr_pages(folio); + long nr_pages = folio_nr_pages(folio); - lruvec_del_folio(lruvec, folio); - folio_set_active(folio); - lruvec_add_folio(lruvec, folio); - trace_mm_lru_activate(folio); + if (folio_test_active(folio) || folio_test_unevictable(folio)) + return; - __count_vm_events(PGACTIVATE, nr_pages); - __count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE, - nr_pages); - } + + lruvec_del_folio(lruvec, folio); + folio_set_active(folio); + lruvec_add_folio(lruvec, folio); + trace_mm_lru_activate(folio); + + __count_vm_events(PGACTIVATE, nr_pages); + __count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE, nr_pages); } #ifdef CONFIG_SMP @@ -353,20 +359,21 @@ static void folio_activate_drain(int cpu) void folio_activate(struct folio *folio) { - if (!folio_test_active(folio) && 
!folio_test_unevictable(folio)) { - struct folio_batch *fbatch; + struct folio_batch *fbatch; - folio_get(folio); - if (!folio_test_clear_lru(folio)) { - folio_put(folio); - return; - } + if (folio_test_active(folio) || folio_test_unevictable(folio)) + return; - local_lock(&cpu_fbatches.lock); - fbatch = this_cpu_ptr(&cpu_fbatches.activate); - folio_batch_add_and_move(fbatch, folio, folio_activate_fn); - local_unlock(&cpu_fbatches.lock); + folio_get(folio); + if (!folio_test_clear_lru(folio)) { + folio_put(folio); + return; } + + local_lock(&cpu_fbatches.lock); + fbatch = this_cpu_ptr(&cpu_fbatches.activate); + folio_batch_add_and_move(fbatch, folio, folio_activate_fn); + local_unlock(&cpu_fbatches.lock); } #else @@ -378,12 +385,13 @@ void folio_activate(struct folio *folio) { struct lruvec *lruvec; - if (folio_test_clear_lru(folio)) { - lruvec = folio_lruvec_lock_irq(folio); - folio_activate_fn(lruvec, folio); - unlock_page_lruvec_irq(lruvec); - folio_set_lru(folio); - } + if (!folio_test_clear_lru(folio)) + return; + + lruvec = folio_lruvec_lock_irq(folio); + folio_activate_fn(lruvec, folio); + unlock_page_lruvec_irq(lruvec); + folio_set_lru(folio); } #endif @@ -610,41 +618,41 @@ static void lru_deactivate_file_fn(struct lruvec *lruvec, struct folio *folio) static void lru_deactivate_fn(struct lruvec *lruvec, struct folio *folio) { - if (!folio_test_unevictable(folio) && (folio_test_active(folio) || lru_gen_enabled())) { - long nr_pages = folio_nr_pages(folio); + long nr_pages = folio_nr_pages(folio); - lruvec_del_folio(lruvec, folio); - folio_clear_active(folio); - folio_clear_referenced(folio); - lruvec_add_folio(lruvec, folio); + if (folio_test_unevictable(folio) || !(folio_test_active(folio) || lru_gen_enabled())) + return; - __count_vm_events(PGDEACTIVATE, nr_pages); - __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, - nr_pages); - } + lruvec_del_folio(lruvec, folio); + folio_clear_active(folio); + folio_clear_referenced(folio); + lruvec_add_folio(lruvec, folio); + + __count_vm_events(PGDEACTIVATE, nr_pages); + __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_pages); } static void lru_lazyfree_fn(struct lruvec *lruvec, struct folio *folio) { - if (folio_test_anon(folio) && folio_test_swapbacked(folio) && - !folio_test_swapcache(folio) && !folio_test_unevictable(folio)) { - long nr_pages = folio_nr_pages(folio); + long nr_pages = folio_nr_pages(folio); - lruvec_del_folio(lruvec, folio); - folio_clear_active(folio); - folio_clear_referenced(folio); - /* - * Lazyfree folios are clean anonymous folios. They have - * the swapbacked flag cleared, to distinguish them from normal - * anonymous folios - */ - folio_clear_swapbacked(folio); - lruvec_add_folio(lruvec, folio); + if (!folio_test_anon(folio) || !folio_test_swapbacked(folio) || + folio_test_swapcache(folio) || folio_test_unevictable(folio)) + return; - __count_vm_events(PGLAZYFREE, nr_pages); - __count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, - nr_pages); - } + lruvec_del_folio(lruvec, folio); + folio_clear_active(folio); + folio_clear_referenced(folio); + /* + * Lazyfree folios are clean anonymous folios. 
They have + * the swapbacked flag cleared, to distinguish them from normal + * anonymous folios + */ + folio_clear_swapbacked(folio); + lruvec_add_folio(lruvec, folio); + + __count_vm_events(PGLAZYFREE, nr_pages); + __count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, nr_pages); } /* @@ -726,21 +734,21 @@ void deactivate_file_folio(struct folio *folio) */ void folio_deactivate(struct folio *folio) { - if (!folio_test_unevictable(folio) && (folio_test_active(folio) || - lru_gen_enabled())) { - struct folio_batch *fbatch; + struct folio_batch *fbatch; - folio_get(folio); - if (!folio_test_clear_lru(folio)) { - folio_put(folio); - return; - } + if (folio_test_unevictable(folio) || !(folio_test_active(folio) || lru_gen_enabled())) + return; - local_lock(&cpu_fbatches.lock); - fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate); - folio_batch_add_and_move(fbatch, folio, lru_deactivate_fn); - local_unlock(&cpu_fbatches.lock); + folio_get(folio); + if (!folio_test_clear_lru(folio)) { + folio_put(folio); + return; } + + local_lock(&cpu_fbatches.lock); + fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate); + folio_batch_add_and_move(fbatch, folio, lru_deactivate_fn); + local_unlock(&cpu_fbatches.lock); } /** @@ -752,21 +760,22 @@ void folio_deactivate(struct folio *folio) */ void folio_mark_lazyfree(struct folio *folio) { - if (folio_test_anon(folio) && folio_test_swapbacked(folio) && - !folio_test_swapcache(folio) && !folio_test_unevictable(folio)) { - struct folio_batch *fbatch; + struct folio_batch *fbatch; - folio_get(folio); - if (!folio_test_clear_lru(folio)) { - folio_put(folio); - return; - } + if (!folio_test_anon(folio) || !folio_test_swapbacked(folio) || + folio_test_swapcache(folio) || folio_test_unevictable(folio)) + return; - local_lock(&cpu_fbatches.lock); - fbatch = this_cpu_ptr(&cpu_fbatches.lru_lazyfree); - folio_batch_add_and_move(fbatch, folio, lru_lazyfree_fn); - local_unlock(&cpu_fbatches.lock); + folio_get(folio); + if (!folio_test_clear_lru(folio)) { + folio_put(folio); + return; } + + local_lock(&cpu_fbatches.lock); + fbatch = this_cpu_ptr(&cpu_fbatches.lru_lazyfree); + folio_batch_add_and_move(fbatch, folio, lru_lazyfree_fn); + local_unlock(&cpu_fbatches.lock); } void lru_add_drain(void) From 380d70549301a3ae6cf5f4ac90d62f648f516ff7 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Wed, 10 Jul 2024 20:13:14 -0600 Subject: [PATCH 019/416] mm/swap: rename cpu_fbatches->activate Rename cpu_fbatches->activate to cpu_fbatches->lru_activate, and its handler folio_activate_fn() to lru_activate() so that all the boilerplate can be removed at the end of this series. 
Link: https://lkml.kernel.org/r/20240711021317.596178-3-yuzhao@google.com Signed-off-by: Yu Zhao Signed-off-by: Andrew Morton --- mm/swap.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 952e4aac6eb1..e4745b88a964 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -67,7 +67,7 @@ struct cpu_fbatches { struct folio_batch lru_deactivate; struct folio_batch lru_lazyfree; #ifdef CONFIG_SMP - struct folio_batch activate; + struct folio_batch lru_activate; #endif }; static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches) = { @@ -331,7 +331,7 @@ void lru_note_cost_refault(struct folio *folio) folio_nr_pages(folio), 0); } -static void folio_activate_fn(struct lruvec *lruvec, struct folio *folio) +static void lru_activate(struct lruvec *lruvec, struct folio *folio) { long nr_pages = folio_nr_pages(folio); @@ -351,10 +351,10 @@ static void folio_activate_fn(struct lruvec *lruvec, struct folio *folio) #ifdef CONFIG_SMP static void folio_activate_drain(int cpu) { - struct folio_batch *fbatch = &per_cpu(cpu_fbatches.activate, cpu); + struct folio_batch *fbatch = &per_cpu(cpu_fbatches.lru_activate, cpu); if (folio_batch_count(fbatch)) - folio_batch_move_lru(fbatch, folio_activate_fn); + folio_batch_move_lru(fbatch, lru_activate); } void folio_activate(struct folio *folio) @@ -371,8 +371,8 @@ void folio_activate(struct folio *folio) } local_lock(&cpu_fbatches.lock); - fbatch = this_cpu_ptr(&cpu_fbatches.activate); - folio_batch_add_and_move(fbatch, folio, folio_activate_fn); + fbatch = this_cpu_ptr(&cpu_fbatches.lru_activate); + folio_batch_add_and_move(fbatch, folio, lru_activate); local_unlock(&cpu_fbatches.lock); } @@ -389,7 +389,7 @@ void folio_activate(struct folio *folio) return; lruvec = folio_lruvec_lock_irq(folio); - folio_activate_fn(lruvec, folio); + lru_activate(lruvec, folio); unlock_page_lruvec_irq(lruvec); folio_set_lru(folio); } @@ -490,7 +490,7 @@ void folio_mark_accessed(struct folio *folio) } else if (!folio_test_active(folio)) { /* * If the folio is on the LRU, queue it for activation via - * cpu_fbatches.activate. Otherwise, assume the folio is in a + * cpu_fbatches.lru_activate. Otherwise, assume the folio is in a * folio_batch, mark it active and it'll be moved to the active * LRU on the next drain. */ @@ -829,7 +829,7 @@ static bool cpu_needs_drain(unsigned int cpu) folio_batch_count(&fbatches->lru_deactivate_file) || folio_batch_count(&fbatches->lru_deactivate) || folio_batch_count(&fbatches->lru_lazyfree) || - folio_batch_count(&fbatches->activate) || + folio_batch_count(&fbatches->lru_activate) || need_mlock_drain(cpu) || has_bh_in_lru(cpu, NULL); } From 2f52c77128b1f96b23c987a25dfc2f459634cc07 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Wed, 10 Jul 2024 20:13:15 -0600 Subject: [PATCH 020/416] mm/swap: fold lru_rotate into cpu_fbatches Fold lru_rotate into cpu_fbatches, and rename the folio_batch and the lock protecting it to lru_move_tail and lock_irq respectively so that all the boilerplate can be removed at the end of this series. Also remove data_race() around folio_batch_count(), which is out of place: all folio_batch_count() calls on remote cpu_fbatches are subject to data_race(), and therefore data_race() should be inside folio_batch_count(). 
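For illustration only (this series simply drops the out-of-place annotation rather than changing the helper), putting the annotation inside the helper would look roughly like:

  /* Sketch, not part of this series. */
  static inline unsigned int folio_batch_count(struct folio_batch *fbatch)
  {
          return data_race(fbatch->nr);
  }

so that remote-CPU readers such as cpu_needs_drain() would not need their own data_race() wrappers.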
Link: https://lkml.kernel.org/r/20240711021317.596178-4-yuzhao@google.com Signed-off-by: Yu Zhao Signed-off-by: Andrew Morton --- mm/swap.c | 36 ++++++++++++++++-------------------- 1 file changed, 16 insertions(+), 20 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index e4745b88a964..774ae9eab1e6 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -47,20 +47,11 @@ int page_cluster; const int page_cluster_max = 31; -/* Protecting only lru_rotate.fbatch which requires disabling interrupts */ -struct lru_rotate { - local_lock_t lock; - struct folio_batch fbatch; -}; -static DEFINE_PER_CPU(struct lru_rotate, lru_rotate) = { - .lock = INIT_LOCAL_LOCK(lock), -}; - -/* - * The following folio batches are grouped together because they are protected - * by disabling preemption (and interrupts remain enabled). - */ struct cpu_fbatches { + /* + * The following folio batches are grouped together because they are protected + * by disabling preemption (and interrupts remain enabled). + */ local_lock_t lock; struct folio_batch lru_add; struct folio_batch lru_deactivate_file; @@ -69,9 +60,14 @@ struct cpu_fbatches { #ifdef CONFIG_SMP struct folio_batch lru_activate; #endif + /* Protecting the following batches which require disabling interrupts */ + local_lock_t lock_irq; + struct folio_batch lru_move_tail; }; + static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches) = { .lock = INIT_LOCAL_LOCK(lock), + .lock_irq = INIT_LOCAL_LOCK(lock_irq), }; static void __page_cache_release(struct folio *folio, struct lruvec **lruvecp, @@ -267,10 +263,10 @@ void folio_rotate_reclaimable(struct folio *folio) return; } - local_lock_irqsave(&lru_rotate.lock, flags); - fbatch = this_cpu_ptr(&lru_rotate.fbatch); + local_lock_irqsave(&cpu_fbatches.lock_irq, flags); + fbatch = this_cpu_ptr(&cpu_fbatches.lru_move_tail); folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn); - local_unlock_irqrestore(&lru_rotate.lock, flags); + local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags); } void lru_note_cost(struct lruvec *lruvec, bool file, @@ -668,15 +664,15 @@ void lru_add_drain_cpu(int cpu) if (folio_batch_count(fbatch)) folio_batch_move_lru(fbatch, lru_add_fn); - fbatch = &per_cpu(lru_rotate.fbatch, cpu); + fbatch = &fbatches->lru_move_tail; /* Disabling interrupts below acts as a compiler barrier. */ if (data_race(folio_batch_count(fbatch))) { unsigned long flags; /* No harm done if a racing interrupt already did this */ - local_lock_irqsave(&lru_rotate.lock, flags); + local_lock_irqsave(&cpu_fbatches.lock_irq, flags); folio_batch_move_lru(fbatch, lru_move_tail_fn); - local_unlock_irqrestore(&lru_rotate.lock, flags); + local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags); } fbatch = &fbatches->lru_deactivate_file; @@ -825,7 +821,7 @@ static bool cpu_needs_drain(unsigned int cpu) /* Check these in order of likelihood that they're not zero */ return folio_batch_count(&fbatches->lru_add) || - data_race(folio_batch_count(&per_cpu(lru_rotate.fbatch, cpu))) || + folio_batch_count(&fbatches->lru_move_tail) || folio_batch_count(&fbatches->lru_deactivate_file) || folio_batch_count(&fbatches->lru_deactivate) || folio_batch_count(&fbatches->lru_lazyfree) || From bed71b50b0c2dc38c5e94c84dd660add2d1609e0 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Wed, 10 Jul 2024 20:13:16 -0600 Subject: [PATCH 021/416] mm/swap: remove remaining _fn suffix Remove remaining _fn suffix from cpu_fbatches handlers, which are already self-explanatory. 
Link: https://lkml.kernel.org/r/20240711021317.596178-5-yuzhao@google.com Signed-off-by: Yu Zhao Signed-off-by: Andrew Morton --- mm/swap.c | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 774ae9eab1e6..4a66d2f87f26 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -160,7 +160,7 @@ EXPORT_SYMBOL(put_pages_list); typedef void (*move_fn_t)(struct lruvec *lruvec, struct folio *folio); -static void lru_add_fn(struct lruvec *lruvec, struct folio *folio) +static void lru_add(struct lruvec *lruvec, struct folio *folio) { int was_unevictable = folio_test_clear_unevictable(folio); long nr_pages = folio_nr_pages(folio); @@ -230,7 +230,7 @@ static void folio_batch_add_and_move(struct folio_batch *fbatch, folio_batch_move_lru(fbatch, move_fn); } -static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio) +static void lru_move_tail(struct lruvec *lruvec, struct folio *folio) { if (folio_test_unevictable(folio)) return; @@ -265,7 +265,7 @@ void folio_rotate_reclaimable(struct folio *folio) local_lock_irqsave(&cpu_fbatches.lock_irq, flags); fbatch = this_cpu_ptr(&cpu_fbatches.lru_move_tail); - folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn); + folio_batch_add_and_move(fbatch, folio, lru_move_tail); local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags); } @@ -527,7 +527,7 @@ void folio_add_lru(struct folio *folio) folio_get(folio); local_lock(&cpu_fbatches.lock); fbatch = this_cpu_ptr(&cpu_fbatches.lru_add); - folio_batch_add_and_move(fbatch, folio, lru_add_fn); + folio_batch_add_and_move(fbatch, folio, lru_add); local_unlock(&cpu_fbatches.lock); } EXPORT_SYMBOL(folio_add_lru); @@ -571,7 +571,7 @@ void folio_add_lru_vma(struct folio *folio, struct vm_area_struct *vma) * written out by flusher threads as this is much more efficient * than the single-page writeout from reclaim. */ -static void lru_deactivate_file_fn(struct lruvec *lruvec, struct folio *folio) +static void lru_deactivate_file(struct lruvec *lruvec, struct folio *folio) { bool active = folio_test_active(folio); long nr_pages = folio_nr_pages(folio); @@ -612,7 +612,7 @@ static void lru_deactivate_file_fn(struct lruvec *lruvec, struct folio *folio) } } -static void lru_deactivate_fn(struct lruvec *lruvec, struct folio *folio) +static void lru_deactivate(struct lruvec *lruvec, struct folio *folio) { long nr_pages = folio_nr_pages(folio); @@ -628,7 +628,7 @@ static void lru_deactivate_fn(struct lruvec *lruvec, struct folio *folio) __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_pages); } -static void lru_lazyfree_fn(struct lruvec *lruvec, struct folio *folio) +static void lru_lazyfree(struct lruvec *lruvec, struct folio *folio) { long nr_pages = folio_nr_pages(folio); @@ -662,7 +662,7 @@ void lru_add_drain_cpu(int cpu) struct folio_batch *fbatch = &fbatches->lru_add; if (folio_batch_count(fbatch)) - folio_batch_move_lru(fbatch, lru_add_fn); + folio_batch_move_lru(fbatch, lru_add); fbatch = &fbatches->lru_move_tail; /* Disabling interrupts below acts as a compiler barrier. 
*/ @@ -671,21 +671,21 @@ void lru_add_drain_cpu(int cpu) /* No harm done if a racing interrupt already did this */ local_lock_irqsave(&cpu_fbatches.lock_irq, flags); - folio_batch_move_lru(fbatch, lru_move_tail_fn); + folio_batch_move_lru(fbatch, lru_move_tail); local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags); } fbatch = &fbatches->lru_deactivate_file; if (folio_batch_count(fbatch)) - folio_batch_move_lru(fbatch, lru_deactivate_file_fn); + folio_batch_move_lru(fbatch, lru_deactivate_file); fbatch = &fbatches->lru_deactivate; if (folio_batch_count(fbatch)) - folio_batch_move_lru(fbatch, lru_deactivate_fn); + folio_batch_move_lru(fbatch, lru_deactivate); fbatch = &fbatches->lru_lazyfree; if (folio_batch_count(fbatch)) - folio_batch_move_lru(fbatch, lru_lazyfree_fn); + folio_batch_move_lru(fbatch, lru_lazyfree); folio_activate_drain(cpu); } @@ -716,7 +716,7 @@ void deactivate_file_folio(struct folio *folio) local_lock(&cpu_fbatches.lock); fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate_file); - folio_batch_add_and_move(fbatch, folio, lru_deactivate_file_fn); + folio_batch_add_and_move(fbatch, folio, lru_deactivate_file); local_unlock(&cpu_fbatches.lock); } @@ -743,7 +743,7 @@ void folio_deactivate(struct folio *folio) local_lock(&cpu_fbatches.lock); fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate); - folio_batch_add_and_move(fbatch, folio, lru_deactivate_fn); + folio_batch_add_and_move(fbatch, folio, lru_deactivate); local_unlock(&cpu_fbatches.lock); } @@ -770,7 +770,7 @@ void folio_mark_lazyfree(struct folio *folio) local_lock(&cpu_fbatches.lock); fbatch = this_cpu_ptr(&cpu_fbatches.lru_lazyfree); - folio_batch_add_and_move(fbatch, folio, lru_lazyfree_fn); + folio_batch_add_and_move(fbatch, folio, lru_lazyfree); local_unlock(&cpu_fbatches.lock); } From afb6d780b9b17914ba720088dac2bfa2505f36c8 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Wed, 10 Jul 2024 20:13:17 -0600 Subject: [PATCH 022/416] mm/swap: remove boilerplate Remove boilerplate by using a macro to choose the corresponding lock and handler for each folio_batch in cpu_fbatches. 
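To make the offsetof() trick concrete, here is a hand expansion of two call sites (illustrative only): lru_move_tail is declared after lock_irq in struct cpu_fbatches, so the comparison selects the IRQ-disabling lock, while lru_add is declared before it and only disables preemption.

  folio_batch_add_and_move(folio, lru_move_tail, true);
          /* expands to */
          __folio_batch_add_and_move(&cpu_fbatches.lru_move_tail, folio,
                                     lru_move_tail, true /* on_lru */,
                                     true /* disable_irq */);

  folio_batch_add_and_move(folio, lru_add, false);
          /* expands to */
          __folio_batch_add_and_move(&cpu_fbatches.lru_add, folio,
                                     lru_add, false /* on_lru */,
                                     false /* disable_irq */);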
[yuzhao@google.com: handle zero-length local_lock_t] Link: https://lkml.kernel.org/r/Zq_0X04WsqgUnz30@google.com [yuzhao@google.com: fix "BUG: using smp_processor_id() in preemptible"] Link: https://lkml.kernel.org/r/ZqNHHMiHn-9vy_II@google.com Link: https://lkml.kernel.org/r/20240711021317.596178-6-yuzhao@google.com Signed-off-by: Yu Zhao Tested-by: Hugh Dickins Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton --- mm/swap.c | 113 +++++++++++++++++++----------------------------------- 1 file changed, 39 insertions(+), 74 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 4a66d2f87f26..67a246772811 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -220,16 +220,43 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn) folios_put(fbatch); } -static void folio_batch_add_and_move(struct folio_batch *fbatch, - struct folio *folio, move_fn_t move_fn) +static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch, + struct folio *folio, move_fn_t move_fn, + bool on_lru, bool disable_irq) { - if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) && - !lru_cache_disabled()) - return; + unsigned long flags; - folio_batch_move_lru(fbatch, move_fn); + folio_get(folio); + + if (on_lru && !folio_test_clear_lru(folio)) { + folio_put(folio); + return; + } + + if (disable_irq) + local_lock_irqsave(&cpu_fbatches.lock_irq, flags); + else + local_lock(&cpu_fbatches.lock); + + if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || folio_test_large(folio) || + lru_cache_disabled()) + folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn); + + if (disable_irq) + local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags); + else + local_unlock(&cpu_fbatches.lock); } +#define folio_batch_add_and_move(folio, op, on_lru) \ + __folio_batch_add_and_move( \ + &cpu_fbatches.op, \ + folio, \ + op, \ + on_lru, \ + offsetof(struct cpu_fbatches, op) >= offsetof(struct cpu_fbatches, lock_irq) \ + ) + static void lru_move_tail(struct lruvec *lruvec, struct folio *folio) { if (folio_test_unevictable(folio)) @@ -250,23 +277,11 @@ static void lru_move_tail(struct lruvec *lruvec, struct folio *folio) */ void folio_rotate_reclaimable(struct folio *folio) { - struct folio_batch *fbatch; - unsigned long flags; - if (folio_test_locked(folio) || folio_test_dirty(folio) || folio_test_unevictable(folio)) return; - folio_get(folio); - if (!folio_test_clear_lru(folio)) { - folio_put(folio); - return; - } - - local_lock_irqsave(&cpu_fbatches.lock_irq, flags); - fbatch = this_cpu_ptr(&cpu_fbatches.lru_move_tail); - folio_batch_add_and_move(fbatch, folio, lru_move_tail); - local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags); + folio_batch_add_and_move(folio, lru_move_tail, true); } void lru_note_cost(struct lruvec *lruvec, bool file, @@ -355,21 +370,10 @@ static void folio_activate_drain(int cpu) void folio_activate(struct folio *folio) { - struct folio_batch *fbatch; - if (folio_test_active(folio) || folio_test_unevictable(folio)) return; - folio_get(folio); - if (!folio_test_clear_lru(folio)) { - folio_put(folio); - return; - } - - local_lock(&cpu_fbatches.lock); - fbatch = this_cpu_ptr(&cpu_fbatches.lru_activate); - folio_batch_add_and_move(fbatch, folio, lru_activate); - local_unlock(&cpu_fbatches.lock); + folio_batch_add_and_move(folio, lru_activate, true); } #else @@ -513,8 +517,6 @@ EXPORT_SYMBOL(folio_mark_accessed); */ void folio_add_lru(struct folio *folio) { - struct folio_batch *fbatch; - VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio); 
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); @@ -524,11 +526,7 @@ void folio_add_lru(struct folio *folio) lru_gen_in_fault() && !(current->flags & PF_MEMALLOC)) folio_set_active(folio); - folio_get(folio); - local_lock(&cpu_fbatches.lock); - fbatch = this_cpu_ptr(&cpu_fbatches.lru_add); - folio_batch_add_and_move(fbatch, folio, lru_add); - local_unlock(&cpu_fbatches.lock); + folio_batch_add_and_move(folio, lru_add, false); } EXPORT_SYMBOL(folio_add_lru); @@ -702,22 +700,11 @@ void lru_add_drain_cpu(int cpu) */ void deactivate_file_folio(struct folio *folio) { - struct folio_batch *fbatch; - /* Deactivating an unevictable folio will not accelerate reclaim */ if (folio_test_unevictable(folio)) return; - folio_get(folio); - if (!folio_test_clear_lru(folio)) { - folio_put(folio); - return; - } - - local_lock(&cpu_fbatches.lock); - fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate_file); - folio_batch_add_and_move(fbatch, folio, lru_deactivate_file); - local_unlock(&cpu_fbatches.lock); + folio_batch_add_and_move(folio, lru_deactivate_file, true); } /* @@ -730,21 +717,10 @@ void deactivate_file_folio(struct folio *folio) */ void folio_deactivate(struct folio *folio) { - struct folio_batch *fbatch; - if (folio_test_unevictable(folio) || !(folio_test_active(folio) || lru_gen_enabled())) return; - folio_get(folio); - if (!folio_test_clear_lru(folio)) { - folio_put(folio); - return; - } - - local_lock(&cpu_fbatches.lock); - fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate); - folio_batch_add_and_move(fbatch, folio, lru_deactivate); - local_unlock(&cpu_fbatches.lock); + folio_batch_add_and_move(folio, lru_deactivate, true); } /** @@ -756,22 +732,11 @@ void folio_deactivate(struct folio *folio) */ void folio_mark_lazyfree(struct folio *folio) { - struct folio_batch *fbatch; - if (!folio_test_anon(folio) || !folio_test_swapbacked(folio) || folio_test_swapcache(folio) || folio_test_unevictable(folio)) return; - folio_get(folio); - if (!folio_test_clear_lru(folio)) { - folio_put(folio); - return; - } - - local_lock(&cpu_fbatches.lock); - fbatch = this_cpu_ptr(&cpu_fbatches.lru_lazyfree); - folio_batch_add_and_move(fbatch, folio, lru_lazyfree); - local_unlock(&cpu_fbatches.lock); + folio_batch_add_and_move(folio, lru_lazyfree, true); } void lru_add_drain(void) From c495b97624d0c059b0403e26dadb166d69918409 Mon Sep 17 00:00:00 2001 From: Zhiguo Jiang Date: Wed, 10 Jul 2024 16:36:41 +0800 Subject: [PATCH 023/416] mm: shrink skip folio mapped by an exiting process Releasing a non-shared anonymous folio mapped solely by an exiting process may go through two flows: 1) the anonymous folio is first swapped out to swap space and transformed into a swp_entry in shrink_folio_list(); 2) the swp_entry is then released in the process exit flow. This results in a high CPU load for releasing a non-shared anonymous folio mapped solely by an exiting process. It is likely to happen when low system memory and an exiting process coexist, because such a folio may be reclaimed by shrink_folio_list(). With this patch, reclaim skips the non-shared anonymous folio solely mapped by an exiting process; the folio is instead released directly in the process exit flow, which saves swap-out time and alleviates the load of the exiting process. Barry provided some effectiveness testing in [1]. 
"I observed that this patch effectively skipped 6114 folios (either 4KB or 64KB mTHP), potentially reducing the swap-out by up to 92MB (97,300,480 bytes) during the process exit. The working set size is 256MB." Link: https://lkml.kernel.org/r/20240710083641.546-1-justinjiang@vivo.com Link: https://lore.kernel.org/linux-mm/20240710033212.36497-1-21cnbao@gmail.com/ [1] Signed-off-by: Zhiguo Jiang Acked-by: Barry Song Cc: David Hildenbrand Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- mm/rmap.c | 15 +++++++++++++++ mm/vmscan.c | 7 ++++++- 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/mm/rmap.c b/mm/rmap.c index 2490e727e2dc..2630bde38640 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -75,6 +75,7 @@ #include #include #include +#include #include @@ -870,6 +871,20 @@ static bool folio_referenced_one(struct folio *folio, continue; } + /* + * Skip the non-shared swapbacked folio mapped solely by + * the exiting or OOM-reaped process. This avoids redundant + * swap-out followed by an immediate unmap. + */ + if ((!atomic_read(&vma->vm_mm->mm_users) || + check_stable_address_space(vma->vm_mm)) && + folio_test_anon(folio) && folio_test_swapbacked(folio) && + !folio_likely_mapped_shared(folio)) { + pra->referenced = -1; + page_vma_mapped_walk_done(&pvmw); + return false; + } + if (pvmw.pte) { if (lru_gen_enabled() && pte_young(ptep_get(pvmw.pte))) { diff --git a/mm/vmscan.c b/mm/vmscan.c index c9682c8a1d75..852d97c134df 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -863,7 +863,12 @@ static enum folio_references folio_check_references(struct folio *folio, if (vm_flags & VM_LOCKED) return FOLIOREF_ACTIVATE; - /* rmap lock contention: rotate */ + /* + * There are two cases to consider. + * 1) Rmap lock contention: rotate. + * 2) Skip the non-shared swapbacked folio mapped solely by + * the exiting or OOM-reaped process. + */ if (referenced_ptes == -1) return FOLIOREF_KEEP; From 9db298a439f2be7b32d48c83c3349df62256ef3a Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 24 Jul 2024 20:33:20 +0000 Subject: [PATCH 024/416] memcg: increase the valid index range for memcg stats Patch series "Kernel stack usage histogram", v6. Provide histogram of stack sizes for the exited threads: Example outputs: Intel: $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 ARM with 64K page_size: $ grep kstack /proc/vmstat kstack_1k 1 kstack_2k 340 kstack_4k 25212 kstack_8k 1659 kstack_16k 0 kstack_32k 0 kstack_64k 0 This patch (of 3): At the moment the valid index for the indirection tables for memcg stats and events is < S8_MAX. These indirection tables are used in performance critical codepaths. With the latest addition to the vm_events, the NR_VM_EVENT_ITEMS has gone over S8_MAX. One way to resolve is to increase the entry size of the indirection table from int8_t to int16_t but this will increase the potential number of cachelines needed to access the indirection table. This patch took a different approach and make the valid index < U8_MAX. In this way the size of the indirection tables will remain same and we only need to invalid index check from less than 0 to equal to U8_MAX. In this approach we have also removed a subtraction from the performance critical codepaths. 
[pasha.tatashin@soleen.com: v6] Link: https://lkml.kernel.org/r/20240730150158.832783-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240724203322.2765486-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240724203322.2765486-2-pasha.tatashin@soleen.com Signed-off-by: Shakeel Butt Co-developed-by: Pasha Tatashin Signed-off-by: Pasha Tatashin Cc: Domenico Cerasuolo Cc: Kent Overstreet Cc: Li Zhijian Cc: Matthew Wilcox Cc: Nhat Pham Cc: Peter Zijlstra Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/memcontrol.c | 50 +++++++++++++++++++++++++++---------------------- 1 file changed, 28 insertions(+), 22 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f0297e7c3d12..c0ee0f2f8176 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -320,24 +320,27 @@ static const unsigned int memcg_stat_items[] = { #define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items) #define MEMCG_VMSTAT_SIZE (NR_MEMCG_NODE_STAT_ITEMS + \ ARRAY_SIZE(memcg_stat_items)) -static int8_t mem_cgroup_stats_index[MEMCG_NR_STAT] __read_mostly; +#define BAD_STAT_IDX(index) ((u32)(index) >= U8_MAX) +static u8 mem_cgroup_stats_index[MEMCG_NR_STAT] __read_mostly; static void init_memcg_stats(void) { - int8_t i, j = 0; + u8 i, j = 0; - BUILD_BUG_ON(MEMCG_NR_STAT >= S8_MAX); + BUILD_BUG_ON(MEMCG_NR_STAT >= U8_MAX); - for (i = 0; i < NR_MEMCG_NODE_STAT_ITEMS; ++i) - mem_cgroup_stats_index[memcg_node_stat_items[i]] = ++j; + memset(mem_cgroup_stats_index, U8_MAX, sizeof(mem_cgroup_stats_index)); - for (i = 0; i < ARRAY_SIZE(memcg_stat_items); ++i) - mem_cgroup_stats_index[memcg_stat_items[i]] = ++j; + for (i = 0; i < NR_MEMCG_NODE_STAT_ITEMS; ++i, ++j) + mem_cgroup_stats_index[memcg_node_stat_items[i]] = j; + + for (i = 0; i < ARRAY_SIZE(memcg_stat_items); ++i, ++j) + mem_cgroup_stats_index[memcg_stat_items[i]] = j; } static inline int memcg_stats_index(int idx) { - return mem_cgroup_stats_index[idx] - 1; + return mem_cgroup_stats_index[idx]; } struct lruvec_stats_percpu { @@ -369,7 +372,7 @@ unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx) return node_page_state(lruvec_pgdat(lruvec), idx); i = memcg_stats_index(idx); - if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx)) + if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) return 0; pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); @@ -392,7 +395,7 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec, return node_page_state(lruvec_pgdat(lruvec), idx); i = memcg_stats_index(idx); - if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx)) + if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) return 0; pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); @@ -435,21 +438,24 @@ static const unsigned int memcg_vm_event_stat[] = { }; #define NR_MEMCG_EVENTS ARRAY_SIZE(memcg_vm_event_stat) -static int8_t mem_cgroup_events_index[NR_VM_EVENT_ITEMS] __read_mostly; +static u8 mem_cgroup_events_index[NR_VM_EVENT_ITEMS] __read_mostly; static void init_memcg_events(void) { - int8_t i; + u8 i; - BUILD_BUG_ON(NR_VM_EVENT_ITEMS >= S8_MAX); + BUILD_BUG_ON(NR_VM_EVENT_ITEMS >= U8_MAX); + + memset(mem_cgroup_events_index, U8_MAX, + sizeof(mem_cgroup_events_index)); for (i = 0; i < NR_MEMCG_EVENTS; ++i) - mem_cgroup_events_index[memcg_vm_event_stat[i]] = i + 1; + mem_cgroup_events_index[memcg_vm_event_stat[i]] = i; } static inline int memcg_events_index(enum vm_event_item idx) { - 
return mem_cgroup_events_index[idx] - 1; + return mem_cgroup_events_index[idx]; } struct memcg_vmstats_percpu { @@ -621,7 +627,7 @@ unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) long x; int i = memcg_stats_index(idx); - if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx)) + if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) return 0; x = READ_ONCE(memcg->vmstats->state[i]); @@ -662,7 +668,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, if (mem_cgroup_disabled()) return; - if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx)) + if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) return; __this_cpu_add(memcg->vmstats_percpu->state[i], val); @@ -675,7 +681,7 @@ unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) long x; int i = memcg_stats_index(idx); - if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx)) + if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) return 0; x = READ_ONCE(memcg->vmstats->state_local[i]); @@ -694,7 +700,7 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec, struct mem_cgroup *memcg; int i = memcg_stats_index(idx); - if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx)) + if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) return; pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); @@ -810,7 +816,7 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, if (mem_cgroup_disabled()) return; - if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, idx)) + if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx)) return; memcg_stats_lock(); @@ -823,7 +829,7 @@ unsigned long memcg_events(struct mem_cgroup *memcg, int event) { int i = memcg_events_index(event); - if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, event)) + if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, event)) return 0; return READ_ONCE(memcg->vmstats->events[i]); @@ -833,7 +839,7 @@ unsigned long memcg_events_local(struct mem_cgroup *memcg, int event) { int i = memcg_events_index(event); - if (WARN_ONCE(i < 0, "%s: missing stat item %d\n", __func__, event)) + if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, event)) return 0; return READ_ONCE(memcg->vmstats->events_local[i]); From c4a6fce85640beb4f79e8217272d1ef79839cc17 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Wed, 24 Jul 2024 20:33:21 +0000 Subject: [PATCH 025/416] vmstat: kernel stack usage histogram As part of the dynamic kernel stack project, we need to know the amount of data that can be saved by reducing the default kernel stack size [1]. Provide a kernel stack usage histogram to aid in optimizing kernel stack sizes and minimizing memory waste in large-scale environments. The histogram divides stack usage into power-of-two buckets and reports the results in /proc/vmstat. This information is especially valuable in environments with millions of machines, where even small optimizations can have a significant impact. The histogram data is presented in /proc/vmstat with entries like "kstack_1k", "kstack_2k", and so on, indicating the number of threads that exited with stack usage falling within each respective bucket. 
Example outputs: Intel: $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 ARM with 64K page_size: $ grep kstack /proc/vmstat kstack_1k 1 kstack_2k 340 kstack_4k 25212 kstack_8k 1659 kstack_16k 0 kstack_32k 0 kstack_64k 0 Note: once the dynamic kernel stack is implemented it will depend on the implementation the usability of this feature: On hardware that supports faults on kernel stacks, we will have other metrics that show the total number of pages allocated for stacks. On hardware where faults are not supported, we will most likely have some optimization where only some threads are extended, and for those, these metrics will still be very useful. [1] https://lwn.net/Articles/974367 Link: https://lkml.kernel.org/r/20240730150158.832783-3-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240724203322.2765486-3-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Kent Overstreet Acked-by: Shakeel Butt Cc: Domenico Cerasuolo Cc: Li Zhijian Cc: Matthew Wilcox Cc: Nhat Pham Cc: Peter Zijlstra Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/vm_event_item.h | 24 ++++++++++++++++++++++ kernel/exit.c | 38 +++++++++++++++++++++++++++++++++++ mm/vmstat.c | 24 ++++++++++++++++++++++ 3 files changed, 86 insertions(+) diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 6fff8ed3b1fd..aae5c7c5cfb4 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -155,6 +155,30 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, VMA_LOCK_RETRY, VMA_LOCK_MISS, #endif +#ifdef CONFIG_DEBUG_STACK_USAGE + KSTACK_1K, +#if THREAD_SIZE > 1024 + KSTACK_2K, +#endif +#if THREAD_SIZE > 2048 + KSTACK_4K, +#endif +#if THREAD_SIZE > 4096 + KSTACK_8K, +#endif +#if THREAD_SIZE > 8192 + KSTACK_16K, +#endif +#if THREAD_SIZE > 16384 + KSTACK_32K, +#endif +#if THREAD_SIZE > 32768 + KSTACK_64K, +#endif +#if THREAD_SIZE > 65536 + KSTACK_REST, +#endif +#endif /* CONFIG_DEBUG_STACK_USAGE */ NR_VM_EVENT_ITEMS }; diff --git a/kernel/exit.c b/kernel/exit.c index 7430852a8571..64bfc2bae55b 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -778,6 +778,43 @@ static void exit_notify(struct task_struct *tsk, int group_dead) } #ifdef CONFIG_DEBUG_STACK_USAGE +/* Count the maximum pages reached in kernel stacks */ +static inline void kstack_histogram(unsigned long used_stack) +{ +#ifdef CONFIG_VM_EVENT_COUNTERS + if (used_stack <= 1024) + count_vm_event(KSTACK_1K); +#if THREAD_SIZE > 1024 + else if (used_stack <= 2048) + count_vm_event(KSTACK_2K); +#endif +#if THREAD_SIZE > 2048 + else if (used_stack <= 4096) + count_vm_event(KSTACK_4K); +#endif +#if THREAD_SIZE > 4096 + else if (used_stack <= 8192) + count_vm_event(KSTACK_8K); +#endif +#if THREAD_SIZE > 8192 + else if (used_stack <= 16384) + count_vm_event(KSTACK_16K); +#endif +#if THREAD_SIZE > 16384 + else if (used_stack <= 32768) + count_vm_event(KSTACK_32K); +#endif +#if THREAD_SIZE > 32768 + else if (used_stack <= 65536) + count_vm_event(KSTACK_64K); +#endif +#if THREAD_SIZE > 65536 + else + count_vm_event(KSTACK_REST); +#endif +#endif /* CONFIG_VM_EVENT_COUNTERS */ +} + static void check_stack_usage(void) { static DEFINE_SPINLOCK(low_water_lock); @@ -785,6 +822,7 @@ static void check_stack_usage(void) unsigned long free; free = stack_not_used(current); + kstack_histogram(THREAD_SIZE - free); if (free >= lowest_to_date) return; diff --git a/mm/vmstat.c b/mm/vmstat.c index 1ac2a569b6cf..2dc97baf7d62 100644 --- 
a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1436,6 +1436,30 @@ const char * const vmstat_text[] = { "vma_lock_retry", "vma_lock_miss", #endif +#ifdef CONFIG_DEBUG_STACK_USAGE + "kstack_1k", +#if THREAD_SIZE > 1024 + "kstack_2k", +#endif +#if THREAD_SIZE > 2048 + "kstack_4k", +#endif +#if THREAD_SIZE > 4096 + "kstack_8k", +#endif +#if THREAD_SIZE > 8192 + "kstack_16k", +#endif +#if THREAD_SIZE > 16384 + "kstack_32k", +#endif +#if THREAD_SIZE > 32768 + "kstack_64k", +#endif +#if THREAD_SIZE > 65536 + "kstack_rest", +#endif +#endif #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */ }; #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */ From fbe76a6557a83af5ef3819fd7b7ffd0a5d3b4e51 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Wed, 24 Jul 2024 20:33:22 +0000 Subject: [PATCH 026/416] task_stack: uninline stack_not_used Given that stack_not_used() is not performance critical function uninline it. Link: https://lkml.kernel.org/r/20240730150158.832783-4-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240724203322.2765486-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Acked-by: Shakeel Butt Cc: Domenico Cerasuolo Cc: Kent Overstreet Cc: Li Zhijian Cc: Matthew Wilcox Cc: Nhat Pham Cc: Peter Zijlstra Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/sched/task_stack.h | 18 +++--------------- kernel/exit.c | 19 +++++++++++++++++++ kernel/sched/core.c | 4 +--- 3 files changed, 23 insertions(+), 18 deletions(-) diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h index ccd72b978e1f..bf10bdb487dd 100644 --- a/include/linux/sched/task_stack.h +++ b/include/linux/sched/task_stack.h @@ -95,23 +95,11 @@ static inline int object_is_on_stack(const void *obj) extern void thread_stack_cache_init(void); #ifdef CONFIG_DEBUG_STACK_USAGE +unsigned long stack_not_used(struct task_struct *p); +#else static inline unsigned long stack_not_used(struct task_struct *p) { - unsigned long *n = end_of_stack(p); - - do { /* Skip over canary */ -# ifdef CONFIG_STACK_GROWSUP - n--; -# else - n++; -# endif - } while (!*n); - -# ifdef CONFIG_STACK_GROWSUP - return (unsigned long)end_of_stack(p) - (unsigned long)n; -# else - return (unsigned long)n - (unsigned long)end_of_stack(p); -# endif + return 0; } #endif extern void set_task_stack_end_magic(struct task_struct *tsk); diff --git a/kernel/exit.c b/kernel/exit.c index 64bfc2bae55b..45085a0e7c16 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -778,6 +778,25 @@ static void exit_notify(struct task_struct *tsk, int group_dead) } #ifdef CONFIG_DEBUG_STACK_USAGE +unsigned long stack_not_used(struct task_struct *p) +{ + unsigned long *n = end_of_stack(p); + + do { /* Skip over canary */ +# ifdef CONFIG_STACK_GROWSUP + n--; +# else + n++; +# endif + } while (!*n); + +# ifdef CONFIG_STACK_GROWSUP + return (unsigned long)end_of_stack(p) - (unsigned long)n; +# else + return (unsigned long)n - (unsigned long)end_of_stack(p); +# endif +} + /* Count the maximum pages reached in kernel stacks */ static inline void kstack_histogram(unsigned long used_stack) { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f3951e4a55e5..43e701f54013 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7405,7 +7405,7 @@ EXPORT_SYMBOL(io_schedule); void sched_show_task(struct task_struct *p) { - unsigned long free = 0; + unsigned long free; int ppid; if (!try_get_task_stack(p)) @@ -7415,9 +7415,7 @@ void sched_show_task(struct task_struct *p) if 
(task_is_running(p)) pr_cont(" running task "); -#ifdef CONFIG_DEBUG_STACK_USAGE free = stack_not_used(p); -#endif ppid = 0; rcu_read_lock(); if (pid_alive(p)) From 6c99d4eb7c5e0999349cdb9d824ea0ac450d0c8f Mon Sep 17 00:00:00 2001 From: Pavel Tikhomirov Date: Thu, 25 Jul 2024 12:12:15 +0800 Subject: [PATCH 027/416] kmemleak: enable tracking for percpu pointers Patch series "kmemleak: support for percpu memory leak detect'. This is a rework of this series: https://lore.kernel.org/lkml/20200921020007.35803-1-chenjun102@huawei.com/ Originally I was investigating a percpu leak on our customer nodes and having this functionality was a huge help, which lead to this fix [1]. So probably it's a good idea to have it in mainstream too, especially as after [2] it became much easier to implement (we already have a separate tree for percpu pointers). [1] commit 0af8c09c89681 ("netfilter: x_tables: fix percpu counter block leak on error path when creating new netns") [2] commit 39042079a0c24 ("kmemleak: avoid RCU stalls when freeing metadata for per-CPU pointers") This patch (of 2): This basically does: - Add min_percpu_addr and max_percpu_addr to filter out unrelated data similar to min_addr and max_addr; - Set min_count for percpu pointers to 1 to start tracking them; - Calculate checksum of percpu area as xor of crc32 for each cpu; - Split pointer lookup and update refs code into separate helper and use it twice: once as if the pointer is a virtual pointer and once as if it's percpu. [ptikhomirov@virtuozzo.com: v2] Link: https://lkml.kernel.org/r/20240731025526.157529-2-ptikhomirov@virtuozzo.com Link: https://lkml.kernel.org/r/20240725041223.872472-1-ptikhomirov@virtuozzo.com Link: https://lkml.kernel.org/r/20240725041223.872472-2-ptikhomirov@virtuozzo.com Signed-off-by: Pavel Tikhomirov Reviewed-by: Catalin Marinas Cc: Wei Yongjun Cc: Chen Jun Cc: Alexander Mikhalitsyn Signed-off-by: Andrew Morton --- mm/kmemleak.c | 153 +++++++++++++++++++++++++++++++------------------- 1 file changed, 94 insertions(+), 59 deletions(-) diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 764b08100570..6b498c6d9c34 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -224,6 +224,10 @@ static int kmemleak_error; static unsigned long min_addr = ULONG_MAX; static unsigned long max_addr; +/* minimum and maximum address that may be valid per-CPU pointers */ +static unsigned long min_percpu_addr = ULONG_MAX; +static unsigned long max_percpu_addr; + static struct task_struct *scan_thread; /* used to avoid reporting of recently allocated objects */ static unsigned long jiffies_min_age; @@ -294,13 +298,20 @@ static void hex_dump_object(struct seq_file *seq, const u8 *ptr = (const u8 *)object->pointer; size_t len; - if (WARN_ON_ONCE(object->flags & (OBJECT_PHYS | OBJECT_PERCPU))) + if (WARN_ON_ONCE(object->flags & OBJECT_PHYS)) return; + if (object->flags & OBJECT_PERCPU) + ptr = (const u8 *)this_cpu_ptr((void __percpu *)object->pointer); + /* limit the number of lines to HEX_MAX_LINES */ len = min_t(size_t, object->size, HEX_MAX_LINES * HEX_ROW_SIZE); - warn_or_seq_printf(seq, " hex dump (first %zu bytes):\n", len); + if (object->flags & OBJECT_PERCPU) + warn_or_seq_printf(seq, " hex dump (first %zu bytes on cpu %d):\n", + len, raw_smp_processor_id()); + else + warn_or_seq_printf(seq, " hex dump (first %zu bytes):\n", len); kasan_disable_current(); warn_or_seq_hex_dump(seq, DUMP_PREFIX_NONE, HEX_ROW_SIZE, HEX_GROUP_SIZE, kasan_reset_tag((void *)ptr), len, HEX_ASCII); @@ -695,10 +706,14 @@ static int __link_object(struct 
kmemleak_object *object, unsigned long ptr, untagged_ptr = (unsigned long)kasan_reset_tag((void *)ptr); /* - * Only update min_addr and max_addr with object - * storing virtual address. + * Only update min_addr and max_addr with object storing virtual + * address. And update min_percpu_addr max_percpu_addr for per-CPU + * objects. */ - if (!(objflags & (OBJECT_PHYS | OBJECT_PERCPU))) { + if (objflags & OBJECT_PERCPU) { + min_percpu_addr = min(min_percpu_addr, untagged_ptr); + max_percpu_addr = max(max_percpu_addr, untagged_ptr + size); + } else if (!(objflags & OBJECT_PHYS)) { min_addr = min(min_addr, untagged_ptr); max_addr = max(max_addr, untagged_ptr + size); } @@ -1055,12 +1070,8 @@ void __ref kmemleak_alloc_percpu(const void __percpu *ptr, size_t size, { pr_debug("%s(0x%px, %zu)\n", __func__, ptr, size); - /* - * Percpu allocations are only scanned and not reported as leaks - * (min_count is set to 0). - */ if (kmemleak_enabled && ptr && !IS_ERR(ptr)) - create_object_percpu((unsigned long)ptr, size, 0, gfp); + create_object_percpu((unsigned long)ptr, size, 1, gfp); } EXPORT_SYMBOL_GPL(kmemleak_alloc_percpu); @@ -1304,12 +1315,23 @@ static bool update_checksum(struct kmemleak_object *object) { u32 old_csum = object->checksum; - if (WARN_ON_ONCE(object->flags & (OBJECT_PHYS | OBJECT_PERCPU))) + if (WARN_ON_ONCE(object->flags & OBJECT_PHYS)) return false; kasan_disable_current(); kcsan_disable_current(); - object->checksum = crc32(0, kasan_reset_tag((void *)object->pointer), object->size); + if (object->flags & OBJECT_PERCPU) { + unsigned int cpu; + + object->checksum = 0; + for_each_possible_cpu(cpu) { + void *ptr = per_cpu_ptr((void __percpu *)object->pointer, cpu); + + object->checksum ^= crc32(0, kasan_reset_tag((void *)ptr), object->size); + } + } else { + object->checksum = crc32(0, kasan_reset_tag((void *)object->pointer), object->size); + } kasan_enable_current(); kcsan_enable_current(); @@ -1340,6 +1362,64 @@ static void update_refs(struct kmemleak_object *object) } } +static void pointer_update_refs(struct kmemleak_object *scanned, + unsigned long pointer, unsigned int objflags) +{ + struct kmemleak_object *object; + unsigned long untagged_ptr; + unsigned long excess_ref; + + untagged_ptr = (unsigned long)kasan_reset_tag((void *)pointer); + if (objflags & OBJECT_PERCPU) { + if (untagged_ptr < min_percpu_addr || untagged_ptr >= max_percpu_addr) + return; + } else { + if (untagged_ptr < min_addr || untagged_ptr >= max_addr) + return; + } + + /* + * No need for get_object() here since we hold kmemleak_lock. + * object->use_count cannot be dropped to 0 while the object + * is still present in object_tree_root and object_list + * (with updates protected by kmemleak_lock). + */ + object = __lookup_object(pointer, 1, objflags); + if (!object) + return; + if (object == scanned) + /* self referenced, ignore */ + return; + + /* + * Avoid the lockdep recursive warning on object->lock being + * previously acquired in scan_object(). These locks are + * enclosed by scan_mutex. 
+ */ + raw_spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING); + /* only pass surplus references (object already gray) */ + if (color_gray(object)) { + excess_ref = object->excess_ref; + /* no need for update_refs() if object already gray */ + } else { + excess_ref = 0; + update_refs(object); + } + raw_spin_unlock(&object->lock); + + if (excess_ref) { + object = lookup_object(excess_ref, 0); + if (!object) + return; + if (object == scanned) + /* circular reference, ignore */ + return; + raw_spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING); + update_refs(object); + raw_spin_unlock(&object->lock); + } +} + /* * Memory scanning is a long process and it needs to be interruptible. This * function checks whether such interrupt condition occurred. @@ -1372,13 +1452,10 @@ static void scan_block(void *_start, void *_end, unsigned long *start = PTR_ALIGN(_start, BYTES_PER_POINTER); unsigned long *end = _end - (BYTES_PER_POINTER - 1); unsigned long flags; - unsigned long untagged_ptr; raw_spin_lock_irqsave(&kmemleak_lock, flags); for (ptr = start; ptr < end; ptr++) { - struct kmemleak_object *object; unsigned long pointer; - unsigned long excess_ref; if (scan_should_stop()) break; @@ -1387,50 +1464,8 @@ static void scan_block(void *_start, void *_end, pointer = *(unsigned long *)kasan_reset_tag((void *)ptr); kasan_enable_current(); - untagged_ptr = (unsigned long)kasan_reset_tag((void *)pointer); - if (untagged_ptr < min_addr || untagged_ptr >= max_addr) - continue; - - /* - * No need for get_object() here since we hold kmemleak_lock. - * object->use_count cannot be dropped to 0 while the object - * is still present in object_tree_root and object_list - * (with updates protected by kmemleak_lock). - */ - object = lookup_object(pointer, 1); - if (!object) - continue; - if (object == scanned) - /* self referenced, ignore */ - continue; - - /* - * Avoid the lockdep recursive warning on object->lock being - * previously acquired in scan_object(). These locks are - * enclosed by scan_mutex. - */ - raw_spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING); - /* only pass surplus references (object already gray) */ - if (color_gray(object)) { - excess_ref = object->excess_ref; - /* no need for update_refs() if object already gray */ - } else { - excess_ref = 0; - update_refs(object); - } - raw_spin_unlock(&object->lock); - - if (excess_ref) { - object = lookup_object(excess_ref, 0); - if (!object) - continue; - if (object == scanned) - /* circular reference, ignore */ - continue; - raw_spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING); - update_refs(object); - raw_spin_unlock(&object->lock); - } + pointer_update_refs(scanned, pointer, 0); + pointer_update_refs(scanned, pointer, OBJECT_PERCPU); } raw_spin_unlock_irqrestore(&kmemleak_lock, flags); } From e0b2fdb352b7991664b23ae5e15b537cd79a7820 Mon Sep 17 00:00:00 2001 From: Pavel Tikhomirov Date: Thu, 25 Jul 2024 12:12:16 +0800 Subject: [PATCH 028/416] kmemleak-test: add percpu leak Add a per-CPU memory leak, which will be reported like: unreferenced object 0x3efa840195f8 (size 64): comm "modprobe", pid 4667, jiffies 4294688677 hex dump (first 32 bytes on cpu 0): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 
backtrace (crc 0): [] pcpu_alloc+0x3df/0x840 [] kmemleak_test_init+0x2c9/0x2f0 [kmemleak_test] [] do_one_initcall+0x44/0x300 [] do_init_module+0x60/0x240 [] init_module_from_file+0x86/0xc0 [] idempotent_init_module+0x109/0x2a0 [] __x64_sys_finit_module+0x5a/0xb0 [] do_syscall_64+0x7a/0x160 [] entry_SYSCALL_64_after_hwframe+0x76/0x7e Link: https://lkml.kernel.org/r/20240725041223.872472-3-ptikhomirov@virtuozzo.com Signed-off-by: Pavel Tikhomirov Acked-by: Catalin Marinas Cc: Alexander Mikhalitsyn Cc: Chen Jun Cc: Wei Yongjun Signed-off-by: Andrew Morton --- samples/kmemleak/kmemleak-test.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/samples/kmemleak/kmemleak-test.c b/samples/kmemleak/kmemleak-test.c index f7470ed85a79..544c36d51d56 100644 --- a/samples/kmemleak/kmemleak-test.c +++ b/samples/kmemleak/kmemleak-test.c @@ -79,6 +79,8 @@ static int kmemleak_test_init(void) per_cpu(kmemleak_test_pointer, i)); } + pr_info("__alloc_percpu(64, 4) = %p\n", __alloc_percpu(64, 4)); + return 0; } module_init(kmemleak_test_init); From 6c469957cd172c1bcea8c5b77bc711a245b0934f Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Thu, 25 Jul 2024 10:16:43 +0800 Subject: [PATCH 029/416] mm: hugetlb: remove left over comment about follow_huge_foo() The comment is useless after commit 57a196a58421 ("hugetlb: simplify hugetlb handling in follow_page_mask") since all follow_huge_foo() are killed. Link: https://lkml.kernel.org/r/20240725021643.1358536-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Reviewed-by: David Hildenbrand Reviewed-by: Vishal Moola (Oracle) Cc: Muchun Song Signed-off-by: Andrew Morton --- mm/hugetlb.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index aaf508be0a2b..5a32157ca309 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -7503,10 +7503,6 @@ __weak unsigned long hugetlb_mask_last_page(struct hstate *h) #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */ -/* - * These functions are overwritable if your architecture needs its own - * behavior. - */ bool isolate_hugetlb(struct folio *folio, struct list_head *list) { bool ret = true; From f77bd4b14ccfd38dfcfe67eecad517b8ec1b7f37 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Fri, 26 Jul 2024 20:31:08 +0000 Subject: [PATCH 030/416] mm: memcg: don't call propagate_protected_usage() needlessly Patch series "mm: memcg: page counters optimizations", v3. This patchset contains 3 independent small optimizations of page counters. This patch (of 3): Memory protection (min/low) requires a constant tracking of protected memory usage. propagate_protected_usage() is called on each page counters update and does a number of operations even in cases when the actual memory protection functionality is not supported (e.g. hugetlb cgroups or memcg swap counters). It's obviously inefficient and leads to a waste of CPU cycles. It can be addressed by calling propagate_protected_usage() only for the counters which do support memory guarantees. As of now it's only memcg->memory - the unified memory memcg counter. 
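Condensed from the hunks below, the resulting contract is: protection tracking becomes an init-time opt-in, only the unified memory counter opts in, and every other charge path skips the extra work.

	/* init time: the third argument is the new protection_support flag */
	page_counter_init(&memcg->memory, &parent->memory, true);
	page_counter_init(&memcg->swap,   &parent->swap,   false);

	/* charge paths */
	if (track_protection(counter))
		propagate_protected_usage(counter, new);
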
Link: https://lkml.kernel.org/r/20240726203110.1577216-2-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin Acked-by: Shakeel Butt Acked-by: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Signed-off-by: Andrew Morton --- include/linux/page_counter.h | 8 +++++++- mm/hugetlb_cgroup.c | 4 ++-- mm/memcontrol.c | 16 ++++++++-------- mm/page_counter.c | 17 ++++++++++++++--- 4 files changed, 31 insertions(+), 14 deletions(-) diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h index 904c52f97284..0b8e993f9163 100644 --- a/include/linux/page_counter.h +++ b/include/linux/page_counter.h @@ -31,6 +31,7 @@ struct page_counter { /* Keep all the read most fields in a separete cacheline. */ CACHELINE_PADDING(_pad2_); + bool protection_support; unsigned long min; unsigned long low; unsigned long high; @@ -44,12 +45,17 @@ struct page_counter { #define PAGE_COUNTER_MAX (LONG_MAX / PAGE_SIZE) #endif +/* + * Protection is supported only for the first counter (with id 0). + */ static inline void page_counter_init(struct page_counter *counter, - struct page_counter *parent) + struct page_counter *parent, + bool protection_support) { atomic_long_set(&counter->usage, 0); counter->max = PAGE_COUNTER_MAX; counter->parent = parent; + counter->protection_support = protection_support; } static inline unsigned long page_counter_read(struct page_counter *counter) diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index 4ff238ba1250..e716c4671a15 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -114,10 +114,10 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup, } page_counter_init(hugetlb_cgroup_counter_from_cgroup(h_cgroup, idx), - fault_parent); + fault_parent, false); page_counter_init( hugetlb_cgroup_counter_from_cgroup_rsvd(h_cgroup, idx), - rsvd_parent); + rsvd_parent, false); limit = round_down(PAGE_COUNTER_MAX, pages_per_huge_page(&hstates[idx])); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c0ee0f2f8176..f92578f13b2e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3596,21 +3596,21 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) if (parent) { WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent)); - page_counter_init(&memcg->memory, &parent->memory); - page_counter_init(&memcg->swap, &parent->swap); + page_counter_init(&memcg->memory, &parent->memory, true); + page_counter_init(&memcg->swap, &parent->swap, false); #ifdef CONFIG_MEMCG_V1 WRITE_ONCE(memcg->oom_kill_disable, READ_ONCE(parent->oom_kill_disable)); - page_counter_init(&memcg->kmem, &parent->kmem); - page_counter_init(&memcg->tcpmem, &parent->tcpmem); + page_counter_init(&memcg->kmem, &parent->kmem, false); + page_counter_init(&memcg->tcpmem, &parent->tcpmem, false); #endif } else { init_memcg_stats(); init_memcg_events(); - page_counter_init(&memcg->memory, NULL); - page_counter_init(&memcg->swap, NULL); + page_counter_init(&memcg->memory, NULL, true); + page_counter_init(&memcg->swap, NULL, false); #ifdef CONFIG_MEMCG_V1 - page_counter_init(&memcg->kmem, NULL); - page_counter_init(&memcg->tcpmem, NULL); + page_counter_init(&memcg->kmem, NULL, false); + page_counter_init(&memcg->tcpmem, NULL, false); #endif root_mem_cgroup = memcg; return &memcg->css; diff --git a/mm/page_counter.c b/mm/page_counter.c index 0153f5bb3161..dd242245bee2 100644 --- a/mm/page_counter.c +++ b/mm/page_counter.c @@ -13,6 +13,11 @@ #include #include +static bool track_protection(struct page_counter *c) +{ + return c->protection_support; +} + static void 
propagate_protected_usage(struct page_counter *c, unsigned long usage) { @@ -57,7 +62,8 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages) new = 0; atomic_long_set(&counter->usage, new); } - propagate_protected_usage(counter, new); + if (track_protection(counter)) + propagate_protected_usage(counter, new); } /** @@ -70,12 +76,14 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages) void page_counter_charge(struct page_counter *counter, unsigned long nr_pages) { struct page_counter *c; + bool protection = track_protection(counter); for (c = counter; c; c = c->parent) { long new; new = atomic_long_add_return(nr_pages, &c->usage); - propagate_protected_usage(c, new); + if (protection) + propagate_protected_usage(c, new); /* * This is indeed racy, but we can live with some * inaccuracy in the watermark. @@ -99,6 +107,7 @@ bool page_counter_try_charge(struct page_counter *counter, struct page_counter **fail) { struct page_counter *c; + bool protection = track_protection(counter); for (c = counter; c; c = c->parent) { long new; @@ -128,7 +137,9 @@ bool page_counter_try_charge(struct page_counter *counter, *fail = c; goto failed; } - propagate_protected_usage(c, new); + if (protection) + propagate_protected_usage(c, new); + /* * Just like with failcnt, we can live with some * inaccuracy in the watermark. From 941ce635234162c420c9185816c59f7d5dd1ad07 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Fri, 26 Jul 2024 20:31:09 +0000 Subject: [PATCH 031/416] mm: page_counters: put page_counter_calculate_protection() under CONFIG_MEMCG Put page_counter_calculate_protection() under CONFIG_MEMCG. The protection functionality (min/low limits) is not supported by any other cgroup subsystem, so page_counter_calculate_protection() and related static effective_protection() can be compiled out if CONFIG_MEMCG is not enabled. 
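The header side follows the usual stub idiom (condensed from the hunk below), so callers compile unchanged and the empty inline vanishes entirely when CONFIG_MEMCG is off:

	#ifdef CONFIG_MEMCG
	void page_counter_calculate_protection(struct page_counter *root,
					       struct page_counter *counter,
					       bool recursive_protection);
	#else
	static inline void page_counter_calculate_protection(struct page_counter *root,
							     struct page_counter *counter,
							     bool recursive_protection) {}
	#endif
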
Link: https://lkml.kernel.org/r/20240726203110.1577216-3-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin Acked-by: Shakeel Butt Acked-by: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Signed-off-by: Andrew Morton --- include/linux/page_counter.h | 6 ++++++ mm/page_counter.c | 2 ++ 2 files changed, 8 insertions(+) diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h index 0b8e993f9163..1b3e92c09b80 100644 --- a/include/linux/page_counter.h +++ b/include/linux/page_counter.h @@ -87,8 +87,14 @@ static inline void page_counter_reset_watermark(struct page_counter *counter) counter->watermark = page_counter_read(counter); } +#ifdef CONFIG_MEMCG void page_counter_calculate_protection(struct page_counter *root, struct page_counter *counter, bool recursive_protection); +#else +static inline void page_counter_calculate_protection(struct page_counter *root, + struct page_counter *counter, + bool recursive_protection) {} +#endif #endif /* _LINUX_PAGE_COUNTER_H */ diff --git a/mm/page_counter.c b/mm/page_counter.c index dd242245bee2..3887bd152b75 100644 --- a/mm/page_counter.c +++ b/mm/page_counter.c @@ -275,6 +275,7 @@ int page_counter_memparse(const char *buf, const char *max, } +#ifdef CONFIG_MEMCG /* * This function calculates an individual page counter's effective * protection which is derived from its own memory.min/low, its @@ -446,3 +447,4 @@ void page_counter_calculate_protection(struct page_counter *root, atomic_long_read(&parent->children_low_usage), recursive_protection)); } +#endif /* CONFIG_MEMCG */ From 57979fabff554bf4d1edeeee69251b22ca9bf55e Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Fri, 26 Jul 2024 20:31:10 +0000 Subject: [PATCH 032/416] mm: page_counters: initialize usage using ATOMIC_LONG_INIT() macro When a page_counter structure is initialized, there is no need to use an atomic set operation to initialize the usage counter because at this point the structure is not visible to anybody else. ATOMIC_LONG_INIT() is what should be used in such cases. Link: https://lkml.kernel.org/r/20240726203110.1577216-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin Acked-by: Shakeel Butt Acked-by: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Signed-off-by: Andrew Morton --- include/linux/page_counter.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h index 1b3e92c09b80..66ebf9a73158 100644 --- a/include/linux/page_counter.h +++ b/include/linux/page_counter.h @@ -52,7 +52,7 @@ static inline void page_counter_init(struct page_counter *counter, struct page_counter *parent, bool protection_support) { - atomic_long_set(&counter->usage, 0); + counter->usage = (atomic_long_t)ATOMIC_LONG_INIT(0); counter->max = PAGE_COUNTER_MAX; counter->parent = parent; counter->protection_support = protection_support; From 394290cba9664ed3ab80d0e247402102a9b7287a Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 26 Jul 2024 17:07:26 +0200 Subject: [PATCH 033/416] mm: turn USE_SPLIT_PTE_PTLOCKS / USE_SPLIT_PTE_PTLOCKS into Kconfig options Patch series "mm: split PTE/PMD PT table Kconfig cleanups+clarifications". This series is a follow up to the fixes: "[PATCH v1 0/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking" When working on the fixes, I wondered why 8xx is fine (-> never uses split PT locks) and how PT locking even works properly with PMD page table sharing (-> always requires split PMD PT locks). 
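For orientation, the conversion at the heart of this patch swaps the old compile-time macros for ordinary Kconfig symbols; condensed below (the full hunks, including the new Kconfig entries, follow later):

	/* before: macros computed in <linux/mm_types_task.h> */
	#define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
	#define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
					 IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))

	/* after: Kconfig symbols, so both C code and other Kconfig options can test them */
	#if defined(CONFIG_SPLIT_PTE_PTLOCKS)
	/* ... split PTE lock variant ... */
	#endif

	if (IS_ENABLED(CONFIG_SPLIT_PTE_PTLOCKS) && !pinned)
		/* ... no preprocessor guard needed at plain call sites ... */;
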
Let's improve the split PT lock detection, make hugetlb properly depend on it and make 8xx bail out if it would ever get enabled by accident. As an alternative to patch #3 we could extend the Kconfig SPLIT_PTE_PTLOCKS option from patch #2 -- but enforcing it closer to the code that actually implements it feels a bit nicer for documentation purposes, and there is no need to actually disable it because it should always be disabled (!SMP). Did a bunch of cross-compilations to make sure that split PTE/PMD PT locks are still getting used where we would expect them. [1] https://lkml.kernel.org/r/20240725183955.2268884-1-david@redhat.com This patch (of 3): Let's clean that up a bit and prepare for depending on CONFIG_SPLIT_PMD_PTLOCKS in other Kconfig options. More cleanups would be reasonable (like the arch-specific "depends on" for CONFIG_SPLIT_PTE_PTLOCKS), but we'll leave that for another day. Link: https://lkml.kernel.org/r/20240726150728.3159964-1-david@redhat.com Link: https://lkml.kernel.org/r/20240726150728.3159964-2-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Russell King (Oracle) Reviewed-by: Qi Zheng Cc: Alexander Viro Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Christian Brauner Cc: Christophe Leroy Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Juergen Gross Cc: Michael Ellerman Cc: Muchun Song Cc: "Naveen N. Rao" Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Thomas Gleixner Signed-off-by: Andrew Morton --- arch/arm/mm/fault-armv.c | 6 +++--- arch/x86/xen/mmu_pv.c | 7 ++++--- include/linux/mm.h | 8 ++++---- include/linux/mm_types.h | 2 +- include/linux/mm_types_task.h | 3 --- kernel/fork.c | 4 ++-- mm/Kconfig | 18 +++++++++++------- mm/memory.c | 2 +- 8 files changed, 26 insertions(+), 24 deletions(-) diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c index 2286c2ea60ec..831793cd6ff9 100644 --- a/arch/arm/mm/fault-armv.c +++ b/arch/arm/mm/fault-armv.c @@ -61,7 +61,7 @@ static int do_adjust_pte(struct vm_area_struct *vma, unsigned long address, return ret; } -#if USE_SPLIT_PTE_PTLOCKS +#if defined(CONFIG_SPLIT_PTE_PTLOCKS) /* * If we are using split PTE locks, then we need to take the page * lock here. 
Otherwise we are using shared mm->page_table_lock @@ -80,10 +80,10 @@ static inline void do_pte_unlock(spinlock_t *ptl) { spin_unlock(ptl); } -#else /* !USE_SPLIT_PTE_PTLOCKS */ +#else /* !defined(CONFIG_SPLIT_PTE_PTLOCKS) */ static inline void do_pte_lock(spinlock_t *ptl) {} static inline void do_pte_unlock(spinlock_t *ptl) {} -#endif /* USE_SPLIT_PTE_PTLOCKS */ +#endif /* defined(CONFIG_SPLIT_PTE_PTLOCKS) */ static int adjust_pte(struct vm_area_struct *vma, unsigned long address, unsigned long pfn) diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index f1ce39d6d32c..f4a316894bbb 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -665,7 +665,7 @@ static spinlock_t *xen_pte_lock(struct page *page, struct mm_struct *mm) { spinlock_t *ptl = NULL; -#if USE_SPLIT_PTE_PTLOCKS +#if defined(CONFIG_SPLIT_PTE_PTLOCKS) ptl = ptlock_ptr(page_ptdesc(page)); spin_lock_nest_lock(ptl, &mm->page_table_lock); #endif @@ -1553,7 +1553,8 @@ static inline void xen_alloc_ptpage(struct mm_struct *mm, unsigned long pfn, __set_pfn_prot(pfn, PAGE_KERNEL_RO); - if (level == PT_PTE && USE_SPLIT_PTE_PTLOCKS && !pinned) + if (level == PT_PTE && IS_ENABLED(CONFIG_SPLIT_PTE_PTLOCKS) && + !pinned) __pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn); xen_mc_issue(XEN_LAZY_MMU); @@ -1581,7 +1582,7 @@ static inline void xen_release_ptpage(unsigned long pfn, unsigned level) if (pinned) { xen_mc_batch(); - if (level == PT_PTE && USE_SPLIT_PTE_PTLOCKS) + if (level == PT_PTE && IS_ENABLED(CONFIG_SPLIT_PTE_PTLOCKS)) __pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, pfn); __set_pfn_prot(pfn, PAGE_KERNEL); diff --git a/include/linux/mm.h b/include/linux/mm.h index 6c2078ca3391..b7000fa300f3 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2891,7 +2891,7 @@ static inline void pagetable_free(struct ptdesc *pt) __free_pages(page, compound_order(page)); } -#if USE_SPLIT_PTE_PTLOCKS +#if defined(CONFIG_SPLIT_PTE_PTLOCKS) #if ALLOC_SPLIT_PTLOCKS void __init ptlock_cache_init(void); bool ptlock_alloc(struct ptdesc *ptdesc); @@ -2949,7 +2949,7 @@ static inline bool ptlock_init(struct ptdesc *ptdesc) return true; } -#else /* !USE_SPLIT_PTE_PTLOCKS */ +#else /* !defined(CONFIG_SPLIT_PTE_PTLOCKS) */ /* * We use mm->page_table_lock to guard all pagetable pages of the mm. */ @@ -2964,7 +2964,7 @@ static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte) static inline void ptlock_cache_init(void) {} static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; } static inline void ptlock_free(struct ptdesc *ptdesc) {} -#endif /* USE_SPLIT_PTE_PTLOCKS */ +#endif /* defined(CONFIG_SPLIT_PTE_PTLOCKS) */ static inline bool pagetable_pte_ctor(struct ptdesc *ptdesc) { @@ -3024,7 +3024,7 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? 
\ NULL: pte_offset_kernel(pmd, address)) -#if USE_SPLIT_PMD_PTLOCKS +#if defined(CONFIG_SPLIT_PMD_PTLOCKS) static inline struct page *pmd_pgtable_page(pmd_t *pmd) { diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 485424979254..165c58b12ccc 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -947,7 +947,7 @@ struct mm_struct { #ifdef CONFIG_MMU_NOTIFIER struct mmu_notifier_subscriptions *notifier_subscriptions; #endif -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS) pgtable_t pmd_huge_pte; /* protected by page_table_lock */ #endif #ifdef CONFIG_NUMA_BALANCING diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h index a2f6179b672b..bff5706b76e1 100644 --- a/include/linux/mm_types_task.h +++ b/include/linux/mm_types_task.h @@ -16,9 +16,6 @@ #include #endif -#define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS) -#define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \ - IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK)) #define ALLOC_SPLIT_PTLOCKS (SPINLOCK_SIZE > BITS_PER_LONG/8) /* diff --git a/kernel/fork.c b/kernel/fork.c index cc760491f201..3d590e51ce84 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -832,7 +832,7 @@ static void check_mm(struct mm_struct *mm) pr_alert("BUG: non-zero pgtables_bytes on freeing mm: %ld\n", mm_pgtables_bytes(mm)); -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS) VM_BUG_ON_MM(mm->pmd_huge_pte, mm); #endif } @@ -1276,7 +1276,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, RCU_INIT_POINTER(mm->exe_file, NULL); mmu_notifier_subscriptions_init(mm); init_tlb_flush_pending(mm); -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS) mm->pmd_huge_pte = NULL; #endif mm_init_uprobes_state(mm); diff --git a/mm/Kconfig b/mm/Kconfig index b72e7d040f78..7b716ac80272 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -585,17 +585,21 @@ config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE # at the same time (e.g. copy_page_range()). # DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC spinlock_t also enlarge struct page. 
# -config SPLIT_PTLOCK_CPUS - int - default "999999" if !MMU - default "999999" if ARM && !CPU_CACHE_VIPT - default "999999" if PARISC && !PA20 - default "999999" if SPARC32 - default "4" +config SPLIT_PTE_PTLOCKS + def_bool y + depends on MMU + depends on NR_CPUS >= 4 + depends on !ARM || CPU_CACHE_VIPT + depends on !PARISC || PA20 + depends on !SPARC32 config ARCH_ENABLE_SPLIT_PMD_PTLOCK bool +config SPLIT_PMD_PTLOCKS + def_bool y + depends on SPLIT_PTE_PTLOCKS && ARCH_ENABLE_SPLIT_PMD_PTLOCK + # # support for memory balloon config MEMORY_BALLOON diff --git a/mm/memory.c b/mm/memory.c index a3d955098a19..71236ae6fd14 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -6571,7 +6571,7 @@ long copy_folio_from_user(struct folio *dst_folio, } #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */ -#if USE_SPLIT_PTE_PTLOCKS && ALLOC_SPLIT_PTLOCKS +#if defined(CONFIG_SPLIT_PTE_PTLOCKS) && ALLOC_SPLIT_PTLOCKS static struct kmem_cache *page_ptl_cachep; From 188cac58a8bcdf82c7f63275b68f7a46871e45d6 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 26 Jul 2024 17:07:27 +0200 Subject: [PATCH 034/416] mm/hugetlb: enforce that PMD PT sharing has split PMD PT locks Sharing page tables between processes but falling back to per-MM page table locks cannot possibly work. So, let's make sure that we do have split PMD locks by adding a new Kconfig option and letting that depend on CONFIG_SPLIT_PMD_PTLOCKS. Link: https://lkml.kernel.org/r/20240726150728.3159964-3-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Mike Rapoport (Microsoft) Cc: Alexander Viro Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Christian Brauner Cc: Christophe Leroy Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Juergen Gross Cc: Michael Ellerman Cc: Muchun Song Cc: "Naveen N. 
Rao" Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Russell King Cc: Thomas Gleixner Signed-off-by: Andrew Morton --- fs/Kconfig | 4 ++++ include/linux/hugetlb.h | 5 ++--- mm/hugetlb.c | 8 ++++---- 3 files changed, 10 insertions(+), 7 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index a46b0cbc4d8f..0e4efec1d92e 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -288,6 +288,10 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP depends on SPARSEMEM_VMEMMAP +config HUGETLB_PMD_PAGE_TABLE_SHARING + def_bool HUGETLB_PAGE + depends on ARCH_WANT_HUGE_PMD_SHARE && SPLIT_PMD_PTLOCKS + config ARCH_HAS_GIGANTIC_PAGE bool diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 45bf05ad5c53..9b7bcfce6920 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -1251,7 +1251,7 @@ static inline __init void hugetlb_cma_reserve(int order) } #endif -#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE +#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING static inline bool hugetlb_pmd_shared(pte_t *pte) { return page_count(virt_to_page(pte)) > 1; @@ -1287,8 +1287,7 @@ bool __vma_private_lock(struct vm_area_struct *vma); static inline pte_t * hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz) { -#if defined(CONFIG_HUGETLB_PAGE) && \ - defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP) +#if defined(CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING) && defined(CONFIG_LOCKDEP) struct hugetlb_vma_lock *vma_lock = vma->vm_private_data; /* diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5a32157ca309..1fdd9eab240c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -7211,7 +7211,7 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end, return 0; } -#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE +#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING static unsigned long page_table_shareable(struct vm_area_struct *svma, struct vm_area_struct *vma, unsigned long addr, pgoff_t idx) @@ -7373,7 +7373,7 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma, return 1; } -#else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */ +#else /* !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, pud_t *pud) @@ -7396,7 +7396,7 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr) { return false; } -#endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */ +#endif /* CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */ #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, @@ -7494,7 +7494,7 @@ unsigned long hugetlb_mask_last_page(struct hstate *h) /* See description above. Architectures can provide their own version. */ __weak unsigned long hugetlb_mask_last_page(struct hstate *h) { -#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE +#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING if (huge_page_size(h) == PMD_SIZE) return PUD_SIZE - PMD_SIZE; #endif From 073ebebd186250038a04be9693240f90f398c4b1 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 26 Jul 2024 17:07:28 +0200 Subject: [PATCH 035/416] powerpc/8xx: document and enforce that split PT locks are not used Right now, we cannot have split PT locks because 8xx does not support SMP. But for the sake of documentation *why* 8xx is fine regarding what we documented in huge_pte_lockptr(), let's just add code to enforce it at the same time as documenting it. 
This should also make everybody who wants to copy from the 8xx approach of supporting such unusual ways of mapping hugetlb folios aware that it gets tricky once multiple page tables are involved. Link: https://lkml.kernel.org/r/20240726150728.3159964-4-david@redhat.com Signed-off-by: David Hildenbrand Cc: Alexander Viro Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Christian Brauner Cc: Christophe Leroy Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Juergen Gross Cc: Michael Ellerman Cc: Mike Rapoport (Microsoft) Cc: Muchun Song Cc: "Naveen N. Rao" Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Peter Xu Cc: Russell King Cc: Thomas Gleixner Signed-off-by: Andrew Morton --- arch/powerpc/mm/pgtable.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index ab0656115424..7316396e452d 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -297,6 +297,12 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma, } #if defined(CONFIG_PPC_8xx) + +#if defined(CONFIG_SPLIT_PTE_PTLOCKS) || defined(CONFIG_SPLIT_PMD_PTLOCKS) +/* We need the same lock to protect the PMD table and the two PTE tables. */ +#error "8M hugetlb folios are incompatible with split page table locks" +#endif + static void __set_huge_pte_at(pmd_t *pmd, pte_t *ptep, pte_basic_t val) { pte_basic_t *entry = (pte_basic_t *)ptep; From 592c9330e3692c9636adddfe17004870ae8dc434 Mon Sep 17 00:00:00 2001 From: Thorsten Blum Date: Fri, 26 Jul 2024 15:12:46 +0200 Subject: [PATCH 036/416] lib: test_hmm: use min() to improve dmirror_exclusive() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Use min() to simplify the dmirror_exclusive() function and improve its readability. Link: https://lkml.kernel.org/r/20240726131245.161695-1-thorsten.blum@toblux.com Signed-off-by: Thorsten Blum Reviewed-by: David Hildenbrand Cc: Jérôme Glisse Signed-off-by: Andrew Morton --- lib/test_hmm.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/lib/test_hmm.c b/lib/test_hmm.c index ee20e1f9bae9..056f2e411d7b 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -799,10 +799,7 @@ static int dmirror_exclusive(struct dmirror *dmirror, unsigned long mapped = 0; int i; - if (end < addr + (ARRAY_SIZE(pages) << PAGE_SHIFT)) - next = end; - else - next = addr + (ARRAY_SIZE(pages) << PAGE_SHIFT); + next = min(end, addr + (ARRAY_SIZE(pages) << PAGE_SHIFT)); ret = make_device_exclusive_range(mm, addr, next, pages, NULL); /* From e5a41fc77771c7c7f6b9bef85a02b38b802b7371 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 29 Jul 2024 20:38:42 +0200 Subject: [PATCH 037/416] mm: simplify arch_make_folio_accessible() Patch series "mm: remove arch_make_page_accessible()". Now that s390x implements arch_make_folio_accessible(), let's convert remaining users to use arch_make_folio_accessible() instead so we can remove arch_make_page_accessible(). This patch (of 3): Now that s390x implements HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE, let's turn generic arch_make_folio_accessible() into a NOP: there are no other targets that implement HAVE_ARCH_MAKE_PAGE_ACCESSIBLE but not HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE. 
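Condensed, the arrangement across the three patches is that s390 advertises a real implementation while the generic fallback simply reports success:

	/* arch/s390/include/asm/page.h */
	int arch_make_folio_accessible(struct folio *folio);
	#define HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE

	/* include/linux/mm.h */
	#ifndef HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE
	static inline int arch_make_folio_accessible(struct folio *folio)
	{
		return 0;
	}
	#endif
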
Link: https://lkml.kernel.org/r/20240729183844.388481-1-david@redhat.com Link: https://lkml.kernel.org/r/20240729183844.388481-2-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Matthew Wilcox (Oracle) Reviewed-by: Vishal Moola (Oracle) Reviewed-by: Claudio Imbrenda Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: David Hildenbrand Cc: Heiko Carstens Cc: Janosch Frank Cc: Sven Schnelle Cc: Vasily Gorbik Signed-off-by: Andrew Morton --- include/linux/mm.h | 11 +---------- 1 file changed, 1 insertion(+), 10 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index b7000fa300f3..fc7bd3864bcd 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2215,16 +2215,7 @@ static inline int arch_make_page_accessible(struct page *page) #ifndef HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE static inline int arch_make_folio_accessible(struct folio *folio) { - int ret; - long i, nr = folio_nr_pages(folio); - - for (i = 0; i < nr; i++) { - ret = arch_make_page_accessible(folio_page(folio, i)); - if (ret) - break; - } - - return ret; + return 0; } #endif From b967c64890d2938f16606d82ac6afcf25ed0129a Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 29 Jul 2024 20:38:43 +0200 Subject: [PATCH 038/416] mm/gup: convert to arch_make_folio_accessible() Let's use arch_make_folio_accessible() instead so we can get rid of arch_make_page_accessible(). Link: https://lkml.kernel.org/r/20240729183844.388481-3-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Matthew Wilcox (Oracle) Reviewed-by: Vishal Moola (Oracle) Reviewed-by: Claudio Imbrenda Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Heiko Carstens Cc: Janosch Frank Cc: Sven Schnelle Cc: Vasily Gorbik Signed-off-by: Andrew Morton --- mm/gup.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 120740cf5a34..3e8484c893aa 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -819,6 +819,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, struct dev_pagemap **pgmap) { struct mm_struct *mm = vma->vm_mm; + struct folio *folio; struct page *page; spinlock_t *ptl; pte_t *ptep, pte; @@ -876,6 +877,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, goto out; } } + folio = page_folio(page); if (!pte_write(pte) && gup_must_unshare(vma, flags, page)) { page = ERR_PTR(-EMLINK); @@ -886,7 +888,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, !PageAnonExclusive(page), page); /* try_grab_folio() does nothing unless FOLL_GET or FOLL_PIN is set. */ - ret = try_grab_folio(page_folio(page), 1, flags); + ret = try_grab_folio(folio, 1, flags); if (unlikely(ret)) { page = ERR_PTR(ret); goto out; @@ -898,7 +900,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, * Documentation/core-api/pin_user_pages.rst for details. */ if (flags & FOLL_PIN) { - ret = arch_make_page_accessible(page); + ret = arch_make_folio_accessible(folio); if (ret) { unpin_user_page(page); page = ERR_PTR(ret); @@ -2919,7 +2921,7 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, * details. 
*/ if (flags & FOLL_PIN) { - ret = arch_make_page_accessible(page); + ret = arch_make_folio_accessible(folio); if (ret) { gup_put_folio(folio, 1, flags); goto pte_unmap; From 3290ef3c7f2a8171fc534e02fb95a512eeae689a Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 29 Jul 2024 20:38:44 +0200 Subject: [PATCH 039/416] s390/uv: drop arch_make_page_accessible() All code was converted to using arch_make_folio_accessible(), let's drop arch_make_page_accessible(). Link: https://lkml.kernel.org/r/20240729183844.388481-4-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Matthew Wilcox (Oracle) Reviewed-by: Vishal Moola (Oracle) Reviewed-by: Claudio Imbrenda Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Heiko Carstens Cc: Janosch Frank Cc: Sven Schnelle Cc: Vasily Gorbik Signed-off-by: Andrew Morton --- arch/s390/include/asm/page.h | 2 -- arch/s390/kernel/uv.c | 5 ----- include/linux/mm.h | 7 ------- 3 files changed, 14 deletions(-) diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h index 16e4caa931f1..73e1e03317b4 100644 --- a/arch/s390/include/asm/page.h +++ b/arch/s390/include/asm/page.h @@ -176,8 +176,6 @@ static inline int devmem_is_allowed(unsigned long pfn) int arch_make_folio_accessible(struct folio *folio); #define HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE -int arch_make_page_accessible(struct page *page); -#define HAVE_ARCH_MAKE_PAGE_ACCESSIBLE struct vm_layout { unsigned long kaslr_offset; diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c index 36db065c7cf7..35ed2aea8891 100644 --- a/arch/s390/kernel/uv.c +++ b/arch/s390/kernel/uv.c @@ -548,11 +548,6 @@ int arch_make_folio_accessible(struct folio *folio) } EXPORT_SYMBOL_GPL(arch_make_folio_accessible); -int arch_make_page_accessible(struct page *page) -{ - return arch_make_folio_accessible(page_folio(page)); -} -EXPORT_SYMBOL_GPL(arch_make_page_accessible); static ssize_t uv_query_facilities(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { diff --git a/include/linux/mm.h b/include/linux/mm.h index fc7bd3864bcd..153a4662a0cd 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2205,13 +2205,6 @@ static inline bool folio_likely_mapped_shared(struct folio *folio) return atomic_read(&folio->_mapcount) > 0; } -#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE -static inline int arch_make_page_accessible(struct page *page) -{ - return 0; -} -#endif - #ifndef HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE static inline int arch_make_folio_accessible(struct folio *folio) { From c6f53ed8f213a66ae8bc40aa9112c32412c35a21 Mon Sep 17 00:00:00 2001 From: David Finkel Date: Mon, 29 Jul 2024 10:37:42 -0400 Subject: [PATCH 040/416] mm, memcg: cg2 memory{.swap,}.peak write handlers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7. This patch (of 2): Other mechanisms for querying the peak memory usage of either a process or v1 memory cgroup allow for resetting the high watermark. Restore parity with those mechanisms, but with a less racy API. For example: - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets the high watermark. - writing "5" to the clear_refs pseudo-file in a processes's proc directory resets the peak RSS. This change is an evolution of a previous patch, which mostly copied the cgroup v1 behavior, however, there were concerns about races/ownership issues with a global reset, so instead this change makes the reset filedescriptor-local. 
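In userspace terms, the interface described in the following paragraphs is expected to be used roughly as sketched here (the cgroup path is illustrative, not part of this patch):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		const char *peak = "/sys/fs/cgroup/workers/job0/memory.peak";
		char buf[64];
		ssize_t n;
		int fd = open(peak, O_RDWR);

		if (fd < 0)
			return 1;
		/* any non-empty write arms an FD-local watermark at current usage */
		if (write(fd, "reset\n", 6) != 6)
			return 1;
		/* ... run one work item ... */
		n = pread(fd, buf, sizeof(buf) - 1, 0);
		if (n <= 0)
			return 1;
		buf[n] = '\0';
		printf("peak since reset: %s", buf);	/* bytes */
		close(fd);
		return 0;
	}

The selftest changes later in the series exercise this same open/write/reread sequence with several file descriptors open concurrently.
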
Writing any non-empty string to the memory.peak and memory.swap.peak pseudo-files reset the high watermark to the current usage for subsequent reads through that same FD. Notably, following Johannes's suggestion, this implementation moves the O(FDs that have written) behavior onto the FD write(2) path. Instead, on the page-allocation path, we simply add one additional watermark to conditionally bump per-hierarchy level in the page-counter. Additionally, this takes Longman's suggestion of nesting the page-charging-path checks for the two watermarks to reduce the number of common-case comparisons. This behavior is particularly useful for work scheduling systems that need to track memory usage of worker processes/cgroups per-work-item. Since memory can't be squeezed like CPU can (the OOM-killer has opinions), these systems need to track the peak memory usage to compute system/container fullness when binpacking workitems. Most notably, Vimeo's use-case involves a system that's doing global binpacking across many Kubernetes pods/containers, and while we can use PSI for some local decisions about overload, we strive to avoid packing workloads too tightly in the first place. To facilitate this, we track the peak memory usage. However, since we run with long-lived workers (to amortize startup costs) we need a way to track the high watermark while a work-item is executing. Polling runs the risk of missing short spikes that last for timescales below the polling interval, and peak memory tracking at the cgroup level is otherwise perfect for this use-case. As this data is used to ensure that binpacked work ends up with sufficient headroom, this use-case mostly avoids the inaccuracies surrounding reclaimable memory. Link: https://lkml.kernel.org/r/20240730231304.761942-1-davidf@vimeo.com Link: https://lkml.kernel.org/r/20240729143743.34236-1-davidf@vimeo.com Link: https://lkml.kernel.org/r/20240729143743.34236-2-davidf@vimeo.com Signed-off-by: David Finkel Suggested-by: Johannes Weiner Suggested-by: Waiman Long Acked-by: Johannes Weiner Reviewed-by: Michal Koutný Acked-by: Tejun Heo Reviewed-by: Roman Gushchin Cc: Jonathan Corbet Cc: Michal Hocko Cc: Muchun Song Cc: Shakeel Butt Cc: Shuah Khan Cc: Zefan Li Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v2.rst | 22 +++-- include/linux/cgroup-defs.h | 5 + include/linux/cgroup.h | 3 + include/linux/memcontrol.h | 5 + include/linux/page_counter.h | 11 ++- kernel/cgroup/cgroup-internal.h | 2 + kernel/cgroup/cgroup.c | 7 ++ mm/memcontrol.c | 118 +++++++++++++++++++++--- mm/page_counter.c | 29 ++++-- 9 files changed, 174 insertions(+), 28 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 86311c2907cd..f0499884124d 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1333,11 +1333,14 @@ The following nested keys are defined. all the existing limitations and potential future extensions. memory.peak - A read-only single value file which exists on non-root - cgroups. + A read-write single value file which exists on non-root cgroups. - The max memory usage recorded for the cgroup and its - descendants since the creation of the cgroup. + The max memory usage recorded for the cgroup and its descendants since + either the creation of the cgroup or the most recent reset for that FD. + + A write of any non-empty string to this file resets it to the + current memory usage for subsequent reads through the same + file descriptor. 
memory.oom.group A read-write single value file which exists on non-root @@ -1663,11 +1666,14 @@ The following nested keys are defined. Healthy workloads are not expected to reach this limit. memory.swap.peak - A read-only single value file which exists on non-root - cgroups. + A read-write single value file which exists on non-root cgroups. - The max swap usage recorded for the cgroup and its - descendants since the creation of the cgroup. + The max swap usage recorded for the cgroup and its descendants since + the creation of the cgroup or the most recent reset for that FD. + + A write of any non-empty string to this file resets it to the + current memory usage for subsequent reads through the same + file descriptor. memory.swap.max A read-write single value file which exists on non-root diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index ae04035b6cbe..7fc2d0195f56 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -775,6 +775,11 @@ struct cgroup_subsys { extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem; +struct cgroup_of_peak { + unsigned long value; + struct list_head list; +}; + /** * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups * @tsk: target task diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index c60ba0ab1462..3e0563753cc3 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -11,6 +11,7 @@ #include #include +#include #include #include #include @@ -854,4 +855,6 @@ static inline void cgroup_bpf_put(struct cgroup *cgrp) {} struct cgroup *task_get_cgroup1(struct task_struct *tsk, int hierarchy_id); +struct cgroup_of_peak *of_peak(struct kernfs_open_file *of); + #endif /* _LINUX_CGROUP_H */ diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index af7da7bd00af..1b79760af685 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -193,6 +193,11 @@ struct mem_cgroup { struct page_counter memsw; /* v1 only */ }; + /* registered local peak watchers */ + struct list_head memory_peaks; + struct list_head swap_peaks; + spinlock_t peaks_lock; + /* Range enforcement for interrupt charges */ struct work_struct high_work; diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h index 66ebf9a73158..79dbd8bc35a7 100644 --- a/include/linux/page_counter.h +++ b/include/linux/page_counter.h @@ -26,6 +26,8 @@ struct page_counter { atomic_long_t children_low_usage; unsigned long watermark; + /* Latest cg2 reset watermark */ + unsigned long local_watermark; unsigned long failcnt; /* Keep all the read most fields in a separete cacheline. 
*/ @@ -84,7 +86,14 @@ int page_counter_memparse(const char *buf, const char *max, static inline void page_counter_reset_watermark(struct page_counter *counter) { - counter->watermark = page_counter_read(counter); + unsigned long usage = page_counter_read(counter); + + /* + * Update local_watermark first, so it's always <= watermark + * (modulo CPU/compiler re-ordering) + */ + counter->local_watermark = usage; + counter->watermark = usage; } #ifdef CONFIG_MEMCG diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index 520b90dd97ec..c964dd7ff967 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -81,6 +81,8 @@ struct cgroup_file_ctx { struct { struct cgroup_pidlist *pidlist; } procs1; + + struct cgroup_of_peak peak; }; /* diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index c8e4b62b436a..0a97cb2ef124 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1972,6 +1972,13 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param return -EINVAL; } +struct cgroup_of_peak *of_peak(struct kernfs_open_file *of) +{ + struct cgroup_file_ctx *ctx = of->priv; + + return &ctx->peak; +} + static void apply_cgroup_root_flags(unsigned int root_flags) { if (current->nsproxy->cgroup_ns == &init_cgroup_ns) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f92578f13b2e..8971d3473a7b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -25,6 +25,7 @@ * Copyright (C) 2020 Alibaba, Inc, Alex Shi */ +#include #include #include #include @@ -41,6 +42,7 @@ #include #include #include +#include #include #include #include @@ -3550,6 +3552,9 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) INIT_WORK(&memcg->high_work, high_work_func); vmpressure_init(&memcg->vmpressure); + INIT_LIST_HEAD(&memcg->memory_peaks); + INIT_LIST_HEAD(&memcg->swap_peaks); + spin_lock_init(&memcg->peaks_lock); memcg->socket_pressure = jiffies; memcg1_memcg_init(memcg); memcg->kmemcg_id = -1; @@ -3944,14 +3949,91 @@ static u64 memory_current_read(struct cgroup_subsys_state *css, return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE; } -static u64 memory_peak_read(struct cgroup_subsys_state *css, - struct cftype *cft) -{ - struct mem_cgroup *memcg = mem_cgroup_from_css(css); +#define OFP_PEAK_UNSET (((-1UL))) - return (u64)memcg->memory.watermark * PAGE_SIZE; +static int peak_show(struct seq_file *sf, void *v, struct page_counter *pc) +{ + struct cgroup_of_peak *ofp = of_peak(sf->private); + u64 fd_peak = READ_ONCE(ofp->value), peak; + + /* User wants global or local peak? 
*/ + if (fd_peak == OFP_PEAK_UNSET) + peak = pc->watermark; + else + peak = max(fd_peak, READ_ONCE(pc->local_watermark)); + + seq_printf(sf, "%llu\n", peak * PAGE_SIZE); + return 0; } +static int memory_peak_show(struct seq_file *sf, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(sf)); + + return peak_show(sf, v, &memcg->memory); +} + +static int peak_open(struct kernfs_open_file *of) +{ + struct cgroup_of_peak *ofp = of_peak(of); + + ofp->value = OFP_PEAK_UNSET; + return 0; +} + +static void peak_release(struct kernfs_open_file *of) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + struct cgroup_of_peak *ofp = of_peak(of); + + if (ofp->value == OFP_PEAK_UNSET) { + /* fast path (no writes on this fd) */ + return; + } + spin_lock(&memcg->peaks_lock); + list_del(&ofp->list); + spin_unlock(&memcg->peaks_lock); +} + +static ssize_t peak_write(struct kernfs_open_file *of, char *buf, size_t nbytes, + loff_t off, struct page_counter *pc, + struct list_head *watchers) +{ + unsigned long usage; + struct cgroup_of_peak *peer_ctx; + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + struct cgroup_of_peak *ofp = of_peak(of); + + spin_lock(&memcg->peaks_lock); + + usage = page_counter_read(pc); + WRITE_ONCE(pc->local_watermark, usage); + + list_for_each_entry(peer_ctx, watchers, list) + if (usage > peer_ctx->value) + WRITE_ONCE(peer_ctx->value, usage); + + /* initial write, register watcher */ + if (ofp->value == -1) + list_add(&ofp->list, watchers); + + WRITE_ONCE(ofp->value, usage); + spin_unlock(&memcg->peaks_lock); + + return nbytes; +} + +static ssize_t memory_peak_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + + return peak_write(of, buf, nbytes, off, &memcg->memory, + &memcg->memory_peaks); +} + +#undef OFP_PEAK_UNSET + static int memory_min_show(struct seq_file *m, void *v) { return seq_puts_memcg_tunable(m, @@ -4301,7 +4383,10 @@ static struct cftype memory_files[] = { { .name = "peak", .flags = CFTYPE_NOT_ON_ROOT, - .read_u64 = memory_peak_read, + .open = peak_open, + .release = peak_release, + .seq_show = memory_peak_show, + .write = memory_peak_write, }, { .name = "min", @@ -5093,12 +5178,20 @@ static u64 swap_current_read(struct cgroup_subsys_state *css, return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE; } -static u64 swap_peak_read(struct cgroup_subsys_state *css, - struct cftype *cft) +static int swap_peak_show(struct seq_file *sf, void *v) { - struct mem_cgroup *memcg = mem_cgroup_from_css(css); + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(sf)); - return (u64)memcg->swap.watermark * PAGE_SIZE; + return peak_show(sf, v, &memcg->swap); +} + +static ssize_t swap_peak_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + + return peak_write(of, buf, nbytes, off, &memcg->swap, + &memcg->swap_peaks); } static int swap_high_show(struct seq_file *m, void *v) @@ -5182,7 +5275,10 @@ static struct cftype swap_files[] = { { .name = "swap.peak", .flags = CFTYPE_NOT_ON_ROOT, - .read_u64 = swap_peak_read, + .open = peak_open, + .release = peak_release, + .seq_show = swap_peak_show, + .write = swap_peak_write, }, { .name = "swap.events", diff --git a/mm/page_counter.c b/mm/page_counter.c index 3887bd152b75..b249d15af9dd 100644 --- a/mm/page_counter.c +++ b/mm/page_counter.c @@ -87,9 +87,22 @@ void page_counter_charge(struct page_counter *counter, unsigned 
long nr_pages) /* * This is indeed racy, but we can live with some * inaccuracy in the watermark. + * + * Notably, we have two watermarks to allow for both a globally + * visible peak and one that can be reset at a smaller scope. + * + * Since we reset both watermarks when the global reset occurs, + * we can guarantee that watermark >= local_watermark, so we + * don't need to do both comparisons every time. + * + * On systems with branch predictors, the inner condition should + * be almost free. */ - if (new > READ_ONCE(c->watermark)) - WRITE_ONCE(c->watermark, new); + if (new > READ_ONCE(c->local_watermark)) { + WRITE_ONCE(c->local_watermark, new); + if (new > READ_ONCE(c->watermark)) + WRITE_ONCE(c->watermark, new); + } } } @@ -140,12 +153,12 @@ bool page_counter_try_charge(struct page_counter *counter, if (protection) propagate_protected_usage(c, new); - /* - * Just like with failcnt, we can live with some - * inaccuracy in the watermark. - */ - if (new > READ_ONCE(c->watermark)) - WRITE_ONCE(c->watermark, new); + /* see comment on page_counter_charge */ + if (new > READ_ONCE(c->local_watermark)) { + WRITE_ONCE(c->local_watermark, new); + if (new > READ_ONCE(c->watermark)) + WRITE_ONCE(c->watermark, new); + } } return true; From d075bccec082be326e08d8b4f7cd491029438f0b Mon Sep 17 00:00:00 2001 From: David Finkel Date: Mon, 29 Jul 2024 10:37:43 -0400 Subject: [PATCH 041/416] mm, memcg: cg2 memory{.swap,}.peak write tests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Extend two existing tests to cover extracting memory usage through the newly mutable memory.peak and memory.swap.peak handlers. In particular, make sure to exercise adding and removing watchers with overlapping lifetimes so the less-trivial logic gets tested. The new/updated tests attempt to detect a lack of the write handler by fstat'ing the memory.peak and memory.swap.peak files and skip the tests if that's the case. Additionally, skip if the file doesn't exist at all. [davidf@vimeo.com: update tests] Link: https://lkml.kernel.org/r/20240730231304.761942-3-davidf@vimeo.com Link: https://lkml.kernel.org/r/20240729143743.34236-3-davidf@vimeo.com Signed-off-by: David Finkel Acked-by: Tejun Heo Reviewed-by: Roman Gushchin Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Shakeel Butt Cc: Shuah Khan Cc: Waiman Long Cc: Zefan Li Signed-off-by: Andrew Morton --- tools/testing/selftests/cgroup/cgroup_util.c | 22 ++ tools/testing/selftests/cgroup/cgroup_util.h | 2 + .../selftests/cgroup/test_memcontrol.c | 264 +++++++++++++++++- 3 files changed, 280 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c index 432db923bced..1e2d46636a0c 100644 --- a/tools/testing/selftests/cgroup/cgroup_util.c +++ b/tools/testing/selftests/cgroup/cgroup_util.c @@ -141,6 +141,16 @@ long cg_read_long(const char *cgroup, const char *control) return atol(buf); } +long cg_read_long_fd(int fd) +{ + char buf[128]; + + if (pread(fd, buf, sizeof(buf), 0) <= 0) + return -1; + + return atol(buf); +} + long cg_read_key_long(const char *cgroup, const char *control, const char *key) { char buf[PAGE_SIZE]; @@ -183,6 +193,18 @@ int cg_write(const char *cgroup, const char *control, char *buf) return ret == len ? 0 : ret; } +/* + * Returns fd on success, or -1 on failure. 
+ * (fd should be closed with close() as usual) + */ +int cg_open(const char *cgroup, const char *control, int flags) +{ + char path[PATH_MAX]; + + snprintf(path, sizeof(path), "%s/%s", cgroup, control); + return open(path, flags); +} + int cg_write_numeric(const char *cgroup, const char *control, long value) { char buf[64]; diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h index e8d04ac9e3d2..19b131ee7707 100644 --- a/tools/testing/selftests/cgroup/cgroup_util.h +++ b/tools/testing/selftests/cgroup/cgroup_util.h @@ -34,9 +34,11 @@ extern int cg_read_strcmp(const char *cgroup, const char *control, extern int cg_read_strstr(const char *cgroup, const char *control, const char *needle); extern long cg_read_long(const char *cgroup, const char *control); +extern long cg_read_long_fd(int fd); long cg_read_key_long(const char *cgroup, const char *control, const char *key); extern long cg_read_lc(const char *cgroup, const char *control); extern int cg_write(const char *cgroup, const char *control, char *buf); +extern int cg_open(const char *cgroup, const char *control, int flags); int cg_write_numeric(const char *cgroup, const char *control, long value); extern int cg_run(const char *cgroup, int (*fn)(const char *cgroup, void *arg), diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c index 41ae8047b889..16f5d74ae762 100644 --- a/tools/testing/selftests/cgroup/test_memcontrol.c +++ b/tools/testing/selftests/cgroup/test_memcontrol.c @@ -161,13 +161,16 @@ cleanup: /* * This test create a memory cgroup, allocates * some anonymous memory and some pagecache - * and check memory.current and some memory.stat values. + * and checks memory.current, memory.peak, and some memory.stat values. */ -static int test_memcg_current(const char *root) +static int test_memcg_current_peak(const char *root) { int ret = KSFT_FAIL; - long current; + long current, peak, peak_reset; char *memcg; + bool fd2_closed = false, fd3_closed = false, fd4_closed = false; + int peak_fd = -1, peak_fd2 = -1, peak_fd3 = -1, peak_fd4 = -1; + struct stat ss; memcg = cg_name(root, "memcg_test"); if (!memcg) @@ -180,15 +183,124 @@ static int test_memcg_current(const char *root) if (current != 0) goto cleanup; + peak = cg_read_long(memcg, "memory.peak"); + if (peak != 0) + goto cleanup; + if (cg_run(memcg, alloc_anon_50M_check, NULL)) goto cleanup; + peak = cg_read_long(memcg, "memory.peak"); + if (peak < MB(50)) + goto cleanup; + + /* + * We'll open a few FDs for the same memory.peak file to exercise the free-path + * We need at least three to be closed in a different order than writes occurred to test + * the linked-list handling. + */ + peak_fd = cg_open(memcg, "memory.peak", O_RDWR | O_APPEND | O_CLOEXEC); + + if (peak_fd == -1) { + if (errno == ENOENT) + ret = KSFT_SKIP; + goto cleanup; + } + + /* + * Before we try to use memory.peak's fd, try to figure out whether + * this kernel supports writing to that file in the first place. 
(by + * checking the writable bit on the file's st_mode) + */ + if (fstat(peak_fd, &ss)) + goto cleanup; + + if ((ss.st_mode & S_IWUSR) == 0) { + ret = KSFT_SKIP; + goto cleanup; + } + + peak_fd2 = cg_open(memcg, "memory.peak", O_RDWR | O_APPEND | O_CLOEXEC); + + if (peak_fd2 == -1) + goto cleanup; + + peak_fd3 = cg_open(memcg, "memory.peak", O_RDWR | O_APPEND | O_CLOEXEC); + + if (peak_fd3 == -1) + goto cleanup; + + /* any non-empty string resets, but make it clear */ + static const char reset_string[] = "reset\n"; + + peak_reset = write(peak_fd, reset_string, sizeof(reset_string)); + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + peak_reset = write(peak_fd2, reset_string, sizeof(reset_string)); + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + peak_reset = write(peak_fd3, reset_string, sizeof(reset_string)); + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + /* Make sure a completely independent read isn't affected by our FD-local reset above*/ + peak = cg_read_long(memcg, "memory.peak"); + if (peak < MB(50)) + goto cleanup; + + fd2_closed = true; + if (close(peak_fd2)) + goto cleanup; + + peak_fd4 = cg_open(memcg, "memory.peak", O_RDWR | O_APPEND | O_CLOEXEC); + + if (peak_fd4 == -1) + goto cleanup; + + peak_reset = write(peak_fd4, reset_string, sizeof(reset_string)); + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + peak = cg_read_long_fd(peak_fd); + if (peak > MB(30) || peak < 0) + goto cleanup; + if (cg_run(memcg, alloc_pagecache_50M_check, NULL)) goto cleanup; + peak = cg_read_long(memcg, "memory.peak"); + if (peak < MB(50)) + goto cleanup; + + /* Make sure everything is back to normal */ + peak = cg_read_long_fd(peak_fd); + if (peak < MB(50)) + goto cleanup; + + peak = cg_read_long_fd(peak_fd4); + if (peak < MB(50)) + goto cleanup; + + fd3_closed = true; + if (close(peak_fd3)) + goto cleanup; + + fd4_closed = true; + if (close(peak_fd4)) + goto cleanup; + ret = KSFT_PASS; cleanup: + close(peak_fd); + if (!fd2_closed) + close(peak_fd2); + if (!fd3_closed) + close(peak_fd3); + if (!fd4_closed) + close(peak_fd4); cg_destroy(memcg); free(memcg); @@ -817,13 +929,19 @@ cleanup: /* * This test checks that memory.swap.max limits the amount of - * anonymous memory which can be swapped out. + * anonymous memory which can be swapped out. Additionally, it verifies that + * memory.swap.peak reflects the high watermark and can be reset. */ -static int test_memcg_swap_max(const char *root) +static int test_memcg_swap_max_peak(const char *root) { int ret = KSFT_FAIL; char *memcg; - long max; + long max, peak; + struct stat ss; + int swap_peak_fd = -1, mem_peak_fd = -1; + + /* any non-empty string resets */ + static const char reset_string[] = "foobarbaz"; if (!is_swap_enabled()) return KSFT_SKIP; @@ -840,6 +958,61 @@ static int test_memcg_swap_max(const char *root) goto cleanup; } + swap_peak_fd = cg_open(memcg, "memory.swap.peak", + O_RDWR | O_APPEND | O_CLOEXEC); + + if (swap_peak_fd == -1) { + if (errno == ENOENT) + ret = KSFT_SKIP; + goto cleanup; + } + + /* + * Before we try to use memory.swap.peak's fd, try to figure out + * whether this kernel supports writing to that file in the first + * place. 
(by checking the writable bit on the file's st_mode) + */ + if (fstat(swap_peak_fd, &ss)) + goto cleanup; + + if ((ss.st_mode & S_IWUSR) == 0) { + ret = KSFT_SKIP; + goto cleanup; + } + + mem_peak_fd = cg_open(memcg, "memory.peak", O_RDWR | O_APPEND | O_CLOEXEC); + + if (mem_peak_fd == -1) + goto cleanup; + + if (cg_read_long(memcg, "memory.swap.peak")) + goto cleanup; + + if (cg_read_long_fd(swap_peak_fd)) + goto cleanup; + + /* switch the swap and mem fds into local-peak tracking mode*/ + int peak_reset = write(swap_peak_fd, reset_string, sizeof(reset_string)); + + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + if (cg_read_long_fd(swap_peak_fd)) + goto cleanup; + + if (cg_read_long(memcg, "memory.peak")) + goto cleanup; + + if (cg_read_long_fd(mem_peak_fd)) + goto cleanup; + + peak_reset = write(mem_peak_fd, reset_string, sizeof(reset_string)); + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + if (cg_read_long_fd(mem_peak_fd)) + goto cleanup; + if (cg_read_strcmp(memcg, "memory.max", "max\n")) goto cleanup; @@ -862,6 +1035,61 @@ static int test_memcg_swap_max(const char *root) if (cg_read_key_long(memcg, "memory.events", "oom_kill ") != 1) goto cleanup; + peak = cg_read_long(memcg, "memory.peak"); + if (peak < MB(29)) + goto cleanup; + + peak = cg_read_long(memcg, "memory.swap.peak"); + if (peak < MB(29)) + goto cleanup; + + peak = cg_read_long_fd(mem_peak_fd); + if (peak < MB(29)) + goto cleanup; + + peak = cg_read_long_fd(swap_peak_fd); + if (peak < MB(29)) + goto cleanup; + + /* + * open, reset and close the peak swap on another FD to make sure + * multiple extant fds don't corrupt the linked-list + */ + peak_reset = cg_write(memcg, "memory.swap.peak", (char *)reset_string); + if (peak_reset) + goto cleanup; + + peak_reset = cg_write(memcg, "memory.peak", (char *)reset_string); + if (peak_reset) + goto cleanup; + + /* actually reset on the fds */ + peak_reset = write(swap_peak_fd, reset_string, sizeof(reset_string)); + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + peak_reset = write(mem_peak_fd, reset_string, sizeof(reset_string)); + if (peak_reset != sizeof(reset_string)) + goto cleanup; + + peak = cg_read_long_fd(swap_peak_fd); + if (peak > MB(10)) + goto cleanup; + + /* + * The cgroup is now empty, but there may be a page or two associated + * with the open FD accounted to it. 
+ */ + peak = cg_read_long_fd(mem_peak_fd); + if (peak > MB(1)) + goto cleanup; + + if (cg_read_long(memcg, "memory.peak") < MB(29)) + goto cleanup; + + if (cg_read_long(memcg, "memory.swap.peak") < MB(29)) + goto cleanup; + if (cg_run(memcg, alloc_anon_50M_check_swap, (void *)MB(30))) goto cleanup; @@ -869,9 +1097,29 @@ static int test_memcg_swap_max(const char *root) if (max <= 0) goto cleanup; + peak = cg_read_long(memcg, "memory.peak"); + if (peak < MB(29)) + goto cleanup; + + peak = cg_read_long(memcg, "memory.swap.peak"); + if (peak < MB(29)) + goto cleanup; + + peak = cg_read_long_fd(mem_peak_fd); + if (peak < MB(29)) + goto cleanup; + + peak = cg_read_long_fd(swap_peak_fd); + if (peak < MB(19)) + goto cleanup; + ret = KSFT_PASS; cleanup: + if (mem_peak_fd != -1 && close(mem_peak_fd)) + ret = KSFT_FAIL; + if (swap_peak_fd != -1 && close(swap_peak_fd)) + ret = KSFT_FAIL; cg_destroy(memcg); free(memcg); @@ -1295,7 +1543,7 @@ struct memcg_test { const char *name; } tests[] = { T(test_memcg_subtree_control), - T(test_memcg_current), + T(test_memcg_current_peak), T(test_memcg_min), T(test_memcg_low), T(test_memcg_high), @@ -1303,7 +1551,7 @@ struct memcg_test { T(test_memcg_max), T(test_memcg_reclaim), T(test_memcg_oom_events), - T(test_memcg_swap_max), + T(test_memcg_swap_max_peak), T(test_memcg_sock), T(test_memcg_oom_group_leaf_events), T(test_memcg_oom_group_parent_events), From a17c7d8fd2b097a422fc892b660e879e19c20214 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 29 Jul 2024 12:50:35 +0100 Subject: [PATCH 042/416] userfaultfd: move core VMA manipulation logic to mm/userfaultfd.c Patch series "Make core VMA operations internal and testable", v4. There are a number of "core" VMA manipulation functions implemented in mm/mmap.c, notably those concerning VMA merging, splitting, modifying, expanding and shrinking, which logically don't belong there. More importantly this functionality represents an internal implementation detail of memory management and should not be exposed outside of mm/ itself. This patch series isolates core VMA manipulation functionality into its own file, mm/vma.c, and provides an API to the rest of the mm code in mm/vma.h. Importantly, it also carefully implements mm/vma_internal.h, which specifies which headers need to be imported by vma.c, leading to the very useful property that vma.c depends only on mm/vma.h and mm/vma_internal.h. This means we can then re-implement vma_internal.h in userland, adding shims for kernel mechanisms as required, allowing us to unit test internal VMA functionality. This testing is useful as opposed to an e.g. kunit implementation as this way we can avoid all external kernel side-effects while testing, run tests VERY quickly, and iterate on and debug problems quickly. Excitingly this opens the door to, in the future, recreating precise problems observed in production in userland and very quickly debugging problems that might otherwise be very difficult to reproduce. This patch series takes advantage of existing shim logic and full userland maple tree support contained in tools/testing/radix-tree/ and tools/include/linux/, separating out shared components of the radix tree implementation to provide this testing. Kernel functionality is stubbed and shimmed as needed in tools/testing/vma/ which contains a fully functional userland vma_internal.h file and which imports mm/vma.c and mm/vma.h to be directly tested from userland. 
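As a rough sketch of that shimming approach (the names below are illustrative placeholders, not the actual definitions shipped in the userland vma_internal.h under tools/testing/vma/), a userspace replacement header can satisfy vma.c's kernel dependencies with libc equivalents and no-op locking:

	#include <stdlib.h>

	/* Illustrative shims only; the real tools/testing/vma/ harness differs. */
	struct vm_area_struct;

	/* hypothetical stand-in for allocating a zeroed vm_area_struct */
	static inline struct vm_area_struct *shim_vm_area_alloc(size_t size)
	{
		return calloc(1, size);	/* zeroed, like a fresh slab object */
	}

	static inline void shim_vm_area_free(struct vm_area_struct *vma)
	{
		free(vma);
	}

	/* locking collapses to no-ops in a single-threaded test harness */
	#define shim_mmap_write_lock(mm)	do { (void)(mm); } while (0)
	#define shim_mmap_write_unlock(mm)	do { (void)(mm); } while (0)

Since vma.c pulls its kernel dependencies in through vma_internal.h, substituting a header along these lines is what allows the VMA code to be compiled and exercised as ordinary userspace code.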
A simple, skeleton testing implementation is provided in tools/testing/vma/vma.c as a proof-of-concept, asserting that simple VMA merge, modify (testing split), expand and shrink functionality work correctly. This patch (of 4): This patch forms part of a patch series intending to separate out VMA logic and render it testable from userspace, which requires that core manipulation functions be exposed in an mm/-internal header file. In order to do this, we must abstract APIs we wish to test, in this instance functions which ultimately invoke vma_modify(). This patch therefore moves all logic which ultimately invokes vma_modify() to mm/userfaultfd.c, trying to transfer code at a functional granularity where possible. [lorenzo.stoakes@oracle.com: fix user-after-free in userfaultfd_clear_vma()] Link: https://lkml.kernel.org/r/3c947ddc-b804-49b7-8fe9-3ea3ca13def5@lucifer.local Link: https://lkml.kernel.org/r/cover.1722251717.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/50c3ed995fd81c45876c86304c8a00bf3e396cfd.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Reviewed-by: Liam R. Howlett Cc: Alexander Viro Cc: Brendan Higgins Cc: Christian Brauner Cc: David Gow Cc: Eric W. Biederman Cc: Jan Kara Cc: Kees Cook Cc: Matthew Wilcox (Oracle) Cc: Rae Moar Cc: SeongJae Park Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Pengfei Xu Signed-off-by: Andrew Morton --- fs/userfaultfd.c | 160 +++----------------------------- include/linux/userfaultfd_k.h | 19 ++++ mm/userfaultfd.c | 168 ++++++++++++++++++++++++++++++++++ 3 files changed, 198 insertions(+), 149 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 27a3e9285fbf..b3ed7207df7e 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -104,21 +104,6 @@ bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma) return ctx->features & UFFD_FEATURE_WP_UNPOPULATED; } -static void userfaultfd_set_vm_flags(struct vm_area_struct *vma, - vm_flags_t flags) -{ - const bool uffd_wp_changed = (vma->vm_flags ^ flags) & VM_UFFD_WP; - - vm_flags_reset(vma, flags); - /* - * For shared mappings, we want to enable writenotify while - * userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply - * recalculate vma->vm_page_prot whenever userfaultfd-wp changes. 
- */ - if ((vma->vm_flags & VM_SHARED) && uffd_wp_changed) - vma_set_page_prot(vma); -} - static int userfaultfd_wake_function(wait_queue_entry_t *wq, unsigned mode, int wake_flags, void *key) { @@ -615,22 +600,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx, spin_unlock_irq(&ctx->event_wqh.lock); if (release_new_ctx) { - struct vm_area_struct *vma; - struct mm_struct *mm = release_new_ctx->mm; - VMA_ITERATOR(vmi, mm, 0); - - /* the various vma->vm_userfaultfd_ctx still points to it */ - mmap_write_lock(mm); - for_each_vma(vmi, vma) { - if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) { - vma_start_write(vma); - vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; - userfaultfd_set_vm_flags(vma, - vma->vm_flags & ~__VM_UFFD_FLAGS); - } - } - mmap_write_unlock(mm); - + userfaultfd_release_new(release_new_ctx); userfaultfd_ctx_put(release_new_ctx); } @@ -662,9 +632,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs) return 0; if (!(octx->features & UFFD_FEATURE_EVENT_FORK)) { - vma_start_write(vma); - vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; - userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS); + userfaultfd_reset_ctx(vma); return 0; } @@ -749,9 +717,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma, up_write(&ctx->map_changing_lock); } else { /* Drop uffd context if remap feature not enabled */ - vma_start_write(vma); - vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; - userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS); + userfaultfd_reset_ctx(vma); } } @@ -870,53 +836,13 @@ static int userfaultfd_release(struct inode *inode, struct file *file) { struct userfaultfd_ctx *ctx = file->private_data; struct mm_struct *mm = ctx->mm; - struct vm_area_struct *vma, *prev; /* len == 0 means wake all */ struct userfaultfd_wake_range range = { .len = 0, }; - unsigned long new_flags; - VMA_ITERATOR(vmi, mm, 0); WRITE_ONCE(ctx->released, true); - if (!mmget_not_zero(mm)) - goto wakeup; + userfaultfd_release_all(mm, ctx); - /* - * Flush page faults out of all CPUs. NOTE: all page faults - * must be retried without returning VM_FAULT_SIGBUS if - * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx - * changes while handle_userfault released the mmap_lock. So - * it's critical that released is set to true (above), before - * taking the mmap_lock for writing. 
- */ - mmap_write_lock(mm); - prev = NULL; - for_each_vma(vmi, vma) { - cond_resched(); - BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^ - !!(vma->vm_flags & __VM_UFFD_FLAGS)); - if (vma->vm_userfaultfd_ctx.ctx != ctx) { - prev = vma; - continue; - } - /* Reset ptes for the whole vma range if wr-protected */ - if (userfaultfd_wp(vma)) - uffd_wp_range(vma, vma->vm_start, - vma->vm_end - vma->vm_start, false); - new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS; - vma = vma_modify_flags_uffd(&vmi, prev, vma, vma->vm_start, - vma->vm_end, new_flags, - NULL_VM_UFFD_CTX); - - vma_start_write(vma); - userfaultfd_set_vm_flags(vma, new_flags); - vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; - - prev = vma; - } - mmap_write_unlock(mm); - mmput(mm); -wakeup: /* * After no new page faults can wait on this fault_*wqh, flush * the last page faults that may have been already waiting on @@ -1293,14 +1219,14 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, unsigned long arg) { struct mm_struct *mm = ctx->mm; - struct vm_area_struct *vma, *prev, *cur; + struct vm_area_struct *vma, *cur; int ret; struct uffdio_register uffdio_register; struct uffdio_register __user *user_uffdio_register; - unsigned long vm_flags, new_flags; + unsigned long vm_flags; bool found; bool basic_ioctls; - unsigned long start, end, vma_end; + unsigned long start, end; struct vma_iterator vmi; bool wp_async = userfaultfd_wp_async_ctx(ctx); @@ -1428,57 +1354,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, } for_each_vma_range(vmi, cur, end); BUG_ON(!found); - vma_iter_set(&vmi, start); - prev = vma_prev(&vmi); - if (vma->vm_start < start) - prev = vma; - - ret = 0; - for_each_vma_range(vmi, vma, end) { - cond_resched(); - - BUG_ON(!vma_can_userfault(vma, vm_flags, wp_async)); - BUG_ON(vma->vm_userfaultfd_ctx.ctx && - vma->vm_userfaultfd_ctx.ctx != ctx); - WARN_ON(!(vma->vm_flags & VM_MAYWRITE)); - - /* - * Nothing to do: this vma is already registered into this - * userfaultfd and with the right tracking mode too. - */ - if (vma->vm_userfaultfd_ctx.ctx == ctx && - (vma->vm_flags & vm_flags) == vm_flags) - goto skip; - - if (vma->vm_start > start) - start = vma->vm_start; - vma_end = min(end, vma->vm_end); - - new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags; - vma = vma_modify_flags_uffd(&vmi, prev, vma, start, vma_end, - new_flags, - (struct vm_userfaultfd_ctx){ctx}); - if (IS_ERR(vma)) { - ret = PTR_ERR(vma); - break; - } - - /* - * In the vma_merge() successful mprotect-like case 8: - * the next vma was merged into the current one and - * the current one has not been updated yet. 
- */ - vma_start_write(vma); - userfaultfd_set_vm_flags(vma, new_flags); - vma->vm_userfaultfd_ctx.ctx = ctx; - - if (is_vm_hugetlb_page(vma) && uffd_disable_huge_pmd_share(vma)) - hugetlb_unshare_all_pmds(vma); - - skip: - prev = vma; - start = vma->vm_end; - } + ret = userfaultfd_register_range(ctx, vma, vm_flags, start, end, + wp_async); out_unlock: mmap_write_unlock(mm); @@ -1519,7 +1396,6 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, struct vm_area_struct *vma, *prev, *cur; int ret; struct uffdio_range uffdio_unregister; - unsigned long new_flags; bool found; unsigned long start, end, vma_end; const void __user *buf = (void __user *)arg; @@ -1622,27 +1498,13 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, wake_userfault(vma->vm_userfaultfd_ctx.ctx, &range); } - /* Reset ptes for the whole vma range if wr-protected */ - if (userfaultfd_wp(vma)) - uffd_wp_range(vma, start, vma_end - start, false); - - new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS; - vma = vma_modify_flags_uffd(&vmi, prev, vma, start, vma_end, - new_flags, NULL_VM_UFFD_CTX); + vma = userfaultfd_clear_vma(&vmi, prev, vma, + start, vma_end); if (IS_ERR(vma)) { ret = PTR_ERR(vma); break; } - /* - * In the vma_merge() successful mprotect-like case 8: - * the next vma was merged into the current one and - * the current one has not been updated yet. - */ - vma_start_write(vma); - userfaultfd_set_vm_flags(vma, new_flags); - vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; - skip: prev = vma; start = vma->vm_end; diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index a12bcf042551..9fc6ce15c499 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -267,6 +267,25 @@ extern void userfaultfd_unmap_complete(struct mm_struct *mm, extern bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma); extern bool userfaultfd_wp_async(struct vm_area_struct *vma); +void userfaultfd_reset_ctx(struct vm_area_struct *vma); + +struct vm_area_struct *userfaultfd_clear_vma(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, + unsigned long end); + +int userfaultfd_register_range(struct userfaultfd_ctx *ctx, + struct vm_area_struct *vma, + unsigned long vm_flags, + unsigned long start, unsigned long end, + bool wp_async); + +void userfaultfd_release_new(struct userfaultfd_ctx *ctx); + +void userfaultfd_release_all(struct mm_struct *mm, + struct userfaultfd_ctx *ctx); + #else /* CONFIG_USERFAULTFD */ /* mm helpers */ diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index e54e5c8907fa..966e6c81a685 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1760,3 +1760,171 @@ out: VM_WARN_ON(!moved && !err); return moved ? moved : err; } + +static void userfaultfd_set_vm_flags(struct vm_area_struct *vma, + vm_flags_t flags) +{ + const bool uffd_wp_changed = (vma->vm_flags ^ flags) & VM_UFFD_WP; + + vm_flags_reset(vma, flags); + /* + * For shared mappings, we want to enable writenotify while + * userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply + * recalculate vma->vm_page_prot whenever userfaultfd-wp changes. 
+ */ + if ((vma->vm_flags & VM_SHARED) && uffd_wp_changed) + vma_set_page_prot(vma); +} + +static void userfaultfd_set_ctx(struct vm_area_struct *vma, + struct userfaultfd_ctx *ctx, + unsigned long flags) +{ + vma_start_write(vma); + vma->vm_userfaultfd_ctx = (struct vm_userfaultfd_ctx){ctx}; + userfaultfd_set_vm_flags(vma, + (vma->vm_flags & ~__VM_UFFD_FLAGS) | flags); +} + +void userfaultfd_reset_ctx(struct vm_area_struct *vma) +{ + userfaultfd_set_ctx(vma, NULL, 0); +} + +struct vm_area_struct *userfaultfd_clear_vma(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, + unsigned long end) +{ + struct vm_area_struct *ret; + + /* Reset ptes for the whole vma range if wr-protected */ + if (userfaultfd_wp(vma)) + uffd_wp_range(vma, start, end - start, false); + + ret = vma_modify_flags_uffd(vmi, prev, vma, start, end, + vma->vm_flags & ~__VM_UFFD_FLAGS, + NULL_VM_UFFD_CTX); + + /* + * In the vma_merge() successful mprotect-like case 8: + * the next vma was merged into the current one and + * the current one has not been updated yet. + */ + if (!IS_ERR(ret)) + userfaultfd_reset_ctx(ret); + + return ret; +} + +/* Assumes mmap write lock taken, and mm_struct pinned. */ +int userfaultfd_register_range(struct userfaultfd_ctx *ctx, + struct vm_area_struct *vma, + unsigned long vm_flags, + unsigned long start, unsigned long end, + bool wp_async) +{ + VMA_ITERATOR(vmi, ctx->mm, start); + struct vm_area_struct *prev = vma_prev(&vmi); + unsigned long vma_end; + unsigned long new_flags; + + if (vma->vm_start < start) + prev = vma; + + for_each_vma_range(vmi, vma, end) { + cond_resched(); + + BUG_ON(!vma_can_userfault(vma, vm_flags, wp_async)); + BUG_ON(vma->vm_userfaultfd_ctx.ctx && + vma->vm_userfaultfd_ctx.ctx != ctx); + WARN_ON(!(vma->vm_flags & VM_MAYWRITE)); + + /* + * Nothing to do: this vma is already registered into this + * userfaultfd and with the right tracking mode too. + */ + if (vma->vm_userfaultfd_ctx.ctx == ctx && + (vma->vm_flags & vm_flags) == vm_flags) + goto skip; + + if (vma->vm_start > start) + start = vma->vm_start; + vma_end = min(end, vma->vm_end); + + new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags; + vma = vma_modify_flags_uffd(&vmi, prev, vma, start, vma_end, + new_flags, + (struct vm_userfaultfd_ctx){ctx}); + if (IS_ERR(vma)) + return PTR_ERR(vma); + + /* + * In the vma_merge() successful mprotect-like case 8: + * the next vma was merged into the current one and + * the current one has not been updated yet. + */ + userfaultfd_set_ctx(vma, ctx, vm_flags); + + if (is_vm_hugetlb_page(vma) && uffd_disable_huge_pmd_share(vma)) + hugetlb_unshare_all_pmds(vma); + +skip: + prev = vma; + start = vma->vm_end; + } + + return 0; +} + +void userfaultfd_release_new(struct userfaultfd_ctx *ctx) +{ + struct mm_struct *mm = ctx->mm; + struct vm_area_struct *vma; + VMA_ITERATOR(vmi, mm, 0); + + /* the various vma->vm_userfaultfd_ctx still points to it */ + mmap_write_lock(mm); + for_each_vma(vmi, vma) { + if (vma->vm_userfaultfd_ctx.ctx == ctx) + userfaultfd_reset_ctx(vma); + } + mmap_write_unlock(mm); +} + +void userfaultfd_release_all(struct mm_struct *mm, + struct userfaultfd_ctx *ctx) +{ + struct vm_area_struct *vma, *prev; + VMA_ITERATOR(vmi, mm, 0); + + if (!mmget_not_zero(mm)) + return; + + /* + * Flush page faults out of all CPUs. 
NOTE: all page faults + * must be retried without returning VM_FAULT_SIGBUS if + * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx + * changes while handle_userfault released the mmap_lock. So + * it's critical that released is set to true (above), before + * taking the mmap_lock for writing. + */ + mmap_write_lock(mm); + prev = NULL; + for_each_vma(vmi, vma) { + cond_resched(); + BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^ + !!(vma->vm_flags & __VM_UFFD_FLAGS)); + if (vma->vm_userfaultfd_ctx.ctx != ctx) { + prev = vma; + continue; + } + + vma = userfaultfd_clear_vma(&vmi, prev, vma, + vma->vm_start, vma->vm_end); + prev = vma; + } + mmap_write_unlock(mm); + mmput(mm); +} From fa04c08f3ce649b47781c7663e92398197f5b551 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 29 Jul 2024 12:50:36 +0100 Subject: [PATCH 043/416] mm: move vma_modify() and helpers to internal header These are core VMA manipulation functions which invoke VMA splitting and merging and should not be directly accessed from outside of mm/. Link: https://lkml.kernel.org/r/5efde0c6342a8860d5ffc90b415f3989fd8ed0b2.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Reviewed-by: Liam R. Howlett Cc: Alexander Viro Cc: Brendan Higgins Cc: Christian Brauner Cc: David Gow Cc: Eric W. Biederman Cc: Jan Kara Cc: Kees Cook Cc: Matthew Wilcox (Oracle) Cc: Rae Moar Cc: SeongJae Park Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Pengfei Xu Signed-off-by: Andrew Morton --- include/linux/mm.h | 60 --------------------------------------------- mm/internal.h | 61 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 61 insertions(+), 60 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 153a4662a0cd..c4054e68dc09 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3279,66 +3279,6 @@ extern struct vm_area_struct *copy_vma(struct vm_area_struct **, unsigned long addr, unsigned long len, pgoff_t pgoff, bool *need_rmap_locks); extern void exit_mmap(struct mm_struct *); -struct vm_area_struct *vma_modify(struct vma_iterator *vmi, - struct vm_area_struct *prev, - struct vm_area_struct *vma, - unsigned long start, unsigned long end, - unsigned long vm_flags, - struct mempolicy *policy, - struct vm_userfaultfd_ctx uffd_ctx, - struct anon_vma_name *anon_name); - -/* We are about to modify the VMA's flags. */ -static inline struct vm_area_struct -*vma_modify_flags(struct vma_iterator *vmi, - struct vm_area_struct *prev, - struct vm_area_struct *vma, - unsigned long start, unsigned long end, - unsigned long new_flags) -{ - return vma_modify(vmi, prev, vma, start, end, new_flags, - vma_policy(vma), vma->vm_userfaultfd_ctx, - anon_vma_name(vma)); -} - -/* We are about to modify the VMA's flags and/or anon_name. */ -static inline struct vm_area_struct -*vma_modify_flags_name(struct vma_iterator *vmi, - struct vm_area_struct *prev, - struct vm_area_struct *vma, - unsigned long start, - unsigned long end, - unsigned long new_flags, - struct anon_vma_name *new_name) -{ - return vma_modify(vmi, prev, vma, start, end, new_flags, - vma_policy(vma), vma->vm_userfaultfd_ctx, new_name); -} - -/* We are about to modify the VMA's memory policy. 
*/ -static inline struct vm_area_struct -*vma_modify_policy(struct vma_iterator *vmi, - struct vm_area_struct *prev, - struct vm_area_struct *vma, - unsigned long start, unsigned long end, - struct mempolicy *new_pol) -{ - return vma_modify(vmi, prev, vma, start, end, vma->vm_flags, - new_pol, vma->vm_userfaultfd_ctx, anon_vma_name(vma)); -} - -/* We are about to modify the VMA's flags and/or uffd context. */ -static inline struct vm_area_struct -*vma_modify_flags_uffd(struct vma_iterator *vmi, - struct vm_area_struct *prev, - struct vm_area_struct *vma, - unsigned long start, unsigned long end, - unsigned long new_flags, - struct vm_userfaultfd_ctx new_ctx) -{ - return vma_modify(vmi, prev, vma, start, end, new_flags, - vma_policy(vma), new_ctx, anon_vma_name(vma)); -} static inline int check_data_rlimit(unsigned long rlim, unsigned long new, diff --git a/mm/internal.h b/mm/internal.h index b4d86436565b..81564ce0f9e2 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1244,6 +1244,67 @@ struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long delta); +struct vm_area_struct *vma_modify(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, unsigned long end, + unsigned long vm_flags, + struct mempolicy *policy, + struct vm_userfaultfd_ctx uffd_ctx, + struct anon_vma_name *anon_name); + +/* We are about to modify the VMA's flags. */ +static inline struct vm_area_struct +*vma_modify_flags(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, unsigned long end, + unsigned long new_flags) +{ + return vma_modify(vmi, prev, vma, start, end, new_flags, + vma_policy(vma), vma->vm_userfaultfd_ctx, + anon_vma_name(vma)); +} + +/* We are about to modify the VMA's flags and/or anon_name. */ +static inline struct vm_area_struct +*vma_modify_flags_name(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, + unsigned long end, + unsigned long new_flags, + struct anon_vma_name *new_name) +{ + return vma_modify(vmi, prev, vma, start, end, new_flags, + vma_policy(vma), vma->vm_userfaultfd_ctx, new_name); +} + +/* We are about to modify the VMA's memory policy. */ +static inline struct vm_area_struct +*vma_modify_policy(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, unsigned long end, + struct mempolicy *new_pol) +{ + return vma_modify(vmi, prev, vma, start, end, vma->vm_flags, + new_pol, vma->vm_userfaultfd_ctx, anon_vma_name(vma)); +} + +/* We are about to modify the VMA's flags and/or uffd context. */ +static inline struct vm_area_struct +*vma_modify_flags_uffd(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, unsigned long end, + unsigned long new_flags, + struct vm_userfaultfd_ctx new_ctx) +{ + return vma_modify(vmi, prev, vma, start, end, new_flags, + vma_policy(vma), new_ctx, anon_vma_name(vma)); +} + enum { /* mark page accessed */ FOLL_TOUCH = 1 << 16, From d61f0d59683d9c899211c300254d4140c482a6c0 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 29 Jul 2024 12:50:37 +0100 Subject: [PATCH 044/416] mm: move vma_shrink(), vma_expand() to internal header The vma_shrink() and vma_expand() functions are internal VMA manipulation functions which we ought to abstract for use outside of memory management code. 
To achieve this, we replace shift_arg_pages() in fs/exec.c with an invocation of a new relocate_vma_down() function implemented in mm/mmap.c, which enables us to also move move_page_tables() and vma_iter_prev_range() to internal.h. The purpose of doing this is to isolate key VMA manipulation functions in order that we can both abstract them and later render them easily testable. Link: https://lkml.kernel.org/r/3cfcd9ec433e032a85f636fdc0d7d98fafbd19c5.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Reviewed-by: Liam R. Howlett Cc: Alexander Viro Cc: Brendan Higgins Cc: Christian Brauner Cc: David Gow Cc: Eric W. Biederman Cc: Jan Kara Cc: Kees Cook Cc: Matthew Wilcox (Oracle) Cc: Rae Moar Cc: SeongJae Park Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Pengfei Xu Signed-off-by: Andrew Morton --- fs/exec.c | 81 ++++------------------------------------------ include/linux/mm.h | 17 +--------- mm/internal.h | 18 +++++++++++ mm/mmap.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 106 insertions(+), 91 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 50e76cc633c4..9009e2b2805e 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -711,80 +711,6 @@ static int copy_strings_kernel(int argc, const char *const *argv, #ifdef CONFIG_MMU -/* - * During bprm_mm_init(), we create a temporary stack at STACK_TOP_MAX. Once - * the binfmt code determines where the new stack should reside, we shift it to - * its final location. The process proceeds as follows: - * - * 1) Use shift to calculate the new vma endpoints. - * 2) Extend vma to cover both the old and new ranges. This ensures the - * arguments passed to subsequent functions are consistent. - * 3) Move vma's page tables to the new range. - * 4) Free up any cleared pgd range. - * 5) Shrink the vma to cover only the new range. - */ -static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift) -{ - struct mm_struct *mm = vma->vm_mm; - unsigned long old_start = vma->vm_start; - unsigned long old_end = vma->vm_end; - unsigned long length = old_end - old_start; - unsigned long new_start = old_start - shift; - unsigned long new_end = old_end - shift; - VMA_ITERATOR(vmi, mm, new_start); - struct vm_area_struct *next; - struct mmu_gather tlb; - - BUG_ON(new_start > new_end); - - /* - * ensure there are no vmas between where we want to go - * and where we are - */ - if (vma != vma_next(&vmi)) - return -EFAULT; - - vma_iter_prev_range(&vmi); - /* - * cover the whole range: [new_start, old_end) - */ - if (vma_expand(&vmi, vma, new_start, old_end, vma->vm_pgoff, NULL)) - return -ENOMEM; - - /* - * move the page tables downwards, on failure we rely on - * process cleanup to remove whatever mess we made. - */ - if (length != move_page_tables(vma, old_start, - vma, new_start, length, false, true)) - return -ENOMEM; - - lru_add_drain(); - tlb_gather_mmu(&tlb, mm); - next = vma_next(&vmi); - if (new_end > old_start) { - /* - * when the old and new regions overlap clear from new_end. - */ - free_pgd_range(&tlb, new_end, old_end, new_end, - next ? next->vm_start : USER_PGTABLES_CEILING); - } else { - /* - * otherwise, clean from old_start; this is done to not touch - * the address space in [new_end, old_start) some architectures - * have constraints on va-space that make this illegal (IA64) - - * for the others its just a little faster. - */ - free_pgd_range(&tlb, old_start, old_end, new_end, - next ? 
next->vm_start : USER_PGTABLES_CEILING); - } - tlb_finish_mmu(&tlb); - - vma_prev(&vmi); - /* Shrink the vma to just the new range */ - return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff); -} - /* * Finalizes the stack vm_area_struct. The flags and permissions are updated, * the stack is optionally relocated, and some extra space is added. @@ -877,7 +803,12 @@ int setup_arg_pages(struct linux_binprm *bprm, /* Move stack pages down in memory. */ if (stack_shift) { - ret = shift_arg_pages(vma, stack_shift); + /* + * During bprm_mm_init(), we create a temporary stack at STACK_TOP_MAX. Once + * the binfmt code determines where the new stack should reside, we shift it to + * its final location. + */ + ret = relocate_vma_down(vma, stack_shift); if (ret) goto out_unlock; } diff --git a/include/linux/mm.h b/include/linux/mm.h index c4054e68dc09..1c0d88446eb8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1005,12 +1005,6 @@ static inline struct vm_area_struct *vma_prev(struct vma_iterator *vmi) return mas_prev(&vmi->mas, 0); } -static inline -struct vm_area_struct *vma_iter_prev_range(struct vma_iterator *vmi) -{ - return mas_prev_range(&vmi->mas, 0); -} - static inline unsigned long vma_iter_addr(struct vma_iterator *vmi) { return vmi->mas.index; @@ -2520,11 +2514,6 @@ int set_page_dirty_lock(struct page *page); int get_cmdline(struct task_struct *task, char *buffer, int buflen); -extern unsigned long move_page_tables(struct vm_area_struct *vma, - unsigned long old_addr, struct vm_area_struct *new_vma, - unsigned long new_addr, unsigned long len, - bool need_rmap_locks, bool for_stack); - /* * Flags used by change_protection(). For now we make it a bitmap so * that we can pass in multiple flags just like parameters. However @@ -3267,11 +3256,6 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node); /* mmap.c */ extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin); -extern int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, - unsigned long start, unsigned long end, pgoff_t pgoff, - struct vm_area_struct *next); -extern int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, - unsigned long start, unsigned long end, pgoff_t pgoff); extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *); extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *); extern void unlink_file_vma(struct vm_area_struct *); @@ -3279,6 +3263,7 @@ extern struct vm_area_struct *copy_vma(struct vm_area_struct **, unsigned long addr, unsigned long len, pgoff_t pgoff, bool *need_rmap_locks); extern void exit_mmap(struct mm_struct *); +int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); static inline int check_data_rlimit(unsigned long rlim, unsigned long new, diff --git a/mm/internal.h b/mm/internal.h index 81564ce0f9e2..a4d0e98ccb97 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1305,6 +1305,12 @@ static inline struct vm_area_struct vma_policy(vma), new_ctx, anon_vma_name(vma)); } +int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, + unsigned long start, unsigned long end, pgoff_t pgoff, + struct vm_area_struct *next); +int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, + unsigned long start, unsigned long end, pgoff_t pgoff); + enum { /* mark page accessed */ FOLL_TOUCH = 1 << 16, @@ -1528,6 +1534,12 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi, return 0; } +static inline +struct vm_area_struct *vma_iter_prev_range(struct 
vma_iterator *vmi) +{ + return mas_prev_range(&vmi->mas, 0); +} + /* * VMA lock generalization */ @@ -1639,4 +1651,10 @@ void unlink_file_vma_batch_init(struct unlink_vma_file_batch *); void unlink_file_vma_batch_add(struct unlink_vma_file_batch *, struct vm_area_struct *); void unlink_file_vma_batch_final(struct unlink_vma_file_batch *); +/* mremap.c */ +unsigned long move_page_tables(struct vm_area_struct *vma, + unsigned long old_addr, struct vm_area_struct *new_vma, + unsigned long new_addr, unsigned long len, + bool need_rmap_locks, bool for_stack); + #endif /* __MM_INTERNAL_H */ diff --git a/mm/mmap.c b/mm/mmap.c index d0dfc85b209b..211148ba2831 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -4088,3 +4088,84 @@ static int __meminit init_reserve_notifier(void) return 0; } subsys_initcall(init_reserve_notifier); + +/* + * Relocate a VMA downwards by shift bytes. There cannot be any VMAs between + * this VMA and its relocated range, which will now reside at [vma->vm_start - + * shift, vma->vm_end - shift). + * + * This function is almost certainly NOT what you want for anything other than + * early executable temporary stack relocation. + */ +int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) +{ + /* + * The process proceeds as follows: + * + * 1) Use shift to calculate the new vma endpoints. + * 2) Extend vma to cover both the old and new ranges. This ensures the + * arguments passed to subsequent functions are consistent. + * 3) Move vma's page tables to the new range. + * 4) Free up any cleared pgd range. + * 5) Shrink the vma to cover only the new range. + */ + + struct mm_struct *mm = vma->vm_mm; + unsigned long old_start = vma->vm_start; + unsigned long old_end = vma->vm_end; + unsigned long length = old_end - old_start; + unsigned long new_start = old_start - shift; + unsigned long new_end = old_end - shift; + VMA_ITERATOR(vmi, mm, new_start); + struct vm_area_struct *next; + struct mmu_gather tlb; + + BUG_ON(new_start > new_end); + + /* + * ensure there are no vmas between where we want to go + * and where we are + */ + if (vma != vma_next(&vmi)) + return -EFAULT; + + vma_iter_prev_range(&vmi); + /* + * cover the whole range: [new_start, old_end) + */ + if (vma_expand(&vmi, vma, new_start, old_end, vma->vm_pgoff, NULL)) + return -ENOMEM; + + /* + * move the page tables downwards, on failure we rely on + * process cleanup to remove whatever mess we made. + */ + if (length != move_page_tables(vma, old_start, + vma, new_start, length, false, true)) + return -ENOMEM; + + lru_add_drain(); + tlb_gather_mmu(&tlb, mm); + next = vma_next(&vmi); + if (new_end > old_start) { + /* + * when the old and new regions overlap clear from new_end. + */ + free_pgd_range(&tlb, new_end, old_end, new_end, + next ? next->vm_start : USER_PGTABLES_CEILING); + } else { + /* + * otherwise, clean from old_start; this is done to not touch + * the address space in [new_end, old_start) some architectures + * have constraints on va-space that make this illegal (IA64) - + * for the others its just a little faster. + */ + free_pgd_range(&tlb, old_start, old_end, new_end, + next ? 
next->vm_start : USER_PGTABLES_CEILING); + } + tlb_finish_mmu(&tlb); + + vma_prev(&vmi); + /* Shrink the vma to just the new range */ + return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff); +} From 49b1b8d6f6831026cb105b0eafa18f13db612d86 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 29 Jul 2024 12:50:38 +0100 Subject: [PATCH 045/416] mm: move internal core VMA manipulation functions to own file This patch introduces vma.c and moves internal core VMA manipulation functions to this file from mmap.c. This allows us to isolate VMA functionality in a single place such that we can create userspace testing code that invokes this functionality in an environment where we can implement simple unit tests of core functionality. This patch ensures that core VMA functionality is explicitly marked as such by its presence in mm/vma.h. It also places the header includes required by vma.c in vma_internal.h, which is simply imported by vma.c. This makes the VMA functionality testable, as userland testing code can simply stub out functionality as required. Link: https://lkml.kernel.org/r/c77a6aafb4c42aaadb8e7271a853658cbdca2e22.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Vlastimil Babka Reviewed-by: Liam R. Howlett Cc: Alexander Viro Cc: Brendan Higgins Cc: Christian Brauner Cc: David Gow Cc: Eric W. Biederman Cc: Jan Kara Cc: Kees Cook Cc: Matthew Wilcox (Oracle) Cc: Rae Moar Cc: SeongJae Park Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Pengfei Xu Signed-off-by: Andrew Morton --- include/linux/mm.h | 35 - mm/Makefile | 2 +- mm/internal.h | 236 +----- mm/mmap.c | 1772 -------------------------------------------- mm/mmu_notifier.c | 2 + mm/vma.c | 1766 +++++++++++++++++++++++++++++++++++++++++++ mm/vma.h | 364 +++++++++ mm/vma_internal.h | 50 ++ 8 files changed, 2188 insertions(+), 2039 deletions(-) create mode 100644 mm/vma.c create mode 100644 mm/vma.h create mode 100644 mm/vma_internal.h diff --git a/include/linux/mm.h b/include/linux/mm.h index 1c0d88446eb8..bd219ac9c026 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1005,21 +1005,6 @@ static inline struct vm_area_struct *vma_prev(struct vma_iterator *vmi) return mas_prev(&vmi->mas, 0); } -static inline unsigned long vma_iter_addr(struct vma_iterator *vmi) -{ - return vmi->mas.index; -} - -static inline unsigned long vma_iter_end(struct vma_iterator *vmi) -{ - return vmi->mas.last + 1; -} -static inline int vma_iter_bulk_alloc(struct vma_iterator *vmi, - unsigned long count) -{ - return mas_expected_entries(&vmi->mas, count); -} - static inline int vma_iter_clear_gfp(struct vma_iterator *vmi, unsigned long start, unsigned long end, gfp_t gfp) { @@ -2534,21 +2519,6 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen); #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ MM_CP_UFFD_WP_RESOLVE) -bool vma_needs_dirty_tracking(struct vm_area_struct *vma); -bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); -static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) -{ - /* - * We want to check manually if we can change individual PTEs writable - * if we can't do that automatically for all PTEs in a mapping. For - * private mappings, that's always the case when we have write - * permissions as we properly have to handle COW. 
- */ - if (vma->vm_flags & VM_SHARED) - return vma_wants_writenotify(vma, vma->vm_page_prot); - return !!(vma->vm_flags & VM_WRITE); - -} bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, pte_t pte); extern long change_protection(struct mmu_gather *tlb, @@ -3256,12 +3226,7 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node); /* mmap.c */ extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin); -extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *); extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *); -extern void unlink_file_vma(struct vm_area_struct *); -extern struct vm_area_struct *copy_vma(struct vm_area_struct **, - unsigned long addr, unsigned long len, pgoff_t pgoff, - bool *need_rmap_locks); extern void exit_mmap(struct mm_struct *); int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift); diff --git a/mm/Makefile b/mm/Makefile index b11fefbc13bc..24b918260c7f 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -37,7 +37,7 @@ mmu-y := nommu.o mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o \ mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \ msync.o page_vma_mapped.o pagewalk.o \ - pgtable-generic.o rmap.o vmalloc.o + pgtable-generic.o rmap.o vmalloc.o vma.o ifdef CONFIG_CROSS_MEMORY_ATTACH diff --git a/mm/internal.h b/mm/internal.h index a4d0e98ccb97..1159b04e76a3 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -8,13 +8,18 @@ #define __MM_INTERNAL_H #include +#include #include +#include #include #include #include #include #include +/* Internal core VMA manipulation functions. */ +#include "vma.h" + struct folio_batch; /* @@ -778,37 +783,6 @@ static inline bool free_area_empty(struct free_area *area, int migratetype) return list_empty(&area->free_list[migratetype]); } -/* - * These three helpers classifies VMAs for virtual memory accounting. - */ - -/* - * Executable code area - executable, not writable, not stack - */ -static inline bool is_exec_mapping(vm_flags_t flags) -{ - return (flags & (VM_EXEC | VM_WRITE | VM_STACK)) == VM_EXEC; -} - -/* - * Stack area (including shadow stacks) - * - * VM_GROWSUP / VM_GROWSDOWN VMAs are always private anonymous: - * do_mmap() forbids all other combinations. - */ -static inline bool is_stack_mapping(vm_flags_t flags) -{ - return ((flags & VM_STACK) == VM_STACK) || (flags & VM_SHADOW_STACK); -} - -/* - * Data area - private, writable, not stack - */ -static inline bool is_data_mapping(vm_flags_t flags) -{ - return (flags & (VM_WRITE | VM_SHARED | VM_STACK)) == VM_WRITE; -} - /* mm/util.c */ struct anon_vma *folio_anon_vma(struct folio *folio); @@ -1237,80 +1211,6 @@ void touch_pud(struct vm_area_struct *vma, unsigned long addr, void touch_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd, bool write); -/* - * mm/mmap.c - */ -struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi, - struct vm_area_struct *vma, - unsigned long delta); - -struct vm_area_struct *vma_modify(struct vma_iterator *vmi, - struct vm_area_struct *prev, - struct vm_area_struct *vma, - unsigned long start, unsigned long end, - unsigned long vm_flags, - struct mempolicy *policy, - struct vm_userfaultfd_ctx uffd_ctx, - struct anon_vma_name *anon_name); - -/* We are about to modify the VMA's flags. 
*/ -static inline struct vm_area_struct -*vma_modify_flags(struct vma_iterator *vmi, - struct vm_area_struct *prev, - struct vm_area_struct *vma, - unsigned long start, unsigned long end, - unsigned long new_flags) -{ - return vma_modify(vmi, prev, vma, start, end, new_flags, - vma_policy(vma), vma->vm_userfaultfd_ctx, - anon_vma_name(vma)); -} - -/* We are about to modify the VMA's flags and/or anon_name. */ -static inline struct vm_area_struct -*vma_modify_flags_name(struct vma_iterator *vmi, - struct vm_area_struct *prev, - struct vm_area_struct *vma, - unsigned long start, - unsigned long end, - unsigned long new_flags, - struct anon_vma_name *new_name) -{ - return vma_modify(vmi, prev, vma, start, end, new_flags, - vma_policy(vma), vma->vm_userfaultfd_ctx, new_name); -} - -/* We are about to modify the VMA's memory policy. */ -static inline struct vm_area_struct -*vma_modify_policy(struct vma_iterator *vmi, - struct vm_area_struct *prev, - struct vm_area_struct *vma, - unsigned long start, unsigned long end, - struct mempolicy *new_pol) -{ - return vma_modify(vmi, prev, vma, start, end, vma->vm_flags, - new_pol, vma->vm_userfaultfd_ctx, anon_vma_name(vma)); -} - -/* We are about to modify the VMA's flags and/or uffd context. */ -static inline struct vm_area_struct -*vma_modify_flags_uffd(struct vma_iterator *vmi, - struct vm_area_struct *prev, - struct vm_area_struct *vma, - unsigned long start, unsigned long end, - unsigned long new_flags, - struct vm_userfaultfd_ctx new_ctx) -{ - return vma_modify(vmi, prev, vma, start, end, new_flags, - vma_policy(vma), new_ctx, anon_vma_name(vma)); -} - -int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, - unsigned long start, unsigned long end, pgoff_t pgoff, - struct vm_area_struct *next); -int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, - unsigned long start, unsigned long end, pgoff_t pgoff); - enum { /* mark page accessed */ FOLL_TOUCH = 1 << 16, @@ -1437,123 +1337,6 @@ static inline bool pte_needs_soft_dirty_wp(struct vm_area_struct *vma, pte_t pte return vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte); } -static inline void vma_iter_config(struct vma_iterator *vmi, - unsigned long index, unsigned long last) -{ - __mas_set_range(&vmi->mas, index, last - 1); -} - -static inline void vma_iter_reset(struct vma_iterator *vmi) -{ - mas_reset(&vmi->mas); -} - -static inline -struct vm_area_struct *vma_iter_prev_range_limit(struct vma_iterator *vmi, unsigned long min) -{ - return mas_prev_range(&vmi->mas, min); -} - -static inline -struct vm_area_struct *vma_iter_next_range_limit(struct vma_iterator *vmi, unsigned long max) -{ - return mas_next_range(&vmi->mas, max); -} - -static inline int vma_iter_area_lowest(struct vma_iterator *vmi, unsigned long min, - unsigned long max, unsigned long size) -{ - return mas_empty_area(&vmi->mas, min, max - 1, size); -} - -static inline int vma_iter_area_highest(struct vma_iterator *vmi, unsigned long min, - unsigned long max, unsigned long size) -{ - return mas_empty_area_rev(&vmi->mas, min, max - 1, size); -} - -/* - * VMA Iterator functions shared between nommu and mmap - */ -static inline int vma_iter_prealloc(struct vma_iterator *vmi, - struct vm_area_struct *vma) -{ - return mas_preallocate(&vmi->mas, vma, GFP_KERNEL); -} - -static inline void vma_iter_clear(struct vma_iterator *vmi) -{ - mas_store_prealloc(&vmi->mas, NULL); -} - -static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi) -{ - return mas_walk(&vmi->mas); -} - -/* Store a VMA with 
preallocated memory */ -static inline void vma_iter_store(struct vma_iterator *vmi, - struct vm_area_struct *vma) -{ - -#if defined(CONFIG_DEBUG_VM_MAPLE_TREE) - if (MAS_WARN_ON(&vmi->mas, vmi->mas.status != ma_start && - vmi->mas.index > vma->vm_start)) { - pr_warn("%lx > %lx\n store vma %lx-%lx\n into slot %lx-%lx\n", - vmi->mas.index, vma->vm_start, vma->vm_start, - vma->vm_end, vmi->mas.index, vmi->mas.last); - } - if (MAS_WARN_ON(&vmi->mas, vmi->mas.status != ma_start && - vmi->mas.last < vma->vm_start)) { - pr_warn("%lx < %lx\nstore vma %lx-%lx\ninto slot %lx-%lx\n", - vmi->mas.last, vma->vm_start, vma->vm_start, vma->vm_end, - vmi->mas.index, vmi->mas.last); - } -#endif - - if (vmi->mas.status != ma_start && - ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start))) - vma_iter_invalidate(vmi); - - __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1); - mas_store_prealloc(&vmi->mas, vma); -} - -static inline int vma_iter_store_gfp(struct vma_iterator *vmi, - struct vm_area_struct *vma, gfp_t gfp) -{ - if (vmi->mas.status != ma_start && - ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start))) - vma_iter_invalidate(vmi); - - __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1); - mas_store_gfp(&vmi->mas, vma, gfp); - if (unlikely(mas_is_err(&vmi->mas))) - return -ENOMEM; - - return 0; -} - -static inline -struct vm_area_struct *vma_iter_prev_range(struct vma_iterator *vmi) -{ - return mas_prev_range(&vmi->mas, 0); -} - -/* - * VMA lock generalization - */ -struct vma_prepare { - struct vm_area_struct *vma; - struct vm_area_struct *adj_next; - struct file *file; - struct address_space *mapping; - struct anon_vma *anon_vma; - struct vm_area_struct *insert; - struct vm_area_struct *remove; - struct vm_area_struct *remove2; -}; - void __meminit __init_single_page(struct page *page, unsigned long pfn, unsigned long zone, int nid); @@ -1642,15 +1425,6 @@ static inline void shrinker_debugfs_remove(struct dentry *debugfs_entry, void workingset_update_node(struct xa_node *node); extern struct list_lru shadow_nodes; -struct unlink_vma_file_batch { - int count; - struct vm_area_struct *vmas[8]; -}; - -void unlink_file_vma_batch_init(struct unlink_vma_file_batch *); -void unlink_file_vma_batch_add(struct unlink_vma_file_batch *, struct vm_area_struct *); -void unlink_file_vma_batch_final(struct unlink_vma_file_batch *); - /* mremap.c */ unsigned long move_page_tables(struct vm_area_struct *vma, unsigned long old_addr, struct vm_area_struct *new_vma, diff --git a/mm/mmap.c b/mm/mmap.c index 211148ba2831..4a9c2329b09a 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -76,16 +76,6 @@ int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS; static bool ignore_rlimit_data; core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644); -static void unmap_region(struct mm_struct *mm, struct ma_state *mas, - struct vm_area_struct *vma, struct vm_area_struct *prev, - struct vm_area_struct *next, unsigned long start, - unsigned long end, unsigned long tree_end, bool mm_wr_locked); - -static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags) -{ - return pgprot_modify(oldprot, vm_get_page_prot(vm_flags)); -} - /* Update vma->vm_page_prot to reflect vma->vm_flags. 
*/ void vma_set_page_prot(struct vm_area_struct *vma) { @@ -101,100 +91,6 @@ void vma_set_page_prot(struct vm_area_struct *vma) WRITE_ONCE(vma->vm_page_prot, vm_page_prot); } -/* - * Requires inode->i_mapping->i_mmap_rwsem - */ -static void __remove_shared_vm_struct(struct vm_area_struct *vma, - struct address_space *mapping) -{ - if (vma_is_shared_maywrite(vma)) - mapping_unmap_writable(mapping); - - flush_dcache_mmap_lock(mapping); - vma_interval_tree_remove(vma, &mapping->i_mmap); - flush_dcache_mmap_unlock(mapping); -} - -/* - * Unlink a file-based vm structure from its interval tree, to hide - * vma from rmap and vmtruncate before freeing its page tables. - */ -void unlink_file_vma(struct vm_area_struct *vma) -{ - struct file *file = vma->vm_file; - - if (file) { - struct address_space *mapping = file->f_mapping; - i_mmap_lock_write(mapping); - __remove_shared_vm_struct(vma, mapping); - i_mmap_unlock_write(mapping); - } -} - -void unlink_file_vma_batch_init(struct unlink_vma_file_batch *vb) -{ - vb->count = 0; -} - -static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb) -{ - struct address_space *mapping; - int i; - - mapping = vb->vmas[0]->vm_file->f_mapping; - i_mmap_lock_write(mapping); - for (i = 0; i < vb->count; i++) { - VM_WARN_ON_ONCE(vb->vmas[i]->vm_file->f_mapping != mapping); - __remove_shared_vm_struct(vb->vmas[i], mapping); - } - i_mmap_unlock_write(mapping); - - unlink_file_vma_batch_init(vb); -} - -void unlink_file_vma_batch_add(struct unlink_vma_file_batch *vb, - struct vm_area_struct *vma) -{ - if (vma->vm_file == NULL) - return; - - if ((vb->count > 0 && vb->vmas[0]->vm_file != vma->vm_file) || - vb->count == ARRAY_SIZE(vb->vmas)) - unlink_file_vma_batch_process(vb); - - vb->vmas[vb->count] = vma; - vb->count++; -} - -void unlink_file_vma_batch_final(struct unlink_vma_file_batch *vb) -{ - if (vb->count > 0) - unlink_file_vma_batch_process(vb); -} - -/* - * Close a vm structure and free it. - */ -static void remove_vma(struct vm_area_struct *vma, bool unreachable) -{ - might_sleep(); - if (vma->vm_ops && vma->vm_ops->close) - vma->vm_ops->close(vma); - if (vma->vm_file) - fput(vma->vm_file); - mpol_put(vma_policy(vma)); - if (unreachable) - __vm_area_free(vma); - else - vm_area_free(vma); -} - -static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi, - unsigned long min) -{ - return mas_prev(&vmi->mas, min); -} - /* * check_brk_limits() - Use platform specific check of range & verify mlock * limits. 
@@ -318,875 +214,6 @@ out: return origbrk; } -#if defined(CONFIG_DEBUG_VM_MAPLE_TREE) -static void validate_mm(struct mm_struct *mm) -{ - int bug = 0; - int i = 0; - struct vm_area_struct *vma; - VMA_ITERATOR(vmi, mm, 0); - - mt_validate(&mm->mm_mt); - for_each_vma(vmi, vma) { -#ifdef CONFIG_DEBUG_VM_RB - struct anon_vma *anon_vma = vma->anon_vma; - struct anon_vma_chain *avc; -#endif - unsigned long vmi_start, vmi_end; - bool warn = 0; - - vmi_start = vma_iter_addr(&vmi); - vmi_end = vma_iter_end(&vmi); - if (VM_WARN_ON_ONCE_MM(vma->vm_end != vmi_end, mm)) - warn = 1; - - if (VM_WARN_ON_ONCE_MM(vma->vm_start != vmi_start, mm)) - warn = 1; - - if (warn) { - pr_emerg("issue in %s\n", current->comm); - dump_stack(); - dump_vma(vma); - pr_emerg("tree range: %px start %lx end %lx\n", vma, - vmi_start, vmi_end - 1); - vma_iter_dump_tree(&vmi); - } - -#ifdef CONFIG_DEBUG_VM_RB - if (anon_vma) { - anon_vma_lock_read(anon_vma); - list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) - anon_vma_interval_tree_verify(avc); - anon_vma_unlock_read(anon_vma); - } -#endif - i++; - } - if (i != mm->map_count) { - pr_emerg("map_count %d vma iterator %d\n", mm->map_count, i); - bug = 1; - } - VM_BUG_ON_MM(bug, mm); -} - -#else /* !CONFIG_DEBUG_VM_MAPLE_TREE */ -#define validate_mm(mm) do { } while (0) -#endif /* CONFIG_DEBUG_VM_MAPLE_TREE */ - -/* - * vma has some anon_vma assigned, and is already inserted on that - * anon_vma's interval trees. - * - * Before updating the vma's vm_start / vm_end / vm_pgoff fields, the - * vma must be removed from the anon_vma's interval trees using - * anon_vma_interval_tree_pre_update_vma(). - * - * After the update, the vma will be reinserted using - * anon_vma_interval_tree_post_update_vma(). - * - * The entire update must be protected by exclusive mmap_lock and by - * the root anon_vma's mutex. 
- */ -static inline void -anon_vma_interval_tree_pre_update_vma(struct vm_area_struct *vma) -{ - struct anon_vma_chain *avc; - - list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) - anon_vma_interval_tree_remove(avc, &avc->anon_vma->rb_root); -} - -static inline void -anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma) -{ - struct anon_vma_chain *avc; - - list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) - anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root); -} - -static unsigned long count_vma_pages_range(struct mm_struct *mm, - unsigned long addr, unsigned long end) -{ - VMA_ITERATOR(vmi, mm, addr); - struct vm_area_struct *vma; - unsigned long nr_pages = 0; - - for_each_vma_range(vmi, vma, end) { - unsigned long vm_start = max(addr, vma->vm_start); - unsigned long vm_end = min(end, vma->vm_end); - - nr_pages += PHYS_PFN(vm_end - vm_start); - } - - return nr_pages; -} - -static void __vma_link_file(struct vm_area_struct *vma, - struct address_space *mapping) -{ - if (vma_is_shared_maywrite(vma)) - mapping_allow_writable(mapping); - - flush_dcache_mmap_lock(mapping); - vma_interval_tree_insert(vma, &mapping->i_mmap); - flush_dcache_mmap_unlock(mapping); -} - -static void vma_link_file(struct vm_area_struct *vma) -{ - struct file *file = vma->vm_file; - struct address_space *mapping; - - if (file) { - mapping = file->f_mapping; - i_mmap_lock_write(mapping); - __vma_link_file(vma, mapping); - i_mmap_unlock_write(mapping); - } -} - -static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma) -{ - VMA_ITERATOR(vmi, mm, 0); - - vma_iter_config(&vmi, vma->vm_start, vma->vm_end); - if (vma_iter_prealloc(&vmi, vma)) - return -ENOMEM; - - vma_start_write(vma); - vma_iter_store(&vmi, vma); - vma_link_file(vma); - mm->map_count++; - validate_mm(mm); - return 0; -} - -/* - * init_multi_vma_prep() - Initializer for struct vma_prepare - * @vp: The vma_prepare struct - * @vma: The vma that will be altered once locked - * @next: The next vma if it is to be adjusted - * @remove: The first vma to be removed - * @remove2: The second vma to be removed - */ -static inline void init_multi_vma_prep(struct vma_prepare *vp, - struct vm_area_struct *vma, struct vm_area_struct *next, - struct vm_area_struct *remove, struct vm_area_struct *remove2) -{ - memset(vp, 0, sizeof(struct vma_prepare)); - vp->vma = vma; - vp->anon_vma = vma->anon_vma; - vp->remove = remove; - vp->remove2 = remove2; - vp->adj_next = next; - if (!vp->anon_vma && next) - vp->anon_vma = next->anon_vma; - - vp->file = vma->vm_file; - if (vp->file) - vp->mapping = vma->vm_file->f_mapping; - -} - -/* - * init_vma_prep() - Initializer wrapper for vma_prepare struct - * @vp: The vma_prepare struct - * @vma: The vma that will be altered once locked - */ -static inline void init_vma_prep(struct vma_prepare *vp, - struct vm_area_struct *vma) -{ - init_multi_vma_prep(vp, vma, NULL, NULL, NULL); -} - - -/* - * vma_prepare() - Helper function for handling locking VMAs prior to altering - * @vp: The initialized vma_prepare struct - */ -static inline void vma_prepare(struct vma_prepare *vp) -{ - if (vp->file) { - uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end); - - if (vp->adj_next) - uprobe_munmap(vp->adj_next, vp->adj_next->vm_start, - vp->adj_next->vm_end); - - i_mmap_lock_write(vp->mapping); - if (vp->insert && vp->insert->vm_file) { - /* - * Put into interval tree now, so instantiated pages - * are visible to arm/parisc __flush_dcache_page - * throughout; but we cannot insert into address - * 
space until vma start or end is updated. - */ - __vma_link_file(vp->insert, - vp->insert->vm_file->f_mapping); - } - } - - if (vp->anon_vma) { - anon_vma_lock_write(vp->anon_vma); - anon_vma_interval_tree_pre_update_vma(vp->vma); - if (vp->adj_next) - anon_vma_interval_tree_pre_update_vma(vp->adj_next); - } - - if (vp->file) { - flush_dcache_mmap_lock(vp->mapping); - vma_interval_tree_remove(vp->vma, &vp->mapping->i_mmap); - if (vp->adj_next) - vma_interval_tree_remove(vp->adj_next, - &vp->mapping->i_mmap); - } - -} - -/* - * vma_complete- Helper function for handling the unlocking after altering VMAs, - * or for inserting a VMA. - * - * @vp: The vma_prepare struct - * @vmi: The vma iterator - * @mm: The mm_struct - */ -static inline void vma_complete(struct vma_prepare *vp, - struct vma_iterator *vmi, struct mm_struct *mm) -{ - if (vp->file) { - if (vp->adj_next) - vma_interval_tree_insert(vp->adj_next, - &vp->mapping->i_mmap); - vma_interval_tree_insert(vp->vma, &vp->mapping->i_mmap); - flush_dcache_mmap_unlock(vp->mapping); - } - - if (vp->remove && vp->file) { - __remove_shared_vm_struct(vp->remove, vp->mapping); - if (vp->remove2) - __remove_shared_vm_struct(vp->remove2, vp->mapping); - } else if (vp->insert) { - /* - * split_vma has split insert from vma, and needs - * us to insert it before dropping the locks - * (it may either follow vma or precede it). - */ - vma_iter_store(vmi, vp->insert); - mm->map_count++; - } - - if (vp->anon_vma) { - anon_vma_interval_tree_post_update_vma(vp->vma); - if (vp->adj_next) - anon_vma_interval_tree_post_update_vma(vp->adj_next); - anon_vma_unlock_write(vp->anon_vma); - } - - if (vp->file) { - i_mmap_unlock_write(vp->mapping); - uprobe_mmap(vp->vma); - - if (vp->adj_next) - uprobe_mmap(vp->adj_next); - } - - if (vp->remove) { -again: - vma_mark_detached(vp->remove, true); - if (vp->file) { - uprobe_munmap(vp->remove, vp->remove->vm_start, - vp->remove->vm_end); - fput(vp->file); - } - if (vp->remove->anon_vma) - anon_vma_merge(vp->vma, vp->remove); - mm->map_count--; - mpol_put(vma_policy(vp->remove)); - if (!vp->remove2) - WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end); - vm_area_free(vp->remove); - - /* - * In mprotect's case 6 (see comments on vma_merge), - * we are removing both mid and next vmas - */ - if (vp->remove2) { - vp->remove = vp->remove2; - vp->remove2 = NULL; - goto again; - } - } - if (vp->insert && vp->file) - uprobe_mmap(vp->insert); - validate_mm(mm); -} - -/* - * dup_anon_vma() - Helper function to duplicate anon_vma - * @dst: The destination VMA - * @src: The source VMA - * @dup: Pointer to the destination VMA when successful. - * - * Returns: 0 on success. - */ -static inline int dup_anon_vma(struct vm_area_struct *dst, - struct vm_area_struct *src, struct vm_area_struct **dup) -{ - /* - * Easily overlooked: when mprotect shifts the boundary, make sure the - * expanding vma has anon_vma set if the shrinking vma had, to cover any - * anon pages imported. - */ - if (src->anon_vma && !dst->anon_vma) { - int ret; - - vma_assert_write_locked(dst); - dst->anon_vma = src->anon_vma; - ret = anon_vma_clone(dst, src); - if (ret) - return ret; - - *dup = dst; - } - - return 0; -} - -/* - * vma_expand - Expand an existing VMA - * - * @vmi: The vma iterator - * @vma: The vma to expand - * @start: The start of the vma - * @end: The exclusive end of the vma - * @pgoff: The page offset of vma - * @next: The current of next vma. - * - * Expand @vma to @start and @end. Can expand off the start and end. 
Will - * expand over @next if it's different from @vma and @end == @next->vm_end. - * Checking if the @vma can expand and merge with @next needs to be handled by - * the caller. - * - * Returns: 0 on success - */ -int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, - unsigned long start, unsigned long end, pgoff_t pgoff, - struct vm_area_struct *next) -{ - struct vm_area_struct *anon_dup = NULL; - bool remove_next = false; - struct vma_prepare vp; - - vma_start_write(vma); - if (next && (vma != next) && (end == next->vm_end)) { - int ret; - - remove_next = true; - vma_start_write(next); - ret = dup_anon_vma(vma, next, &anon_dup); - if (ret) - return ret; - } - - init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL); - /* Not merging but overwriting any part of next is not handled. */ - VM_WARN_ON(next && !vp.remove && - next != vma && end > next->vm_start); - /* Only handles expanding */ - VM_WARN_ON(vma->vm_start < start || vma->vm_end > end); - - /* Note: vma iterator must be pointing to 'start' */ - vma_iter_config(vmi, start, end); - if (vma_iter_prealloc(vmi, vma)) - goto nomem; - - vma_prepare(&vp); - vma_adjust_trans_huge(vma, start, end, 0); - vma_set_range(vma, start, end, pgoff); - vma_iter_store(vmi, vma); - - vma_complete(&vp, vmi, vma->vm_mm); - return 0; - -nomem: - if (anon_dup) - unlink_anon_vmas(anon_dup); - return -ENOMEM; -} - -/* - * vma_shrink() - Reduce an existing VMAs memory area - * @vmi: The vma iterator - * @vma: The VMA to modify - * @start: The new start - * @end: The new end - * - * Returns: 0 on success, -ENOMEM otherwise - */ -int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, - unsigned long start, unsigned long end, pgoff_t pgoff) -{ - struct vma_prepare vp; - - WARN_ON((vma->vm_start != start) && (vma->vm_end != end)); - - if (vma->vm_start < start) - vma_iter_config(vmi, vma->vm_start, start); - else - vma_iter_config(vmi, end, vma->vm_end); - - if (vma_iter_prealloc(vmi, NULL)) - return -ENOMEM; - - vma_start_write(vma); - - init_vma_prep(&vp, vma); - vma_prepare(&vp); - vma_adjust_trans_huge(vma, start, end, 0); - - vma_iter_clear(vmi); - vma_set_range(vma, start, end, pgoff); - vma_complete(&vp, vmi, vma->vm_mm); - return 0; -} - -/* - * If the vma has a ->close operation then the driver probably needs to release - * per-vma resources, so we don't attempt to merge those if the caller indicates - * the current vma may be removed as part of the merge. - */ -static inline bool is_mergeable_vma(struct vm_area_struct *vma, - struct file *file, unsigned long vm_flags, - struct vm_userfaultfd_ctx vm_userfaultfd_ctx, - struct anon_vma_name *anon_name, bool may_remove_vma) -{ - /* - * VM_SOFTDIRTY should not prevent from VMA merging, if we - * match the flags but dirty bit -- the caller should mark - * merged VMA as dirty. If dirty bit won't be excluded from - * comparison, we increase pressure on the memory system forcing - * the kernel to generate new VMAs when old one could be - * extended instead. 
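vma_expand() above is the in-kernel primitive for growing an existing mapping in place. A rough userspace illustration (a standalone C sketch, not part of this series; it assumes a Linux /proc filesystem and glibc's sbrk()) is that raising the program break keeps a single, growing [heap] VMA rather than creating a new one each time:

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void show_heap(const char *tag)
{
	FILE *maps = fopen("/proc/self/maps", "r");
	char line[256];

	printf("%s:\n", tag);
	while (maps && fgets(line, sizeof(line), maps))
		if (strstr(line, "[heap]"))
			fputs(line, stdout);	/* one entry; only its end address moves */
	if (maps)
		fclose(maps);
}

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);

	sbrk(page);			/* make sure a heap VMA exists */
	show_heap("before growing brk");
	sbrk(4 * page);			/* grown in place: no additional VMA appears */
	show_heap("after growing brk");
	return 0;
}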
- */ - if ((vma->vm_flags ^ vm_flags) & ~VM_SOFTDIRTY) - return false; - if (vma->vm_file != file) - return false; - if (may_remove_vma && vma->vm_ops && vma->vm_ops->close) - return false; - if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx)) - return false; - if (!anon_vma_name_eq(anon_vma_name(vma), anon_name)) - return false; - return true; -} - -static inline bool is_mergeable_anon_vma(struct anon_vma *anon_vma1, - struct anon_vma *anon_vma2, struct vm_area_struct *vma) -{ - /* - * The list_is_singular() test is to avoid merging VMA cloned from - * parents. This can improve scalability caused by anon_vma lock. - */ - if ((!anon_vma1 || !anon_vma2) && (!vma || - list_is_singular(&vma->anon_vma_chain))) - return true; - return anon_vma1 == anon_vma2; -} - -/* - * Return true if we can merge this (vm_flags,anon_vma,file,vm_pgoff) - * in front of (at a lower virtual address and file offset than) the vma. - * - * We cannot merge two vmas if they have differently assigned (non-NULL) - * anon_vmas, nor if same anon_vma is assigned but offsets incompatible. - * - * We don't check here for the merged mmap wrapping around the end of pagecache - * indices (16TB on ia32) because do_mmap() does not permit mmap's which - * wrap, nor mmaps which cover the final page at index -1UL. - * - * We assume the vma may be removed as part of the merge. - */ -static bool -can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags, - struct anon_vma *anon_vma, struct file *file, - pgoff_t vm_pgoff, struct vm_userfaultfd_ctx vm_userfaultfd_ctx, - struct anon_vma_name *anon_name) -{ - if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name, true) && - is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) { - if (vma->vm_pgoff == vm_pgoff) - return true; - } - return false; -} - -/* - * Return true if we can merge this (vm_flags,anon_vma,file,vm_pgoff) - * beyond (at a higher virtual address and file offset than) the vma. - * - * We cannot merge two vmas if they have differently assigned (non-NULL) - * anon_vmas, nor if same anon_vma is assigned but offsets incompatible. - * - * We assume that vma is not removed as part of the merge. - */ -static bool -can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags, - struct anon_vma *anon_vma, struct file *file, - pgoff_t vm_pgoff, struct vm_userfaultfd_ctx vm_userfaultfd_ctx, - struct anon_vma_name *anon_name) -{ - if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name, false) && - is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) { - pgoff_t vm_pglen; - vm_pglen = vma_pages(vma); - if (vma->vm_pgoff + vm_pglen == vm_pgoff) - return true; - } - return false; -} - -/* - * Given a mapping request (addr,end,vm_flags,file,pgoff,anon_name), - * figure out whether that can be merged with its predecessor or its - * successor. Or both (it neatly fills a hole). - * - * In most cases - when called for mmap, brk or mremap - [addr,end) is - * certain not to be mapped by the time vma_merge is called; but when - * called for mprotect, it is certain to be already mapped (either at - * an offset within prev, or at the start of next), and the flags of - * this area are about to be changed to vm_flags - and the no-change - * case has already been eliminated. 
- * - * The following mprotect cases have to be considered, where **** is - * the area passed down from mprotect_fixup, never extending beyond one - * vma, PPPP is the previous vma, CCCC is a concurrent vma that starts - * at the same address as **** and is of the same or larger span, and - * NNNN the next vma after ****: - * - * **** **** **** - * PPPPPPNNNNNN PPPPPPNNNNNN PPPPPPCCCCCC - * cannot merge might become might become - * PPNNNNNNNNNN PPPPPPPPPPCC - * mmap, brk or case 4 below case 5 below - * mremap move: - * **** **** - * PPPP NNNN PPPPCCCCNNNN - * might become might become - * PPPPPPPPPPPP 1 or PPPPPPPPPPPP 6 or - * PPPPPPPPNNNN 2 or PPPPPPPPNNNN 7 or - * PPPPNNNNNNNN 3 PPPPNNNNNNNN 8 - * - * It is important for case 8 that the vma CCCC overlapping the - * region **** is never going to extended over NNNN. Instead NNNN must - * be extended in region **** and CCCC must be removed. This way in - * all cases where vma_merge succeeds, the moment vma_merge drops the - * rmap_locks, the properties of the merged vma will be already - * correct for the whole merged range. Some of those properties like - * vm_page_prot/vm_flags may be accessed by rmap_walks and they must - * be correct for the whole merged range immediately after the - * rmap_locks are released. Otherwise if NNNN would be removed and - * CCCC would be extended over the NNNN range, remove_migration_ptes - * or other rmap walkers (if working on addresses beyond the "end" - * parameter) may establish ptes with the wrong permissions of CCCC - * instead of the right permissions of NNNN. - * - * In the code below: - * PPPP is represented by *prev - * CCCC is represented by *curr or not represented at all (NULL) - * NNNN is represented by *next or not represented at all (NULL) - * **** is not represented - it will be merged and the vma containing the - * area is returned, or the function will return NULL - */ -static struct vm_area_struct -*vma_merge(struct vma_iterator *vmi, struct vm_area_struct *prev, - struct vm_area_struct *src, unsigned long addr, unsigned long end, - unsigned long vm_flags, pgoff_t pgoff, struct mempolicy *policy, - struct vm_userfaultfd_ctx vm_userfaultfd_ctx, - struct anon_vma_name *anon_name) -{ - struct mm_struct *mm = src->vm_mm; - struct anon_vma *anon_vma = src->anon_vma; - struct file *file = src->vm_file; - struct vm_area_struct *curr, *next, *res; - struct vm_area_struct *vma, *adjust, *remove, *remove2; - struct vm_area_struct *anon_dup = NULL; - struct vma_prepare vp; - pgoff_t vma_pgoff; - int err = 0; - bool merge_prev = false; - bool merge_next = false; - bool vma_expanded = false; - unsigned long vma_start = addr; - unsigned long vma_end = end; - pgoff_t pglen = (end - addr) >> PAGE_SHIFT; - long adj_start = 0; - - /* - * We later require that vma->vm_flags == vm_flags, - * so this tests vma->vm_flags & VM_SPECIAL, too. - */ - if (vm_flags & VM_SPECIAL) - return NULL; - - /* Does the input range span an existing VMA? (cases 5 - 8) */ - curr = find_vma_intersection(mm, prev ? prev->vm_end : 0, end); - - if (!curr || /* cases 1 - 4 */ - end == curr->vm_end) /* cases 6 - 8, adjacent VMA */ - next = vma_lookup(mm, end); - else - next = NULL; /* case 5 */ - - if (prev) { - vma_start = prev->vm_start; - vma_pgoff = prev->vm_pgoff; - - /* Can we merge the predecessor? 
*/ - if (addr == prev->vm_end && mpol_equal(vma_policy(prev), policy) - && can_vma_merge_after(prev, vm_flags, anon_vma, file, - pgoff, vm_userfaultfd_ctx, anon_name)) { - merge_prev = true; - vma_prev(vmi); - } - } - - /* Can we merge the successor? */ - if (next && mpol_equal(policy, vma_policy(next)) && - can_vma_merge_before(next, vm_flags, anon_vma, file, pgoff+pglen, - vm_userfaultfd_ctx, anon_name)) { - merge_next = true; - } - - /* Verify some invariant that must be enforced by the caller. */ - VM_WARN_ON(prev && addr <= prev->vm_start); - VM_WARN_ON(curr && (addr != curr->vm_start || end > curr->vm_end)); - VM_WARN_ON(addr >= end); - - if (!merge_prev && !merge_next) - return NULL; /* Not mergeable. */ - - if (merge_prev) - vma_start_write(prev); - - res = vma = prev; - remove = remove2 = adjust = NULL; - - /* Can we merge both the predecessor and the successor? */ - if (merge_prev && merge_next && - is_mergeable_anon_vma(prev->anon_vma, next->anon_vma, NULL)) { - vma_start_write(next); - remove = next; /* case 1 */ - vma_end = next->vm_end; - err = dup_anon_vma(prev, next, &anon_dup); - if (curr) { /* case 6 */ - vma_start_write(curr); - remove = curr; - remove2 = next; - /* - * Note that the dup_anon_vma below cannot overwrite err - * since the first caller would do nothing unless next - * has an anon_vma. - */ - if (!next->anon_vma) - err = dup_anon_vma(prev, curr, &anon_dup); - } - } else if (merge_prev) { /* case 2 */ - if (curr) { - vma_start_write(curr); - if (end == curr->vm_end) { /* case 7 */ - /* - * can_vma_merge_after() assumed we would not be - * removing prev vma, so it skipped the check - * for vm_ops->close, but we are removing curr - */ - if (curr->vm_ops && curr->vm_ops->close) - err = -EINVAL; - remove = curr; - } else { /* case 5 */ - adjust = curr; - adj_start = (end - curr->vm_start); - } - if (!err) - err = dup_anon_vma(prev, curr, &anon_dup); - } - } else { /* merge_next */ - vma_start_write(next); - res = next; - if (prev && addr < prev->vm_end) { /* case 4 */ - vma_start_write(prev); - vma_end = addr; - adjust = next; - adj_start = -(prev->vm_end - addr); - err = dup_anon_vma(next, prev, &anon_dup); - } else { - /* - * Note that cases 3 and 8 are the ONLY ones where prev - * is permitted to be (but is not necessarily) NULL. - */ - vma = next; /* case 3 */ - vma_start = addr; - vma_end = next->vm_end; - vma_pgoff = next->vm_pgoff - pglen; - if (curr) { /* case 8 */ - vma_pgoff = curr->vm_pgoff; - vma_start_write(curr); - remove = curr; - err = dup_anon_vma(next, curr, &anon_dup); - } - } - } - - /* Error in anon_vma clone. 
*/ - if (err) - goto anon_vma_fail; - - if (vma_start < vma->vm_start || vma_end > vma->vm_end) - vma_expanded = true; - - if (vma_expanded) { - vma_iter_config(vmi, vma_start, vma_end); - } else { - vma_iter_config(vmi, adjust->vm_start + adj_start, - adjust->vm_end); - } - - if (vma_iter_prealloc(vmi, vma)) - goto prealloc_fail; - - init_multi_vma_prep(&vp, vma, adjust, remove, remove2); - VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma && - vp.anon_vma != adjust->anon_vma); - - vma_prepare(&vp); - vma_adjust_trans_huge(vma, vma_start, vma_end, adj_start); - vma_set_range(vma, vma_start, vma_end, vma_pgoff); - - if (vma_expanded) - vma_iter_store(vmi, vma); - - if (adj_start) { - adjust->vm_start += adj_start; - adjust->vm_pgoff += adj_start >> PAGE_SHIFT; - if (adj_start < 0) { - WARN_ON(vma_expanded); - vma_iter_store(vmi, next); - } - } - - vma_complete(&vp, vmi, mm); - khugepaged_enter_vma(res, vm_flags); - return res; - -prealloc_fail: - if (anon_dup) - unlink_anon_vmas(anon_dup); - -anon_vma_fail: - vma_iter_set(vmi, addr); - vma_iter_load(vmi); - return NULL; -} - -/* - * Rough compatibility check to quickly see if it's even worth looking - * at sharing an anon_vma. - * - * They need to have the same vm_file, and the flags can only differ - * in things that mprotect may change. - * - * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that - * we can merge the two vma's. For example, we refuse to merge a vma if - * there is a vm_ops->close() function, because that indicates that the - * driver is doing some kind of reference counting. But that doesn't - * really matter for the anon_vma sharing case. - */ -static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b) -{ - return a->vm_end == b->vm_start && - mpol_equal(vma_policy(a), vma_policy(b)) && - a->vm_file == b->vm_file && - !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) && - b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT); -} - -/* - * Do some basic sanity checking to see if we can re-use the anon_vma - * from 'old'. The 'a'/'b' vma's are in VM order - one of them will be - * the same as 'old', the other will be the new one that is trying - * to share the anon_vma. - * - * NOTE! This runs with mmap_lock held for reading, so it is possible that - * the anon_vma of 'old' is concurrently in the process of being set up - * by another page fault trying to merge _that_. But that's ok: if it - * is being set up, that automatically means that it will be a singleton - * acceptable for merging, so we can do all of this optimistically. But - * we do that READ_ONCE() to make sure that we never re-load the pointer. - * - * IOW: that the "list_is_singular()" test on the anon_vma_chain only - * matters for the 'stable anon_vma' case (ie the thing we want to avoid - * is to return an anon_vma that is "complex" due to having gone through - * a fork). - * - * We also make sure that the two vma's are compatible (adjacent, - * and with the same memory policies). That's all stable, even with just - * a read lock on the mmap_lock. 
- */ -static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_area_struct *a, struct vm_area_struct *b) -{ - if (anon_vma_compatible(a, b)) { - struct anon_vma *anon_vma = READ_ONCE(old->anon_vma); - - if (anon_vma && list_is_singular(&old->anon_vma_chain)) - return anon_vma; - } - return NULL; -} - -/* - * find_mergeable_anon_vma is used by anon_vma_prepare, to check - * neighbouring vmas for a suitable anon_vma, before it goes off - * to allocate a new anon_vma. It checks because a repetitive - * sequence of mprotects and faults may otherwise lead to distinct - * anon_vmas being allocated, preventing vma merge in subsequent - * mprotect. - */ -struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma) -{ - struct anon_vma *anon_vma = NULL; - struct vm_area_struct *prev, *next; - VMA_ITERATOR(vmi, vma->vm_mm, vma->vm_end); - - /* Try next first. */ - next = vma_iter_load(&vmi); - if (next) { - anon_vma = reusable_anon_vma(next, vma, next); - if (anon_vma) - return anon_vma; - } - - prev = vma_prev(&vmi); - VM_BUG_ON_VMA(prev != vma, vma); - prev = vma_prev(&vmi); - /* Try prev next. */ - if (prev) - anon_vma = reusable_anon_vma(prev, prev, vma); - - /* - * We might reach here with anon_vma == NULL if we can't find - * any reusable anon_vma. - * There's no absolute need to look only at touching neighbours: - * we could search further afield for "compatible" anon_vmas. - * But it would probably just be a waste of time searching, - * or lead to too many vmas hanging off the same anon_vma. - * We're trying to allow mprotect remerging later on, - * not trying to minimize memory used for anon_vmas. - */ - return anon_vma; -} - /* * If a hint addr is less than mmap_min_addr change hint to be as * low as possible but still greater than mmap_min_addr @@ -1549,85 +576,6 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg) } #endif /* __ARCH_WANT_SYS_OLD_MMAP */ -static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops) -{ - return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite); -} - -static bool vma_is_shared_writable(struct vm_area_struct *vma) -{ - return (vma->vm_flags & (VM_WRITE | VM_SHARED)) == - (VM_WRITE | VM_SHARED); -} - -static bool vma_fs_can_writeback(struct vm_area_struct *vma) -{ - /* No managed pages to writeback. */ - if (vma->vm_flags & VM_PFNMAP) - return false; - - return vma->vm_file && vma->vm_file->f_mapping && - mapping_can_writeback(vma->vm_file->f_mapping); -} - -/* - * Does this VMA require the underlying folios to have their dirty state - * tracked? - */ -bool vma_needs_dirty_tracking(struct vm_area_struct *vma) -{ - /* Only shared, writable VMAs require dirty tracking. */ - if (!vma_is_shared_writable(vma)) - return false; - - /* Does the filesystem need to be notified? */ - if (vm_ops_needs_writenotify(vma->vm_ops)) - return true; - - /* - * Even if the filesystem doesn't indicate a need for writenotify, if it - * can writeback, dirty tracking is still required. - */ - return vma_fs_can_writeback(vma); -} - -/* - * Some shared mappings will want the pages marked read-only - * to track write events. If so, we'll downgrade vm_page_prot - * to the private version (using protection_map[] without the - * VM_SHARED bit). - */ -bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) -{ - /* If it was private or non-writable, the write bit is already clear */ - if (!vma_is_shared_writable(vma)) - return false; - - /* The backer wishes to know when pages are first written to? 
*/ - if (vm_ops_needs_writenotify(vma->vm_ops)) - return true; - - /* The open routine did something to the protections that pgprot_modify - * won't preserve? */ - if (pgprot_val(vm_page_prot) != - pgprot_val(vm_pgprot_modify(vm_page_prot, vma->vm_flags))) - return false; - - /* - * Do we need to track softdirty? hugetlb does not support softdirty - * tracking yet. - */ - if (vma_soft_dirty_enabled(vma) && !is_vm_hugetlb_page(vma)) - return true; - - /* Do we need write faults for uffd-wp tracking? */ - if (userfaultfd_wp(vma)) - return true; - - /* Can the mapping track the dirty pages? */ - return vma_fs_can_writeback(vma); -} - /* * We account for memory if it's a private writeable mapping, * not hugepages and VM_NORESERVE wasn't set. @@ -2393,443 +1341,6 @@ success: return vma; } -/* - * Ok - we have the memory areas we should free on a maple tree so release them, - * and do the vma updates. - * - * Called with the mm semaphore held. - */ -static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas) -{ - unsigned long nr_accounted = 0; - struct vm_area_struct *vma; - - /* Update high watermark before we lower total_vm */ - update_hiwater_vm(mm); - mas_for_each(mas, vma, ULONG_MAX) { - long nrpages = vma_pages(vma); - - if (vma->vm_flags & VM_ACCOUNT) - nr_accounted += nrpages; - vm_stat_account(mm, vma->vm_flags, -nrpages); - remove_vma(vma, false); - } - vm_unacct_memory(nr_accounted); -} - -/* - * Get rid of page table information in the indicated region. - * - * Called with the mm semaphore held. - */ -static void unmap_region(struct mm_struct *mm, struct ma_state *mas, - struct vm_area_struct *vma, struct vm_area_struct *prev, - struct vm_area_struct *next, unsigned long start, - unsigned long end, unsigned long tree_end, bool mm_wr_locked) -{ - struct mmu_gather tlb; - unsigned long mt_start = mas->index; - - lru_add_drain(); - tlb_gather_mmu(&tlb, mm); - update_hiwater_rss(mm); - unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked); - mas_set(mas, mt_start); - free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, - next ? next->vm_start : USER_PGTABLES_CEILING, - mm_wr_locked); - tlb_finish_mmu(&tlb); -} - -/* - * __split_vma() bypasses sysctl_max_map_count checking. We use this where it - * has already been checked or doesn't make sense to fail. - * VMA Iterator will point to the end VMA. 
- */ -static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, - unsigned long addr, int new_below) -{ - struct vma_prepare vp; - struct vm_area_struct *new; - int err; - - WARN_ON(vma->vm_start >= addr); - WARN_ON(vma->vm_end <= addr); - - if (vma->vm_ops && vma->vm_ops->may_split) { - err = vma->vm_ops->may_split(vma, addr); - if (err) - return err; - } - - new = vm_area_dup(vma); - if (!new) - return -ENOMEM; - - if (new_below) { - new->vm_end = addr; - } else { - new->vm_start = addr; - new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT); - } - - err = -ENOMEM; - vma_iter_config(vmi, new->vm_start, new->vm_end); - if (vma_iter_prealloc(vmi, new)) - goto out_free_vma; - - err = vma_dup_policy(vma, new); - if (err) - goto out_free_vmi; - - err = anon_vma_clone(new, vma); - if (err) - goto out_free_mpol; - - if (new->vm_file) - get_file(new->vm_file); - - if (new->vm_ops && new->vm_ops->open) - new->vm_ops->open(new); - - vma_start_write(vma); - vma_start_write(new); - - init_vma_prep(&vp, vma); - vp.insert = new; - vma_prepare(&vp); - vma_adjust_trans_huge(vma, vma->vm_start, addr, 0); - - if (new_below) { - vma->vm_start = addr; - vma->vm_pgoff += (addr - new->vm_start) >> PAGE_SHIFT; - } else { - vma->vm_end = addr; - } - - /* vma_complete stores the new vma */ - vma_complete(&vp, vmi, vma->vm_mm); - - /* Success. */ - if (new_below) - vma_next(vmi); - return 0; - -out_free_mpol: - mpol_put(vma_policy(new)); -out_free_vmi: - vma_iter_free(vmi); -out_free_vma: - vm_area_free(new); - return err; -} - -/* - * Split a vma into two pieces at address 'addr', a new vma is allocated - * either for the first part or the tail. - */ -static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, - unsigned long addr, int new_below) -{ - if (vma->vm_mm->map_count >= sysctl_max_map_count) - return -ENOMEM; - - return __split_vma(vmi, vma, addr, new_below); -} - -/* - * We are about to modify one or multiple of a VMA's flags, policy, userfaultfd - * context and anonymous VMA name within the range [start, end). - * - * As a result, we might be able to merge the newly modified VMA range with an - * adjacent VMA with identical properties. - * - * If no merge is possible and the range does not span the entirety of the VMA, - * we then need to split the VMA to accommodate the change. - * - * The function returns either the merged VMA, the original VMA if a split was - * required instead, or an error if the split failed. - */ -struct vm_area_struct *vma_modify(struct vma_iterator *vmi, - struct vm_area_struct *prev, - struct vm_area_struct *vma, - unsigned long start, unsigned long end, - unsigned long vm_flags, - struct mempolicy *policy, - struct vm_userfaultfd_ctx uffd_ctx, - struct anon_vma_name *anon_name) -{ - pgoff_t pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); - struct vm_area_struct *merged; - - merged = vma_merge(vmi, prev, vma, start, end, vm_flags, - pgoff, policy, uffd_ctx, anon_name); - if (merged) - return merged; - - if (vma->vm_start < start) { - int err = split_vma(vmi, vma, start, 1); - - if (err) - return ERR_PTR(err); - } - - if (vma->vm_end > end) { - int err = split_vma(vmi, vma, end, 0); - - if (err) - return ERR_PTR(err); - } - - return vma; -} - -/* - * Attempt to merge a newly mapped VMA with those adjacent to it. The caller - * must ensure that [start, end) does not overlap any existing VMA. 
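split_vma() above refuses to push a process past sysctl_max_map_count, and do_vmi_align_munmap() below only lets map_count exceed that limit transiently. Both the limit and the current consumption are visible from userspace; a minimal sketch (standalone C, not part of this series; assumes a Linux /proc filesystem):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/max_map_count", "r");
	long limit = -1;
	int vmas = 0, c;

	if (f) {
		if (fscanf(f, "%ld", &limit) != 1)
			limit = -1;
		fclose(f);
	}

	f = fopen("/proc/self/maps", "r");	/* one line per VMA */
	while (f && (c = fgetc(f)) != EOF)
		if (c == '\n')
			vmas++;
	if (f)
		fclose(f);

	printf("vm.max_map_count = %ld, VMAs in this task = %d\n", limit, vmas);
	return 0;
}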
- */ -static struct vm_area_struct -*vma_merge_new_vma(struct vma_iterator *vmi, struct vm_area_struct *prev, - struct vm_area_struct *vma, unsigned long start, - unsigned long end, pgoff_t pgoff) -{ - return vma_merge(vmi, prev, vma, start, end, vma->vm_flags, pgoff, - vma_policy(vma), vma->vm_userfaultfd_ctx, anon_vma_name(vma)); -} - -/* - * Expand vma by delta bytes, potentially merging with an immediately adjacent - * VMA with identical properties. - */ -struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi, - struct vm_area_struct *vma, - unsigned long delta) -{ - pgoff_t pgoff = vma->vm_pgoff + vma_pages(vma); - - /* vma is specified as prev, so case 1 or 2 will apply. */ - return vma_merge(vmi, vma, vma, vma->vm_end, vma->vm_end + delta, - vma->vm_flags, pgoff, vma_policy(vma), - vma->vm_userfaultfd_ctx, anon_vma_name(vma)); -} - -/* - * do_vmi_align_munmap() - munmap the aligned region from @start to @end. - * @vmi: The vma iterator - * @vma: The starting vm_area_struct - * @mm: The mm_struct - * @start: The aligned start address to munmap. - * @end: The aligned end address to munmap. - * @uf: The userfaultfd list_head - * @unlock: Set to true to drop the mmap_lock. unlocking only happens on - * success. - * - * Return: 0 on success and drops the lock if so directed, error and leaves the - * lock held otherwise. - */ -static int -do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, - struct mm_struct *mm, unsigned long start, - unsigned long end, struct list_head *uf, bool unlock) -{ - struct vm_area_struct *prev, *next = NULL; - struct maple_tree mt_detach; - int count = 0; - int error = -ENOMEM; - unsigned long locked_vm = 0; - MA_STATE(mas_detach, &mt_detach, 0, 0); - mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK); - mt_on_stack(mt_detach); - - /* - * If we need to split any vma, do it now to save pain later. - * - * Note: mremap's move_vma VM_ACCOUNT handling assumes a partially - * unmapped vm_area_struct will remain in use: so lower split_vma - * places tmp vma above, and higher split_vma places tmp vma below. - */ - - /* Does it split the first one? */ - if (start > vma->vm_start) { - - /* - * Make sure that map_count on return from munmap() will - * not exceed its limit; but let map_count go just above - * its limit temporarily, to help free resources as expected. - */ - if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count) - goto map_count_exceeded; - - error = __split_vma(vmi, vma, start, 1); - if (error) - goto start_split_failed; - } - - /* - * Detach a range of VMAs from the mm. Using next as a temp variable as - * it is always overwritten. - */ - next = vma; - do { - /* Does it split the end? */ - if (next->vm_end > end) { - error = __split_vma(vmi, next, end, 0); - if (error) - goto end_split_failed; - } - vma_start_write(next); - mas_set(&mas_detach, count); - error = mas_store_gfp(&mas_detach, next, GFP_KERNEL); - if (error) - goto munmap_gather_failed; - vma_mark_detached(next, true); - if (next->vm_flags & VM_LOCKED) - locked_vm += vma_pages(next); - - count++; - if (unlikely(uf)) { - /* - * If userfaultfd_unmap_prep returns an error the vmas - * will remain split, but userland will get a - * highly unexpected error anyway. This is no - * different than the case where the first of the two - * __split_vma fails, but we don't undo the first - * split, despite we could. This is unlikely enough - * failure that it's not worth optimizing it for. 
- */ - error = userfaultfd_unmap_prep(next, start, end, uf); - - if (error) - goto userfaultfd_error; - } -#ifdef CONFIG_DEBUG_VM_MAPLE_TREE - BUG_ON(next->vm_start < start); - BUG_ON(next->vm_start > end); -#endif - } for_each_vma_range(*vmi, next, end); - -#if defined(CONFIG_DEBUG_VM_MAPLE_TREE) - /* Make sure no VMAs are about to be lost. */ - { - MA_STATE(test, &mt_detach, 0, 0); - struct vm_area_struct *vma_mas, *vma_test; - int test_count = 0; - - vma_iter_set(vmi, start); - rcu_read_lock(); - vma_test = mas_find(&test, count - 1); - for_each_vma_range(*vmi, vma_mas, end) { - BUG_ON(vma_mas != vma_test); - test_count++; - vma_test = mas_next(&test, count - 1); - } - rcu_read_unlock(); - BUG_ON(count != test_count); - } -#endif - - while (vma_iter_addr(vmi) > start) - vma_iter_prev_range(vmi); - - error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL); - if (error) - goto clear_tree_failed; - - /* Point of no return */ - mm->locked_vm -= locked_vm; - mm->map_count -= count; - if (unlock) - mmap_write_downgrade(mm); - - prev = vma_iter_prev_range(vmi); - next = vma_next(vmi); - if (next) - vma_iter_prev_range(vmi); - - /* - * We can free page tables without write-locking mmap_lock because VMAs - * were isolated before we downgraded mmap_lock. - */ - mas_set(&mas_detach, 1); - unmap_region(mm, &mas_detach, vma, prev, next, start, end, count, - !unlock); - /* Statistics and freeing VMAs */ - mas_set(&mas_detach, 0); - remove_mt(mm, &mas_detach); - validate_mm(mm); - if (unlock) - mmap_read_unlock(mm); - - __mt_destroy(&mt_detach); - return 0; - -clear_tree_failed: -userfaultfd_error: -munmap_gather_failed: -end_split_failed: - mas_set(&mas_detach, 0); - mas_for_each(&mas_detach, next, end) - vma_mark_detached(next, false); - - __mt_destroy(&mt_detach); -start_split_failed: -map_count_exceeded: - validate_mm(mm); - return error; -} - -/* - * do_vmi_munmap() - munmap a given range. - * @vmi: The vma iterator - * @mm: The mm_struct - * @start: The start address to munmap - * @len: The length of the range to munmap - * @uf: The userfaultfd list_head - * @unlock: set to true if the user wants to drop the mmap_lock on success - * - * This function takes a @mas that is either pointing to the previous VMA or set - * to MA_START and sets it up to remove the mapping(s). The @len will be - * aligned and any arch_unmap work will be preformed. - * - * Return: 0 on success and drops the lock if so directed, error and leaves the - * lock held otherwise. - */ -int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, - unsigned long start, size_t len, struct list_head *uf, - bool unlock) -{ - unsigned long end; - struct vm_area_struct *vma; - - if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start) - return -EINVAL; - - end = start + PAGE_ALIGN(len); - if (end == start) - return -EINVAL; - - /* - * Check if memory is sealed before arch_unmap. - * Prevent unmapping a sealed VMA. - * can_modify_mm assumes we have acquired the lock on MM. - */ - if (unlikely(!can_modify_mm(mm, start, end))) - return -EPERM; - - /* arch_unmap() might do unmaps itself. */ - arch_unmap(mm, start, end); - - /* Find the first overlapping VMA */ - vma = vma_find(vmi, end); - if (!vma) { - if (unlock) - mmap_write_unlock(mm); - return 0; - } - - return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock); -} - /* do_munmap() - Wrapper function for non-maple tree aware do_munmap() calls. 
* @mm: The mm_struct * @start: The start address to munmap @@ -3490,92 +2001,6 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) return 0; } -/* - * Copy the vma structure to a new location in the same mm, - * prior to moving page table entries, to effect an mremap move. - */ -struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, - unsigned long addr, unsigned long len, pgoff_t pgoff, - bool *need_rmap_locks) -{ - struct vm_area_struct *vma = *vmap; - unsigned long vma_start = vma->vm_start; - struct mm_struct *mm = vma->vm_mm; - struct vm_area_struct *new_vma, *prev; - bool faulted_in_anon_vma = true; - VMA_ITERATOR(vmi, mm, addr); - - /* - * If anonymous vma has not yet been faulted, update new pgoff - * to match new location, to increase its chance of merging. - */ - if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) { - pgoff = addr >> PAGE_SHIFT; - faulted_in_anon_vma = false; - } - - new_vma = find_vma_prev(mm, addr, &prev); - if (new_vma && new_vma->vm_start < addr + len) - return NULL; /* should never get here */ - - new_vma = vma_merge_new_vma(&vmi, prev, vma, addr, addr + len, pgoff); - if (new_vma) { - /* - * Source vma may have been merged into new_vma - */ - if (unlikely(vma_start >= new_vma->vm_start && - vma_start < new_vma->vm_end)) { - /* - * The only way we can get a vma_merge with - * self during an mremap is if the vma hasn't - * been faulted in yet and we were allowed to - * reset the dst vma->vm_pgoff to the - * destination address of the mremap to allow - * the merge to happen. mremap must change the - * vm_pgoff linearity between src and dst vmas - * (in turn preventing a vma_merge) to be - * safe. It is only safe to keep the vm_pgoff - * linear if there are no pages mapped yet. - */ - VM_BUG_ON_VMA(faulted_in_anon_vma, new_vma); - *vmap = vma = new_vma; - } - *need_rmap_locks = (new_vma->vm_pgoff <= vma->vm_pgoff); - } else { - new_vma = vm_area_dup(vma); - if (!new_vma) - goto out; - vma_set_range(new_vma, addr, addr + len, pgoff); - if (vma_dup_policy(vma, new_vma)) - goto out_free_vma; - if (anon_vma_clone(new_vma, vma)) - goto out_free_mempol; - if (new_vma->vm_file) - get_file(new_vma->vm_file); - if (new_vma->vm_ops && new_vma->vm_ops->open) - new_vma->vm_ops->open(new_vma); - if (vma_link(mm, new_vma)) - goto out_vma_link; - *need_rmap_locks = false; - } - return new_vma; - -out_vma_link: - if (new_vma->vm_ops && new_vma->vm_ops->close) - new_vma->vm_ops->close(new_vma); - - if (new_vma->vm_file) - fput(new_vma->vm_file); - - unlink_anon_vmas(new_vma); -out_free_mempol: - mpol_put(vma_policy(new_vma)); -out_free_vma: - vm_area_free(new_vma); -out: - return NULL; -} - /* * Return true if the calling process may expand its vm space by the passed * number of pages @@ -3773,203 +2198,6 @@ int install_special_mapping(struct mm_struct *mm, return PTR_ERR_OR_ZERO(vma); } -static DEFINE_MUTEX(mm_all_locks_mutex); - -static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma) -{ - if (!test_bit(0, (unsigned long *) &anon_vma->root->rb_root.rb_root.rb_node)) { - /* - * The LSB of head.next can't change from under us - * because we hold the mm_all_locks_mutex. - */ - down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_lock); - /* - * We can safely modify head.next after taking the - * anon_vma->root->rwsem. If some other vma in this mm shares - * the same anon_vma we won't take it again. - * - * No need of atomic instructions here, head.next - * can't change from under us thanks to the - * anon_vma->root->rwsem. 
- */ - if (__test_and_set_bit(0, (unsigned long *) - &anon_vma->root->rb_root.rb_root.rb_node)) - BUG(); - } -} - -static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping) -{ - if (!test_bit(AS_MM_ALL_LOCKS, &mapping->flags)) { - /* - * AS_MM_ALL_LOCKS can't change from under us because - * we hold the mm_all_locks_mutex. - * - * Operations on ->flags have to be atomic because - * even if AS_MM_ALL_LOCKS is stable thanks to the - * mm_all_locks_mutex, there may be other cpus - * changing other bitflags in parallel to us. - */ - if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags)) - BUG(); - down_write_nest_lock(&mapping->i_mmap_rwsem, &mm->mmap_lock); - } -} - -/* - * This operation locks against the VM for all pte/vma/mm related - * operations that could ever happen on a certain mm. This includes - * vmtruncate, try_to_unmap, and all page faults. - * - * The caller must take the mmap_lock in write mode before calling - * mm_take_all_locks(). The caller isn't allowed to release the - * mmap_lock until mm_drop_all_locks() returns. - * - * mmap_lock in write mode is required in order to block all operations - * that could modify pagetables and free pages without need of - * altering the vma layout. It's also needed in write mode to avoid new - * anon_vmas to be associated with existing vmas. - * - * A single task can't take more than one mm_take_all_locks() in a row - * or it would deadlock. - * - * The LSB in anon_vma->rb_root.rb_node and the AS_MM_ALL_LOCKS bitflag in - * mapping->flags avoid to take the same lock twice, if more than one - * vma in this mm is backed by the same anon_vma or address_space. - * - * We take locks in following order, accordingly to comment at beginning - * of mm/rmap.c: - * - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for - * hugetlb mapping); - * - all vmas marked locked - * - all i_mmap_rwsem locks; - * - all anon_vma->rwseml - * - * We can take all locks within these types randomly because the VM code - * doesn't nest them and we protected from parallel mm_take_all_locks() by - * mm_all_locks_mutex. - * - * mm_take_all_locks() and mm_drop_all_locks are expensive operations - * that may have to take thousand of locks. - * - * mm_take_all_locks() can fail if it's interrupted by signals. - */ -int mm_take_all_locks(struct mm_struct *mm) -{ - struct vm_area_struct *vma; - struct anon_vma_chain *avc; - VMA_ITERATOR(vmi, mm, 0); - - mmap_assert_write_locked(mm); - - mutex_lock(&mm_all_locks_mutex); - - /* - * vma_start_write() does not have a complement in mm_drop_all_locks() - * because vma_start_write() is always asymmetrical; it marks a VMA as - * being written to until mmap_write_unlock() or mmap_write_downgrade() - * is reached. 
- */ - for_each_vma(vmi, vma) { - if (signal_pending(current)) - goto out_unlock; - vma_start_write(vma); - } - - vma_iter_init(&vmi, mm, 0); - for_each_vma(vmi, vma) { - if (signal_pending(current)) - goto out_unlock; - if (vma->vm_file && vma->vm_file->f_mapping && - is_vm_hugetlb_page(vma)) - vm_lock_mapping(mm, vma->vm_file->f_mapping); - } - - vma_iter_init(&vmi, mm, 0); - for_each_vma(vmi, vma) { - if (signal_pending(current)) - goto out_unlock; - if (vma->vm_file && vma->vm_file->f_mapping && - !is_vm_hugetlb_page(vma)) - vm_lock_mapping(mm, vma->vm_file->f_mapping); - } - - vma_iter_init(&vmi, mm, 0); - for_each_vma(vmi, vma) { - if (signal_pending(current)) - goto out_unlock; - if (vma->anon_vma) - list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) - vm_lock_anon_vma(mm, avc->anon_vma); - } - - return 0; - -out_unlock: - mm_drop_all_locks(mm); - return -EINTR; -} - -static void vm_unlock_anon_vma(struct anon_vma *anon_vma) -{ - if (test_bit(0, (unsigned long *) &anon_vma->root->rb_root.rb_root.rb_node)) { - /* - * The LSB of head.next can't change to 0 from under - * us because we hold the mm_all_locks_mutex. - * - * We must however clear the bitflag before unlocking - * the vma so the users using the anon_vma->rb_root will - * never see our bitflag. - * - * No need of atomic instructions here, head.next - * can't change from under us until we release the - * anon_vma->root->rwsem. - */ - if (!__test_and_clear_bit(0, (unsigned long *) - &anon_vma->root->rb_root.rb_root.rb_node)) - BUG(); - anon_vma_unlock_write(anon_vma); - } -} - -static void vm_unlock_mapping(struct address_space *mapping) -{ - if (test_bit(AS_MM_ALL_LOCKS, &mapping->flags)) { - /* - * AS_MM_ALL_LOCKS can't change to 0 from under us - * because we hold the mm_all_locks_mutex. - */ - i_mmap_unlock_write(mapping); - if (!test_and_clear_bit(AS_MM_ALL_LOCKS, - &mapping->flags)) - BUG(); - } -} - -/* - * The mmap_lock cannot be released by the caller until - * mm_drop_all_locks() returns. - */ -void mm_drop_all_locks(struct mm_struct *mm) -{ - struct vm_area_struct *vma; - struct anon_vma_chain *avc; - VMA_ITERATOR(vmi, mm, 0); - - mmap_assert_write_locked(mm); - BUG_ON(!mutex_is_locked(&mm_all_locks_mutex)); - - for_each_vma(vmi, vma) { - if (vma->anon_vma) - list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) - vm_unlock_anon_vma(avc->anon_vma); - if (vma->vm_file && vma->vm_file->f_mapping) - vm_unlock_mapping(vma->vm_file->f_mapping); - } - - mutex_unlock(&mm_all_locks_mutex); -} - /* * initialise the percpu counter for VM */ diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 8982e6139d07..fc18fe274505 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -19,6 +19,8 @@ #include #include +#include "vma.h" + /* global SRCU for all MMs */ DEFINE_STATIC_SRCU(srcu); diff --git a/mm/vma.c b/mm/vma.c new file mode 100644 index 000000000000..bf0546fe6eab --- /dev/null +++ b/mm/vma.c @@ -0,0 +1,1766 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +/* + * VMA-specific functions. + */ + +#include "vma_internal.h" +#include "vma.h" + +/* + * If the vma has a ->close operation then the driver probably needs to release + * per-vma resources, so we don't attempt to merge those if the caller indicates + * the current vma may be removed as part of the merge. 
+ */ +static inline bool is_mergeable_vma(struct vm_area_struct *vma, + struct file *file, unsigned long vm_flags, + struct vm_userfaultfd_ctx vm_userfaultfd_ctx, + struct anon_vma_name *anon_name, bool may_remove_vma) +{ + /* + * VM_SOFTDIRTY should not prevent from VMA merging, if we + * match the flags but dirty bit -- the caller should mark + * merged VMA as dirty. If dirty bit won't be excluded from + * comparison, we increase pressure on the memory system forcing + * the kernel to generate new VMAs when old one could be + * extended instead. + */ + if ((vma->vm_flags ^ vm_flags) & ~VM_SOFTDIRTY) + return false; + if (vma->vm_file != file) + return false; + if (may_remove_vma && vma->vm_ops && vma->vm_ops->close) + return false; + if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx)) + return false; + if (!anon_vma_name_eq(anon_vma_name(vma), anon_name)) + return false; + return true; +} + +static inline bool is_mergeable_anon_vma(struct anon_vma *anon_vma1, + struct anon_vma *anon_vma2, struct vm_area_struct *vma) +{ + /* + * The list_is_singular() test is to avoid merging VMA cloned from + * parents. This can improve scalability caused by anon_vma lock. + */ + if ((!anon_vma1 || !anon_vma2) && (!vma || + list_is_singular(&vma->anon_vma_chain))) + return true; + return anon_vma1 == anon_vma2; +} + +/* + * init_multi_vma_prep() - Initializer for struct vma_prepare + * @vp: The vma_prepare struct + * @vma: The vma that will be altered once locked + * @next: The next vma if it is to be adjusted + * @remove: The first vma to be removed + * @remove2: The second vma to be removed + */ +static void init_multi_vma_prep(struct vma_prepare *vp, + struct vm_area_struct *vma, + struct vm_area_struct *next, + struct vm_area_struct *remove, + struct vm_area_struct *remove2) +{ + memset(vp, 0, sizeof(struct vma_prepare)); + vp->vma = vma; + vp->anon_vma = vma->anon_vma; + vp->remove = remove; + vp->remove2 = remove2; + vp->adj_next = next; + if (!vp->anon_vma && next) + vp->anon_vma = next->anon_vma; + + vp->file = vma->vm_file; + if (vp->file) + vp->mapping = vma->vm_file->f_mapping; + +} + +/* + * Return true if we can merge this (vm_flags,anon_vma,file,vm_pgoff) + * in front of (at a lower virtual address and file offset than) the vma. + * + * We cannot merge two vmas if they have differently assigned (non-NULL) + * anon_vmas, nor if same anon_vma is assigned but offsets incompatible. + * + * We don't check here for the merged mmap wrapping around the end of pagecache + * indices (16TB on ia32) because do_mmap() does not permit mmap's which + * wrap, nor mmaps which cover the final page at index -1UL. + * + * We assume the vma may be removed as part of the merge. + */ +bool +can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags, + struct anon_vma *anon_vma, struct file *file, + pgoff_t vm_pgoff, struct vm_userfaultfd_ctx vm_userfaultfd_ctx, + struct anon_vma_name *anon_name) +{ + if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name, true) && + is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) { + if (vma->vm_pgoff == vm_pgoff) + return true; + } + return false; +} + +/* + * Return true if we can merge this (vm_flags,anon_vma,file,vm_pgoff) + * beyond (at a higher virtual address and file offset than) the vma. + * + * We cannot merge two vmas if they have differently assigned (non-NULL) + * anon_vmas, nor if same anon_vma is assigned but offsets incompatible. + * + * We assume that vma is not removed as part of the merge. 
+ */ +bool +can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags, + struct anon_vma *anon_vma, struct file *file, + pgoff_t vm_pgoff, struct vm_userfaultfd_ctx vm_userfaultfd_ctx, + struct anon_vma_name *anon_name) +{ + if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name, false) && + is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) { + pgoff_t vm_pglen; + + vm_pglen = vma_pages(vma); + if (vma->vm_pgoff + vm_pglen == vm_pgoff) + return true; + } + return false; +} + +/* + * Close a vm structure and free it. + */ +void remove_vma(struct vm_area_struct *vma, bool unreachable) +{ + might_sleep(); + if (vma->vm_ops && vma->vm_ops->close) + vma->vm_ops->close(vma); + if (vma->vm_file) + fput(vma->vm_file); + mpol_put(vma_policy(vma)); + if (unreachable) + __vm_area_free(vma); + else + vm_area_free(vma); +} + +/* + * Get rid of page table information in the indicated region. + * + * Called with the mm semaphore held. + */ +void unmap_region(struct mm_struct *mm, struct ma_state *mas, + struct vm_area_struct *vma, struct vm_area_struct *prev, + struct vm_area_struct *next, unsigned long start, + unsigned long end, unsigned long tree_end, bool mm_wr_locked) +{ + struct mmu_gather tlb; + unsigned long mt_start = mas->index; + + lru_add_drain(); + tlb_gather_mmu(&tlb, mm); + update_hiwater_rss(mm); + unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked); + mas_set(mas, mt_start); + free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, + next ? next->vm_start : USER_PGTABLES_CEILING, + mm_wr_locked); + tlb_finish_mmu(&tlb); +} + +/* + * __split_vma() bypasses sysctl_max_map_count checking. We use this where it + * has already been checked or doesn't make sense to fail. + * VMA Iterator will point to the end VMA. + */ +static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, + unsigned long addr, int new_below) +{ + struct vma_prepare vp; + struct vm_area_struct *new; + int err; + + WARN_ON(vma->vm_start >= addr); + WARN_ON(vma->vm_end <= addr); + + if (vma->vm_ops && vma->vm_ops->may_split) { + err = vma->vm_ops->may_split(vma, addr); + if (err) + return err; + } + + new = vm_area_dup(vma); + if (!new) + return -ENOMEM; + + if (new_below) { + new->vm_end = addr; + } else { + new->vm_start = addr; + new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT); + } + + err = -ENOMEM; + vma_iter_config(vmi, new->vm_start, new->vm_end); + if (vma_iter_prealloc(vmi, new)) + goto out_free_vma; + + err = vma_dup_policy(vma, new); + if (err) + goto out_free_vmi; + + err = anon_vma_clone(new, vma); + if (err) + goto out_free_mpol; + + if (new->vm_file) + get_file(new->vm_file); + + if (new->vm_ops && new->vm_ops->open) + new->vm_ops->open(new); + + vma_start_write(vma); + vma_start_write(new); + + init_vma_prep(&vp, vma); + vp.insert = new; + vma_prepare(&vp); + vma_adjust_trans_huge(vma, vma->vm_start, addr, 0); + + if (new_below) { + vma->vm_start = addr; + vma->vm_pgoff += (addr - new->vm_start) >> PAGE_SHIFT; + } else { + vma->vm_end = addr; + } + + /* vma_complete stores the new vma */ + vma_complete(&vp, vmi, vma->vm_mm); + + /* Success. */ + if (new_below) + vma_next(vmi); + return 0; + +out_free_mpol: + mpol_put(vma_policy(new)); +out_free_vmi: + vma_iter_free(vmi); +out_free_vma: + vm_area_free(new); + return err; +} + +/* + * Split a vma into two pieces at address 'addr', a new vma is allocated + * either for the first part or the tail. 
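The combined effect of split_vma() and the can_vma_merge_before()/can_vma_merge_after() checks above can be observed from userspace: changing the protection of an interior page splits one anonymous VMA into three, and restoring the original protection lets the pieces merge back. A minimal sketch (standalone C, not part of this series; assumes a Linux /proc filesystem):

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count /proc/self/maps entries overlapping [start, end). */
static int count_vmas_in(void *start, void *end)
{
	FILE *maps = fopen("/proc/self/maps", "r");
	char line[256];
	int n = 0;

	while (maps && fgets(line, sizeof(line), maps)) {
		unsigned long lo, hi;

		if (sscanf(line, "%lx-%lx", &lo, &hi) == 2 &&
		    lo < (unsigned long)end && hi > (unsigned long)start)
			n++;
	}
	if (maps)
		fclose(maps);
	return n;
}

int main(void)
{
	size_t page = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	printf("after mmap:              %d VMA(s)\n", count_vmas_in(p, p + 3 * page));

	/* The middle page now differs from its neighbours: the VMA is split. */
	mprotect(p + page, page, PROT_READ);
	printf("after interior mprotect: %d VMA(s)\n", count_vmas_in(p, p + 3 * page));

	/* Identical properties again: the adjacent VMAs can merge back. */
	mprotect(p + page, page, PROT_READ | PROT_WRITE);
	printf("after restoring prot:    %d VMA(s)\n", count_vmas_in(p, p + 3 * page));
	return 0;
}

On a typical kernel this prints 1, then 3, then 1 VMA(s) for the range.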
+ */ +static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, + unsigned long addr, int new_below) +{ + if (vma->vm_mm->map_count >= sysctl_max_map_count) + return -ENOMEM; + + return __split_vma(vmi, vma, addr, new_below); +} + +/* + * Ok - we have the memory areas we should free on a maple tree so release them, + * and do the vma updates. + * + * Called with the mm semaphore held. + */ +static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas) +{ + unsigned long nr_accounted = 0; + struct vm_area_struct *vma; + + /* Update high watermark before we lower total_vm */ + update_hiwater_vm(mm); + mas_for_each(mas, vma, ULONG_MAX) { + long nrpages = vma_pages(vma); + + if (vma->vm_flags & VM_ACCOUNT) + nr_accounted += nrpages; + vm_stat_account(mm, vma->vm_flags, -nrpages); + remove_vma(vma, false); + } + vm_unacct_memory(nr_accounted); +} + +/* + * init_vma_prep() - Initializer wrapper for vma_prepare struct + * @vp: The vma_prepare struct + * @vma: The vma that will be altered once locked + */ +void init_vma_prep(struct vma_prepare *vp, + struct vm_area_struct *vma) +{ + init_multi_vma_prep(vp, vma, NULL, NULL, NULL); +} + +/* + * Requires inode->i_mapping->i_mmap_rwsem + */ +static void __remove_shared_vm_struct(struct vm_area_struct *vma, + struct address_space *mapping) +{ + if (vma_is_shared_maywrite(vma)) + mapping_unmap_writable(mapping); + + flush_dcache_mmap_lock(mapping); + vma_interval_tree_remove(vma, &mapping->i_mmap); + flush_dcache_mmap_unlock(mapping); +} + +/* + * vma has some anon_vma assigned, and is already inserted on that + * anon_vma's interval trees. + * + * Before updating the vma's vm_start / vm_end / vm_pgoff fields, the + * vma must be removed from the anon_vma's interval trees using + * anon_vma_interval_tree_pre_update_vma(). + * + * After the update, the vma will be reinserted using + * anon_vma_interval_tree_post_update_vma(). + * + * The entire update must be protected by exclusive mmap_lock and by + * the root anon_vma's mutex. + */ +void +anon_vma_interval_tree_pre_update_vma(struct vm_area_struct *vma) +{ + struct anon_vma_chain *avc; + + list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) + anon_vma_interval_tree_remove(avc, &avc->anon_vma->rb_root); +} + +void +anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma) +{ + struct anon_vma_chain *avc; + + list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) + anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root); +} + +static void __vma_link_file(struct vm_area_struct *vma, + struct address_space *mapping) +{ + if (vma_is_shared_maywrite(vma)) + mapping_allow_writable(mapping); + + flush_dcache_mmap_lock(mapping); + vma_interval_tree_insert(vma, &mapping->i_mmap); + flush_dcache_mmap_unlock(mapping); +} + +/* + * vma_prepare() - Helper function for handling locking VMAs prior to altering + * @vp: The initialized vma_prepare struct + */ +void vma_prepare(struct vma_prepare *vp) +{ + if (vp->file) { + uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end); + + if (vp->adj_next) + uprobe_munmap(vp->adj_next, vp->adj_next->vm_start, + vp->adj_next->vm_end); + + i_mmap_lock_write(vp->mapping); + if (vp->insert && vp->insert->vm_file) { + /* + * Put into interval tree now, so instantiated pages + * are visible to arm/parisc __flush_dcache_page + * throughout; but we cannot insert into address + * space until vma start or end is updated. 
+ */ + __vma_link_file(vp->insert, + vp->insert->vm_file->f_mapping); + } + } + + if (vp->anon_vma) { + anon_vma_lock_write(vp->anon_vma); + anon_vma_interval_tree_pre_update_vma(vp->vma); + if (vp->adj_next) + anon_vma_interval_tree_pre_update_vma(vp->adj_next); + } + + if (vp->file) { + flush_dcache_mmap_lock(vp->mapping); + vma_interval_tree_remove(vp->vma, &vp->mapping->i_mmap); + if (vp->adj_next) + vma_interval_tree_remove(vp->adj_next, + &vp->mapping->i_mmap); + } + +} + +/* + * dup_anon_vma() - Helper function to duplicate anon_vma + * @dst: The destination VMA + * @src: The source VMA + * @dup: Pointer to the destination VMA when successful. + * + * Returns: 0 on success. + */ +static int dup_anon_vma(struct vm_area_struct *dst, + struct vm_area_struct *src, struct vm_area_struct **dup) +{ + /* + * Easily overlooked: when mprotect shifts the boundary, make sure the + * expanding vma has anon_vma set if the shrinking vma had, to cover any + * anon pages imported. + */ + if (src->anon_vma && !dst->anon_vma) { + int ret; + + vma_assert_write_locked(dst); + dst->anon_vma = src->anon_vma; + ret = anon_vma_clone(dst, src); + if (ret) + return ret; + + *dup = dst; + } + + return 0; +} + +#ifdef CONFIG_DEBUG_VM_MAPLE_TREE +void validate_mm(struct mm_struct *mm) +{ + int bug = 0; + int i = 0; + struct vm_area_struct *vma; + VMA_ITERATOR(vmi, mm, 0); + + mt_validate(&mm->mm_mt); + for_each_vma(vmi, vma) { +#ifdef CONFIG_DEBUG_VM_RB + struct anon_vma *anon_vma = vma->anon_vma; + struct anon_vma_chain *avc; +#endif + unsigned long vmi_start, vmi_end; + bool warn = 0; + + vmi_start = vma_iter_addr(&vmi); + vmi_end = vma_iter_end(&vmi); + if (VM_WARN_ON_ONCE_MM(vma->vm_end != vmi_end, mm)) + warn = 1; + + if (VM_WARN_ON_ONCE_MM(vma->vm_start != vmi_start, mm)) + warn = 1; + + if (warn) { + pr_emerg("issue in %s\n", current->comm); + dump_stack(); + dump_vma(vma); + pr_emerg("tree range: %px start %lx end %lx\n", vma, + vmi_start, vmi_end - 1); + vma_iter_dump_tree(&vmi); + } + +#ifdef CONFIG_DEBUG_VM_RB + if (anon_vma) { + anon_vma_lock_read(anon_vma); + list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) + anon_vma_interval_tree_verify(avc); + anon_vma_unlock_read(anon_vma); + } +#endif + i++; + } + if (i != mm->map_count) { + pr_emerg("map_count %d vma iterator %d\n", mm->map_count, i); + bug = 1; + } + VM_BUG_ON_MM(bug, mm); +} +#endif /* CONFIG_DEBUG_VM_MAPLE_TREE */ + +/* + * vma_expand - Expand an existing VMA + * + * @vmi: The vma iterator + * @vma: The vma to expand + * @start: The start of the vma + * @end: The exclusive end of the vma + * @pgoff: The page offset of vma + * @next: The current of next vma. + * + * Expand @vma to @start and @end. Can expand off the start and end. Will + * expand over @next if it's different from @vma and @end == @next->vm_end. + * Checking if the @vma can expand and merge with @next needs to be handled by + * the caller. + * + * Returns: 0 on success + */ +int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, + unsigned long start, unsigned long end, pgoff_t pgoff, + struct vm_area_struct *next) +{ + struct vm_area_struct *anon_dup = NULL; + bool remove_next = false; + struct vma_prepare vp; + + vma_start_write(vma); + if (next && (vma != next) && (end == next->vm_end)) { + int ret; + + remove_next = true; + vma_start_write(next); + ret = dup_anon_vma(vma, next, &anon_dup); + if (ret) + return ret; + } + + init_multi_vma_prep(&vp, vma, NULL, remove_next ? 
next : NULL, NULL); + /* Not merging but overwriting any part of next is not handled. */ + VM_WARN_ON(next && !vp.remove && + next != vma && end > next->vm_start); + /* Only handles expanding */ + VM_WARN_ON(vma->vm_start < start || vma->vm_end > end); + + /* Note: vma iterator must be pointing to 'start' */ + vma_iter_config(vmi, start, end); + if (vma_iter_prealloc(vmi, vma)) + goto nomem; + + vma_prepare(&vp); + vma_adjust_trans_huge(vma, start, end, 0); + vma_set_range(vma, start, end, pgoff); + vma_iter_store(vmi, vma); + + vma_complete(&vp, vmi, vma->vm_mm); + return 0; + +nomem: + if (anon_dup) + unlink_anon_vmas(anon_dup); + return -ENOMEM; +} + +/* + * vma_shrink() - Reduce an existing VMAs memory area + * @vmi: The vma iterator + * @vma: The VMA to modify + * @start: The new start + * @end: The new end + * + * Returns: 0 on success, -ENOMEM otherwise + */ +int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, + unsigned long start, unsigned long end, pgoff_t pgoff) +{ + struct vma_prepare vp; + + WARN_ON((vma->vm_start != start) && (vma->vm_end != end)); + + if (vma->vm_start < start) + vma_iter_config(vmi, vma->vm_start, start); + else + vma_iter_config(vmi, end, vma->vm_end); + + if (vma_iter_prealloc(vmi, NULL)) + return -ENOMEM; + + vma_start_write(vma); + + init_vma_prep(&vp, vma); + vma_prepare(&vp); + vma_adjust_trans_huge(vma, start, end, 0); + + vma_iter_clear(vmi); + vma_set_range(vma, start, end, pgoff); + vma_complete(&vp, vmi, vma->vm_mm); + return 0; +} + +/* + * vma_complete- Helper function for handling the unlocking after altering VMAs, + * or for inserting a VMA. + * + * @vp: The vma_prepare struct + * @vmi: The vma iterator + * @mm: The mm_struct + */ +void vma_complete(struct vma_prepare *vp, + struct vma_iterator *vmi, struct mm_struct *mm) +{ + if (vp->file) { + if (vp->adj_next) + vma_interval_tree_insert(vp->adj_next, + &vp->mapping->i_mmap); + vma_interval_tree_insert(vp->vma, &vp->mapping->i_mmap); + flush_dcache_mmap_unlock(vp->mapping); + } + + if (vp->remove && vp->file) { + __remove_shared_vm_struct(vp->remove, vp->mapping); + if (vp->remove2) + __remove_shared_vm_struct(vp->remove2, vp->mapping); + } else if (vp->insert) { + /* + * split_vma has split insert from vma, and needs + * us to insert it before dropping the locks + * (it may either follow vma or precede it). + */ + vma_iter_store(vmi, vp->insert); + mm->map_count++; + } + + if (vp->anon_vma) { + anon_vma_interval_tree_post_update_vma(vp->vma); + if (vp->adj_next) + anon_vma_interval_tree_post_update_vma(vp->adj_next); + anon_vma_unlock_write(vp->anon_vma); + } + + if (vp->file) { + i_mmap_unlock_write(vp->mapping); + uprobe_mmap(vp->vma); + + if (vp->adj_next) + uprobe_mmap(vp->adj_next); + } + + if (vp->remove) { +again: + vma_mark_detached(vp->remove, true); + if (vp->file) { + uprobe_munmap(vp->remove, vp->remove->vm_start, + vp->remove->vm_end); + fput(vp->file); + } + if (vp->remove->anon_vma) + anon_vma_merge(vp->vma, vp->remove); + mm->map_count--; + mpol_put(vma_policy(vp->remove)); + if (!vp->remove2) + WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end); + vm_area_free(vp->remove); + + /* + * In mprotect's case 6 (see comments on vma_merge), + * we are removing both mid and next vmas + */ + if (vp->remove2) { + vp->remove = vp->remove2; + vp->remove2 = NULL; + goto again; + } + } + if (vp->insert && vp->file) + uprobe_mmap(vp->insert); + validate_mm(mm); +} + +/* + * do_vmi_align_munmap() - munmap the aligned region from @start to @end. 
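 *
 * (The VMAs to be unmapped are split as needed and gathered on a separate,
 * on-stack maple tree, so that page tables can be torn down and the VMAs
 * freed after the mmap_lock has been downgraded when @unlock is set.)
 *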
+ * @vmi: The vma iterator + * @vma: The starting vm_area_struct + * @mm: The mm_struct + * @start: The aligned start address to munmap. + * @end: The aligned end address to munmap. + * @uf: The userfaultfd list_head + * @unlock: Set to true to drop the mmap_lock. unlocking only happens on + * success. + * + * Return: 0 on success and drops the lock if so directed, error and leaves the + * lock held otherwise. + */ +int +do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, + struct mm_struct *mm, unsigned long start, + unsigned long end, struct list_head *uf, bool unlock) +{ + struct vm_area_struct *prev, *next = NULL; + struct maple_tree mt_detach; + int count = 0; + int error = -ENOMEM; + unsigned long locked_vm = 0; + MA_STATE(mas_detach, &mt_detach, 0, 0); + mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK); + mt_on_stack(mt_detach); + + /* + * If we need to split any vma, do it now to save pain later. + * + * Note: mremap's move_vma VM_ACCOUNT handling assumes a partially + * unmapped vm_area_struct will remain in use: so lower split_vma + * places tmp vma above, and higher split_vma places tmp vma below. + */ + + /* Does it split the first one? */ + if (start > vma->vm_start) { + + /* + * Make sure that map_count on return from munmap() will + * not exceed its limit; but let map_count go just above + * its limit temporarily, to help free resources as expected. + */ + if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count) + goto map_count_exceeded; + + error = __split_vma(vmi, vma, start, 1); + if (error) + goto start_split_failed; + } + + /* + * Detach a range of VMAs from the mm. Using next as a temp variable as + * it is always overwritten. + */ + next = vma; + do { + /* Does it split the end? */ + if (next->vm_end > end) { + error = __split_vma(vmi, next, end, 0); + if (error) + goto end_split_failed; + } + vma_start_write(next); + mas_set(&mas_detach, count); + error = mas_store_gfp(&mas_detach, next, GFP_KERNEL); + if (error) + goto munmap_gather_failed; + vma_mark_detached(next, true); + if (next->vm_flags & VM_LOCKED) + locked_vm += vma_pages(next); + + count++; + if (unlikely(uf)) { + /* + * If userfaultfd_unmap_prep returns an error the vmas + * will remain split, but userland will get a + * highly unexpected error anyway. This is no + * different than the case where the first of the two + * __split_vma fails, but we don't undo the first + * split, despite we could. This is unlikely enough + * failure that it's not worth optimizing it for. + */ + error = userfaultfd_unmap_prep(next, start, end, uf); + + if (error) + goto userfaultfd_error; + } +#ifdef CONFIG_DEBUG_VM_MAPLE_TREE + BUG_ON(next->vm_start < start); + BUG_ON(next->vm_start > end); +#endif + } for_each_vma_range(*vmi, next, end); + +#if defined(CONFIG_DEBUG_VM_MAPLE_TREE) + /* Make sure no VMAs are about to be lost. 
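 * The debug block below re-walks the detached tree in lockstep with the
 * live tree and BUGs if they ever disagree on a VMA or on the total count.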
*/ + { + MA_STATE(test, &mt_detach, 0, 0); + struct vm_area_struct *vma_mas, *vma_test; + int test_count = 0; + + vma_iter_set(vmi, start); + rcu_read_lock(); + vma_test = mas_find(&test, count - 1); + for_each_vma_range(*vmi, vma_mas, end) { + BUG_ON(vma_mas != vma_test); + test_count++; + vma_test = mas_next(&test, count - 1); + } + rcu_read_unlock(); + BUG_ON(count != test_count); + } +#endif + + while (vma_iter_addr(vmi) > start) + vma_iter_prev_range(vmi); + + error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL); + if (error) + goto clear_tree_failed; + + /* Point of no return */ + mm->locked_vm -= locked_vm; + mm->map_count -= count; + if (unlock) + mmap_write_downgrade(mm); + + prev = vma_iter_prev_range(vmi); + next = vma_next(vmi); + if (next) + vma_iter_prev_range(vmi); + + /* + * We can free page tables without write-locking mmap_lock because VMAs + * were isolated before we downgraded mmap_lock. + */ + mas_set(&mas_detach, 1); + unmap_region(mm, &mas_detach, vma, prev, next, start, end, count, + !unlock); + /* Statistics and freeing VMAs */ + mas_set(&mas_detach, 0); + remove_mt(mm, &mas_detach); + validate_mm(mm); + if (unlock) + mmap_read_unlock(mm); + + __mt_destroy(&mt_detach); + return 0; + +clear_tree_failed: +userfaultfd_error: +munmap_gather_failed: +end_split_failed: + mas_set(&mas_detach, 0); + mas_for_each(&mas_detach, next, end) + vma_mark_detached(next, false); + + __mt_destroy(&mt_detach); +start_split_failed: +map_count_exceeded: + validate_mm(mm); + return error; +} + +/* + * do_vmi_munmap() - munmap a given range. + * @vmi: The vma iterator + * @mm: The mm_struct + * @start: The start address to munmap + * @len: The length of the range to munmap + * @uf: The userfaultfd list_head + * @unlock: set to true if the user wants to drop the mmap_lock on success + * + * This function takes a @mas that is either pointing to the previous VMA or set + * to MA_START and sets it up to remove the mapping(s). The @len will be + * aligned and any arch_unmap work will be preformed. + * + * Return: 0 on success and drops the lock if so directed, error and leaves the + * lock held otherwise. + */ +int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, + unsigned long start, size_t len, struct list_head *uf, + bool unlock) +{ + unsigned long end; + struct vm_area_struct *vma; + + if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start) + return -EINVAL; + + end = start + PAGE_ALIGN(len); + if (end == start) + return -EINVAL; + + /* + * Check if memory is sealed before arch_unmap. + * Prevent unmapping a sealed VMA. + * can_modify_mm assumes we have acquired the lock on MM. + */ + if (unlikely(!can_modify_mm(mm, start, end))) + return -EPERM; + + /* arch_unmap() might do unmaps itself. */ + arch_unmap(mm, start, end); + + /* Find the first overlapping VMA */ + vma = vma_find(vmi, end); + if (!vma) { + if (unlock) + mmap_write_unlock(mm); + return 0; + } + + return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock); +} + +/* + * Given a mapping request (addr,end,vm_flags,file,pgoff,anon_name), + * figure out whether that can be merged with its predecessor or its + * successor. Or both (it neatly fills a hole). 
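 *
 * As a concrete illustration (addresses chosen arbitrarily for the example):
 * with the predecessor spanning [0x1000, 0x3000), the new range spanning
 * [0x3000, 0x4000) and the successor spanning [0x4000, 0x6000), all with
 * identical flags and compatible anon_vmas, case 1 below applies and the
 * result is a single VMA spanning [0x1000, 0x6000): the predecessor is
 * expanded and the successor is removed.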
+ * + * In most cases - when called for mmap, brk or mremap - [addr,end) is + * certain not to be mapped by the time vma_merge is called; but when + * called for mprotect, it is certain to be already mapped (either at + * an offset within prev, or at the start of next), and the flags of + * this area are about to be changed to vm_flags - and the no-change + * case has already been eliminated. + * + * The following mprotect cases have to be considered, where **** is + * the area passed down from mprotect_fixup, never extending beyond one + * vma, PPPP is the previous vma, CCCC is a concurrent vma that starts + * at the same address as **** and is of the same or larger span, and + * NNNN the next vma after ****: + * + * **** **** **** + * PPPPPPNNNNNN PPPPPPNNNNNN PPPPPPCCCCCC + * cannot merge might become might become + * PPNNNNNNNNNN PPPPPPPPPPCC + * mmap, brk or case 4 below case 5 below + * mremap move: + * **** **** + * PPPP NNNN PPPPCCCCNNNN + * might become might become + * PPPPPPPPPPPP 1 or PPPPPPPPPPPP 6 or + * PPPPPPPPNNNN 2 or PPPPPPPPNNNN 7 or + * PPPPNNNNNNNN 3 PPPPNNNNNNNN 8 + * + * It is important for case 8 that the vma CCCC overlapping the + * region **** is never going to extended over NNNN. Instead NNNN must + * be extended in region **** and CCCC must be removed. This way in + * all cases where vma_merge succeeds, the moment vma_merge drops the + * rmap_locks, the properties of the merged vma will be already + * correct for the whole merged range. Some of those properties like + * vm_page_prot/vm_flags may be accessed by rmap_walks and they must + * be correct for the whole merged range immediately after the + * rmap_locks are released. Otherwise if NNNN would be removed and + * CCCC would be extended over the NNNN range, remove_migration_ptes + * or other rmap walkers (if working on addresses beyond the "end" + * parameter) may establish ptes with the wrong permissions of CCCC + * instead of the right permissions of NNNN. + * + * In the code below: + * PPPP is represented by *prev + * CCCC is represented by *curr or not represented at all (NULL) + * NNNN is represented by *next or not represented at all (NULL) + * **** is not represented - it will be merged and the vma containing the + * area is returned, or the function will return NULL + */ +static struct vm_area_struct +*vma_merge(struct vma_iterator *vmi, struct vm_area_struct *prev, + struct vm_area_struct *src, unsigned long addr, unsigned long end, + unsigned long vm_flags, pgoff_t pgoff, struct mempolicy *policy, + struct vm_userfaultfd_ctx vm_userfaultfd_ctx, + struct anon_vma_name *anon_name) +{ + struct mm_struct *mm = src->vm_mm; + struct anon_vma *anon_vma = src->anon_vma; + struct file *file = src->vm_file; + struct vm_area_struct *curr, *next, *res; + struct vm_area_struct *vma, *adjust, *remove, *remove2; + struct vm_area_struct *anon_dup = NULL; + struct vma_prepare vp; + pgoff_t vma_pgoff; + int err = 0; + bool merge_prev = false; + bool merge_next = false; + bool vma_expanded = false; + unsigned long vma_start = addr; + unsigned long vma_end = end; + pgoff_t pglen = (end - addr) >> PAGE_SHIFT; + long adj_start = 0; + + /* + * We later require that vma->vm_flags == vm_flags, + * so this tests vma->vm_flags & VM_SPECIAL, too. + */ + if (vm_flags & VM_SPECIAL) + return NULL; + + /* Does the input range span an existing VMA? (cases 5 - 8) */ + curr = find_vma_intersection(mm, prev ? 
prev->vm_end : 0, end); + + if (!curr || /* cases 1 - 4 */ + end == curr->vm_end) /* cases 6 - 8, adjacent VMA */ + next = vma_lookup(mm, end); + else + next = NULL; /* case 5 */ + + if (prev) { + vma_start = prev->vm_start; + vma_pgoff = prev->vm_pgoff; + + /* Can we merge the predecessor? */ + if (addr == prev->vm_end && mpol_equal(vma_policy(prev), policy) + && can_vma_merge_after(prev, vm_flags, anon_vma, file, + pgoff, vm_userfaultfd_ctx, anon_name)) { + merge_prev = true; + vma_prev(vmi); + } + } + + /* Can we merge the successor? */ + if (next && mpol_equal(policy, vma_policy(next)) && + can_vma_merge_before(next, vm_flags, anon_vma, file, pgoff+pglen, + vm_userfaultfd_ctx, anon_name)) { + merge_next = true; + } + + /* Verify some invariant that must be enforced by the caller. */ + VM_WARN_ON(prev && addr <= prev->vm_start); + VM_WARN_ON(curr && (addr != curr->vm_start || end > curr->vm_end)); + VM_WARN_ON(addr >= end); + + if (!merge_prev && !merge_next) + return NULL; /* Not mergeable. */ + + if (merge_prev) + vma_start_write(prev); + + res = vma = prev; + remove = remove2 = adjust = NULL; + + /* Can we merge both the predecessor and the successor? */ + if (merge_prev && merge_next && + is_mergeable_anon_vma(prev->anon_vma, next->anon_vma, NULL)) { + vma_start_write(next); + remove = next; /* case 1 */ + vma_end = next->vm_end; + err = dup_anon_vma(prev, next, &anon_dup); + if (curr) { /* case 6 */ + vma_start_write(curr); + remove = curr; + remove2 = next; + /* + * Note that the dup_anon_vma below cannot overwrite err + * since the first caller would do nothing unless next + * has an anon_vma. + */ + if (!next->anon_vma) + err = dup_anon_vma(prev, curr, &anon_dup); + } + } else if (merge_prev) { /* case 2 */ + if (curr) { + vma_start_write(curr); + if (end == curr->vm_end) { /* case 7 */ + /* + * can_vma_merge_after() assumed we would not be + * removing prev vma, so it skipped the check + * for vm_ops->close, but we are removing curr + */ + if (curr->vm_ops && curr->vm_ops->close) + err = -EINVAL; + remove = curr; + } else { /* case 5 */ + adjust = curr; + adj_start = (end - curr->vm_start); + } + if (!err) + err = dup_anon_vma(prev, curr, &anon_dup); + } + } else { /* merge_next */ + vma_start_write(next); + res = next; + if (prev && addr < prev->vm_end) { /* case 4 */ + vma_start_write(prev); + vma_end = addr; + adjust = next; + adj_start = -(prev->vm_end - addr); + err = dup_anon_vma(next, prev, &anon_dup); + } else { + /* + * Note that cases 3 and 8 are the ONLY ones where prev + * is permitted to be (but is not necessarily) NULL. + */ + vma = next; /* case 3 */ + vma_start = addr; + vma_end = next->vm_end; + vma_pgoff = next->vm_pgoff - pglen; + if (curr) { /* case 8 */ + vma_pgoff = curr->vm_pgoff; + vma_start_write(curr); + remove = curr; + err = dup_anon_vma(next, curr, &anon_dup); + } + } + } + + /* Error in anon_vma clone. 
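 * On failure, bail out through anon_vma_fail below: it re-positions the
 * VMA iterator at addr and returns NULL (the prealloc_fail path further
 * down additionally unlinks any anon_vma chain duplicated above).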
*/ + if (err) + goto anon_vma_fail; + + if (vma_start < vma->vm_start || vma_end > vma->vm_end) + vma_expanded = true; + + if (vma_expanded) { + vma_iter_config(vmi, vma_start, vma_end); + } else { + vma_iter_config(vmi, adjust->vm_start + adj_start, + adjust->vm_end); + } + + if (vma_iter_prealloc(vmi, vma)) + goto prealloc_fail; + + init_multi_vma_prep(&vp, vma, adjust, remove, remove2); + VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma && + vp.anon_vma != adjust->anon_vma); + + vma_prepare(&vp); + vma_adjust_trans_huge(vma, vma_start, vma_end, adj_start); + vma_set_range(vma, vma_start, vma_end, vma_pgoff); + + if (vma_expanded) + vma_iter_store(vmi, vma); + + if (adj_start) { + adjust->vm_start += adj_start; + adjust->vm_pgoff += adj_start >> PAGE_SHIFT; + if (adj_start < 0) { + WARN_ON(vma_expanded); + vma_iter_store(vmi, next); + } + } + + vma_complete(&vp, vmi, mm); + khugepaged_enter_vma(res, vm_flags); + return res; + +prealloc_fail: + if (anon_dup) + unlink_anon_vmas(anon_dup); + +anon_vma_fail: + vma_iter_set(vmi, addr); + vma_iter_load(vmi); + return NULL; +} + +/* + * We are about to modify one or multiple of a VMA's flags, policy, userfaultfd + * context and anonymous VMA name within the range [start, end). + * + * As a result, we might be able to merge the newly modified VMA range with an + * adjacent VMA with identical properties. + * + * If no merge is possible and the range does not span the entirety of the VMA, + * we then need to split the VMA to accommodate the change. + * + * The function returns either the merged VMA, the original VMA if a split was + * required instead, or an error if the split failed. + */ +struct vm_area_struct *vma_modify(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, unsigned long end, + unsigned long vm_flags, + struct mempolicy *policy, + struct vm_userfaultfd_ctx uffd_ctx, + struct anon_vma_name *anon_name) +{ + pgoff_t pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); + struct vm_area_struct *merged; + + merged = vma_merge(vmi, prev, vma, start, end, vm_flags, + pgoff, policy, uffd_ctx, anon_name); + if (merged) + return merged; + + if (vma->vm_start < start) { + int err = split_vma(vmi, vma, start, 1); + + if (err) + return ERR_PTR(err); + } + + if (vma->vm_end > end) { + int err = split_vma(vmi, vma, end, 0); + + if (err) + return ERR_PTR(err); + } + + return vma; +} + +/* + * Attempt to merge a newly mapped VMA with those adjacent to it. The caller + * must ensure that [start, end) does not overlap any existing VMA. + */ +struct vm_area_struct +*vma_merge_new_vma(struct vma_iterator *vmi, struct vm_area_struct *prev, + struct vm_area_struct *vma, unsigned long start, + unsigned long end, pgoff_t pgoff) +{ + return vma_merge(vmi, prev, vma, start, end, vma->vm_flags, pgoff, + vma_policy(vma), vma->vm_userfaultfd_ctx, anon_vma_name(vma)); +} + +/* + * Expand vma by delta bytes, potentially merging with an immediately adjacent + * VMA with identical properties. + */ +struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi, + struct vm_area_struct *vma, + unsigned long delta) +{ + pgoff_t pgoff = vma->vm_pgoff + vma_pages(vma); + + /* vma is specified as prev, so case 1 or 2 will apply. 
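 * That is, the extension either merges with both vma and the VMA following
 * the extended range (case 1) or with vma alone (case 2).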
*/ + return vma_merge(vmi, vma, vma, vma->vm_end, vma->vm_end + delta, + vma->vm_flags, pgoff, vma_policy(vma), + vma->vm_userfaultfd_ctx, anon_vma_name(vma)); +} + +void unlink_file_vma_batch_init(struct unlink_vma_file_batch *vb) +{ + vb->count = 0; +} + +static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb) +{ + struct address_space *mapping; + int i; + + mapping = vb->vmas[0]->vm_file->f_mapping; + i_mmap_lock_write(mapping); + for (i = 0; i < vb->count; i++) { + VM_WARN_ON_ONCE(vb->vmas[i]->vm_file->f_mapping != mapping); + __remove_shared_vm_struct(vb->vmas[i], mapping); + } + i_mmap_unlock_write(mapping); + + unlink_file_vma_batch_init(vb); +} + +void unlink_file_vma_batch_add(struct unlink_vma_file_batch *vb, + struct vm_area_struct *vma) +{ + if (vma->vm_file == NULL) + return; + + if ((vb->count > 0 && vb->vmas[0]->vm_file != vma->vm_file) || + vb->count == ARRAY_SIZE(vb->vmas)) + unlink_file_vma_batch_process(vb); + + vb->vmas[vb->count] = vma; + vb->count++; +} + +void unlink_file_vma_batch_final(struct unlink_vma_file_batch *vb) +{ + if (vb->count > 0) + unlink_file_vma_batch_process(vb); +} + +/* + * Unlink a file-based vm structure from its interval tree, to hide + * vma from rmap and vmtruncate before freeing its page tables. + */ +void unlink_file_vma(struct vm_area_struct *vma) +{ + struct file *file = vma->vm_file; + + if (file) { + struct address_space *mapping = file->f_mapping; + + i_mmap_lock_write(mapping); + __remove_shared_vm_struct(vma, mapping); + i_mmap_unlock_write(mapping); + } +} + +void vma_link_file(struct vm_area_struct *vma) +{ + struct file *file = vma->vm_file; + struct address_space *mapping; + + if (file) { + mapping = file->f_mapping; + i_mmap_lock_write(mapping); + __vma_link_file(vma, mapping); + i_mmap_unlock_write(mapping); + } +} + +int vma_link(struct mm_struct *mm, struct vm_area_struct *vma) +{ + VMA_ITERATOR(vmi, mm, 0); + + vma_iter_config(&vmi, vma->vm_start, vma->vm_end); + if (vma_iter_prealloc(&vmi, vma)) + return -ENOMEM; + + vma_start_write(vma); + vma_iter_store(&vmi, vma); + vma_link_file(vma); + mm->map_count++; + validate_mm(mm); + return 0; +} + +/* + * Copy the vma structure to a new location in the same mm, + * prior to moving page table entries, to effect an mremap move. + */ +struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, + unsigned long addr, unsigned long len, pgoff_t pgoff, + bool *need_rmap_locks) +{ + struct vm_area_struct *vma = *vmap; + unsigned long vma_start = vma->vm_start; + struct mm_struct *mm = vma->vm_mm; + struct vm_area_struct *new_vma, *prev; + bool faulted_in_anon_vma = true; + VMA_ITERATOR(vmi, mm, addr); + + /* + * If anonymous vma has not yet been faulted, update new pgoff + * to match new location, to increase its chance of merging. 
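 * Resetting vm_pgoff like this is only safe because no anon pages have
 * been faulted in yet, so there is no existing vm_pgoff linearity that
 * rmap relies upon (see the comment further down).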
+ */ + if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) { + pgoff = addr >> PAGE_SHIFT; + faulted_in_anon_vma = false; + } + + new_vma = find_vma_prev(mm, addr, &prev); + if (new_vma && new_vma->vm_start < addr + len) + return NULL; /* should never get here */ + + new_vma = vma_merge_new_vma(&vmi, prev, vma, addr, addr + len, pgoff); + if (new_vma) { + /* + * Source vma may have been merged into new_vma + */ + if (unlikely(vma_start >= new_vma->vm_start && + vma_start < new_vma->vm_end)) { + /* + * The only way we can get a vma_merge with + * self during an mremap is if the vma hasn't + * been faulted in yet and we were allowed to + * reset the dst vma->vm_pgoff to the + * destination address of the mremap to allow + * the merge to happen. mremap must change the + * vm_pgoff linearity between src and dst vmas + * (in turn preventing a vma_merge) to be + * safe. It is only safe to keep the vm_pgoff + * linear if there are no pages mapped yet. + */ + VM_BUG_ON_VMA(faulted_in_anon_vma, new_vma); + *vmap = vma = new_vma; + } + *need_rmap_locks = (new_vma->vm_pgoff <= vma->vm_pgoff); + } else { + new_vma = vm_area_dup(vma); + if (!new_vma) + goto out; + vma_set_range(new_vma, addr, addr + len, pgoff); + if (vma_dup_policy(vma, new_vma)) + goto out_free_vma; + if (anon_vma_clone(new_vma, vma)) + goto out_free_mempol; + if (new_vma->vm_file) + get_file(new_vma->vm_file); + if (new_vma->vm_ops && new_vma->vm_ops->open) + new_vma->vm_ops->open(new_vma); + if (vma_link(mm, new_vma)) + goto out_vma_link; + *need_rmap_locks = false; + } + return new_vma; + +out_vma_link: + if (new_vma->vm_ops && new_vma->vm_ops->close) + new_vma->vm_ops->close(new_vma); + + if (new_vma->vm_file) + fput(new_vma->vm_file); + + unlink_anon_vmas(new_vma); +out_free_mempol: + mpol_put(vma_policy(new_vma)); +out_free_vma: + vm_area_free(new_vma); +out: + return NULL; +} + +/* + * Rough compatibility check to quickly see if it's even worth looking + * at sharing an anon_vma. + * + * They need to have the same vm_file, and the flags can only differ + * in things that mprotect may change. + * + * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that + * we can merge the two vma's. For example, we refuse to merge a vma if + * there is a vm_ops->close() function, because that indicates that the + * driver is doing some kind of reference counting. But that doesn't + * really matter for the anon_vma sharing case. + */ +static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b) +{ + return a->vm_end == b->vm_start && + mpol_equal(vma_policy(a), vma_policy(b)) && + a->vm_file == b->vm_file && + !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) && + b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT); +} + +/* + * Do some basic sanity checking to see if we can re-use the anon_vma + * from 'old'. The 'a'/'b' vma's are in VM order - one of them will be + * the same as 'old', the other will be the new one that is trying + * to share the anon_vma. + * + * NOTE! This runs with mmap_lock held for reading, so it is possible that + * the anon_vma of 'old' is concurrently in the process of being set up + * by another page fault trying to merge _that_. But that's ok: if it + * is being set up, that automatically means that it will be a singleton + * acceptable for merging, so we can do all of this optimistically. But + * we do that READ_ONCE() to make sure that we never re-load the pointer. 
+ * + * IOW: that the "list_is_singular()" test on the anon_vma_chain only + * matters for the 'stable anon_vma' case (ie the thing we want to avoid + * is to return an anon_vma that is "complex" due to having gone through + * a fork). + * + * We also make sure that the two vma's are compatible (adjacent, + * and with the same memory policies). That's all stable, even with just + * a read lock on the mmap_lock. + */ +static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, + struct vm_area_struct *a, + struct vm_area_struct *b) +{ + if (anon_vma_compatible(a, b)) { + struct anon_vma *anon_vma = READ_ONCE(old->anon_vma); + + if (anon_vma && list_is_singular(&old->anon_vma_chain)) + return anon_vma; + } + return NULL; +} + +/* + * find_mergeable_anon_vma is used by anon_vma_prepare, to check + * neighbouring vmas for a suitable anon_vma, before it goes off + * to allocate a new anon_vma. It checks because a repetitive + * sequence of mprotects and faults may otherwise lead to distinct + * anon_vmas being allocated, preventing vma merge in subsequent + * mprotect. + */ +struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma) +{ + struct anon_vma *anon_vma = NULL; + struct vm_area_struct *prev, *next; + VMA_ITERATOR(vmi, vma->vm_mm, vma->vm_end); + + /* Try next first. */ + next = vma_iter_load(&vmi); + if (next) { + anon_vma = reusable_anon_vma(next, vma, next); + if (anon_vma) + return anon_vma; + } + + prev = vma_prev(&vmi); + VM_BUG_ON_VMA(prev != vma, vma); + prev = vma_prev(&vmi); + /* Try prev next. */ + if (prev) + anon_vma = reusable_anon_vma(prev, prev, vma); + + /* + * We might reach here with anon_vma == NULL if we can't find + * any reusable anon_vma. + * There's no absolute need to look only at touching neighbours: + * we could search further afield for "compatible" anon_vmas. + * But it would probably just be a waste of time searching, + * or lead to too many vmas hanging off the same anon_vma. + * We're trying to allow mprotect remerging later on, + * not trying to minimize memory used for anon_vmas. + */ + return anon_vma; +} + +static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops) +{ + return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite); +} + +static bool vma_is_shared_writable(struct vm_area_struct *vma) +{ + return (vma->vm_flags & (VM_WRITE | VM_SHARED)) == + (VM_WRITE | VM_SHARED); +} + +static bool vma_fs_can_writeback(struct vm_area_struct *vma) +{ + /* No managed pages to writeback. */ + if (vma->vm_flags & VM_PFNMAP) + return false; + + return vma->vm_file && vma->vm_file->f_mapping && + mapping_can_writeback(vma->vm_file->f_mapping); +} + +/* + * Does this VMA require the underlying folios to have their dirty state + * tracked? + */ +bool vma_needs_dirty_tracking(struct vm_area_struct *vma) +{ + /* Only shared, writable VMAs require dirty tracking. */ + if (!vma_is_shared_writable(vma)) + return false; + + /* Does the filesystem need to be notified? */ + if (vm_ops_needs_writenotify(vma->vm_ops)) + return true; + + /* + * Even if the filesystem doesn't indicate a need for writenotify, if it + * can writeback, dirty tracking is still required. + */ + return vma_fs_can_writeback(vma); +} + +/* + * Some shared mappings will want the pages marked read-only + * to track write events. If so, we'll downgrade vm_page_prot + * to the private version (using protection_map[] without the + * VM_SHARED bit). 
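 *
 * For illustration, a caller recomputing vma->vm_page_prot (a sketch along
 * the lines of vma_set_page_prot(), not code added here) would do roughly:
 *
 *	unsigned long vm_flags = vma->vm_flags;
 *	pgprot_t prot = vm_pgprot_modify(vma->vm_page_prot, vm_flags);
 *
 *	if (vma_wants_writenotify(vma, prot)) {
 *		vm_flags &= ~VM_SHARED;
 *		prot = vm_pgprot_modify(prot, vm_flags);
 *	}
 *	WRITE_ONCE(vma->vm_page_prot, prot);
 *
 * so that the first write to such a mapping faults and can be tracked.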
+ */ +bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) +{ + /* If it was private or non-writable, the write bit is already clear */ + if (!vma_is_shared_writable(vma)) + return false; + + /* The backer wishes to know when pages are first written to? */ + if (vm_ops_needs_writenotify(vma->vm_ops)) + return true; + + /* The open routine did something to the protections that pgprot_modify + * won't preserve? */ + if (pgprot_val(vm_page_prot) != + pgprot_val(vm_pgprot_modify(vm_page_prot, vma->vm_flags))) + return false; + + /* + * Do we need to track softdirty? hugetlb does not support softdirty + * tracking yet. + */ + if (vma_soft_dirty_enabled(vma) && !is_vm_hugetlb_page(vma)) + return true; + + /* Do we need write faults for uffd-wp tracking? */ + if (userfaultfd_wp(vma)) + return true; + + /* Can the mapping track the dirty pages? */ + return vma_fs_can_writeback(vma); +} + +unsigned long count_vma_pages_range(struct mm_struct *mm, + unsigned long addr, unsigned long end) +{ + VMA_ITERATOR(vmi, mm, addr); + struct vm_area_struct *vma; + unsigned long nr_pages = 0; + + for_each_vma_range(vmi, vma, end) { + unsigned long vm_start = max(addr, vma->vm_start); + unsigned long vm_end = min(end, vma->vm_end); + + nr_pages += PHYS_PFN(vm_end - vm_start); + } + + return nr_pages; +} + +static DEFINE_MUTEX(mm_all_locks_mutex); + +static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma) +{ + if (!test_bit(0, (unsigned long *) &anon_vma->root->rb_root.rb_root.rb_node)) { + /* + * The LSB of head.next can't change from under us + * because we hold the mm_all_locks_mutex. + */ + down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_lock); + /* + * We can safely modify head.next after taking the + * anon_vma->root->rwsem. If some other vma in this mm shares + * the same anon_vma we won't take it again. + * + * No need of atomic instructions here, head.next + * can't change from under us thanks to the + * anon_vma->root->rwsem. + */ + if (__test_and_set_bit(0, (unsigned long *) + &anon_vma->root->rb_root.rb_root.rb_node)) + BUG(); + } +} + +static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping) +{ + if (!test_bit(AS_MM_ALL_LOCKS, &mapping->flags)) { + /* + * AS_MM_ALL_LOCKS can't change from under us because + * we hold the mm_all_locks_mutex. + * + * Operations on ->flags have to be atomic because + * even if AS_MM_ALL_LOCKS is stable thanks to the + * mm_all_locks_mutex, there may be other cpus + * changing other bitflags in parallel to us. + */ + if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags)) + BUG(); + down_write_nest_lock(&mapping->i_mmap_rwsem, &mm->mmap_lock); + } +} + +/* + * This operation locks against the VM for all pte/vma/mm related + * operations that could ever happen on a certain mm. This includes + * vmtruncate, try_to_unmap, and all page faults. + * + * The caller must take the mmap_lock in write mode before calling + * mm_take_all_locks(). The caller isn't allowed to release the + * mmap_lock until mm_drop_all_locks() returns. + * + * mmap_lock in write mode is required in order to block all operations + * that could modify pagetables and free pages without need of + * altering the vma layout. It's also needed in write mode to avoid new + * anon_vmas to be associated with existing vmas. + * + * A single task can't take more than one mm_take_all_locks() in a row + * or it would deadlock. 
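 *
 * A typical call pattern looks roughly like the following (an illustrative
 * sketch, not lifted from any particular caller):
 *
 *	mmap_write_lock(mm);
 *	err = mm_take_all_locks(mm);	(-EINTR if a signal is pending)
 *	if (!err) {
 *		... work on state protected by the VMA, i_mmap and
 *		    anon_vma locks ...
 *		mm_drop_all_locks(mm);
 *	}
 *	mmap_write_unlock(mm);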
+ * + * The LSB in anon_vma->rb_root.rb_node and the AS_MM_ALL_LOCKS bitflag in + * mapping->flags avoid to take the same lock twice, if more than one + * vma in this mm is backed by the same anon_vma or address_space. + * + * We take locks in following order, accordingly to comment at beginning + * of mm/rmap.c: + * - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for + * hugetlb mapping); + * - all vmas marked locked + * - all i_mmap_rwsem locks; + * - all anon_vma->rwseml + * + * We can take all locks within these types randomly because the VM code + * doesn't nest them and we protected from parallel mm_take_all_locks() by + * mm_all_locks_mutex. + * + * mm_take_all_locks() and mm_drop_all_locks are expensive operations + * that may have to take thousand of locks. + * + * mm_take_all_locks() can fail if it's interrupted by signals. + */ +int mm_take_all_locks(struct mm_struct *mm) +{ + struct vm_area_struct *vma; + struct anon_vma_chain *avc; + VMA_ITERATOR(vmi, mm, 0); + + mmap_assert_write_locked(mm); + + mutex_lock(&mm_all_locks_mutex); + + /* + * vma_start_write() does not have a complement in mm_drop_all_locks() + * because vma_start_write() is always asymmetrical; it marks a VMA as + * being written to until mmap_write_unlock() or mmap_write_downgrade() + * is reached. + */ + for_each_vma(vmi, vma) { + if (signal_pending(current)) + goto out_unlock; + vma_start_write(vma); + } + + vma_iter_init(&vmi, mm, 0); + for_each_vma(vmi, vma) { + if (signal_pending(current)) + goto out_unlock; + if (vma->vm_file && vma->vm_file->f_mapping && + is_vm_hugetlb_page(vma)) + vm_lock_mapping(mm, vma->vm_file->f_mapping); + } + + vma_iter_init(&vmi, mm, 0); + for_each_vma(vmi, vma) { + if (signal_pending(current)) + goto out_unlock; + if (vma->vm_file && vma->vm_file->f_mapping && + !is_vm_hugetlb_page(vma)) + vm_lock_mapping(mm, vma->vm_file->f_mapping); + } + + vma_iter_init(&vmi, mm, 0); + for_each_vma(vmi, vma) { + if (signal_pending(current)) + goto out_unlock; + if (vma->anon_vma) + list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) + vm_lock_anon_vma(mm, avc->anon_vma); + } + + return 0; + +out_unlock: + mm_drop_all_locks(mm); + return -EINTR; +} + +static void vm_unlock_anon_vma(struct anon_vma *anon_vma) +{ + if (test_bit(0, (unsigned long *) &anon_vma->root->rb_root.rb_root.rb_node)) { + /* + * The LSB of head.next can't change to 0 from under + * us because we hold the mm_all_locks_mutex. + * + * We must however clear the bitflag before unlocking + * the vma so the users using the anon_vma->rb_root will + * never see our bitflag. + * + * No need of atomic instructions here, head.next + * can't change from under us until we release the + * anon_vma->root->rwsem. + */ + if (!__test_and_clear_bit(0, (unsigned long *) + &anon_vma->root->rb_root.rb_root.rb_node)) + BUG(); + anon_vma_unlock_write(anon_vma); + } +} + +static void vm_unlock_mapping(struct address_space *mapping) +{ + if (test_bit(AS_MM_ALL_LOCKS, &mapping->flags)) { + /* + * AS_MM_ALL_LOCKS can't change to 0 from under us + * because we hold the mm_all_locks_mutex. + */ + i_mmap_unlock_write(mapping); + if (!test_and_clear_bit(AS_MM_ALL_LOCKS, + &mapping->flags)) + BUG(); + } +} + +/* + * The mmap_lock cannot be released by the caller until + * mm_drop_all_locks() returns. 
+ */ +void mm_drop_all_locks(struct mm_struct *mm) +{ + struct vm_area_struct *vma; + struct anon_vma_chain *avc; + VMA_ITERATOR(vmi, mm, 0); + + mmap_assert_write_locked(mm); + BUG_ON(!mutex_is_locked(&mm_all_locks_mutex)); + + for_each_vma(vmi, vma) { + if (vma->anon_vma) + list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) + vm_unlock_anon_vma(avc->anon_vma); + if (vma->vm_file && vma->vm_file->f_mapping) + vm_unlock_mapping(vma->vm_file->f_mapping); + } + + mutex_unlock(&mm_all_locks_mutex); +} diff --git a/mm/vma.h b/mm/vma.h new file mode 100644 index 000000000000..6efdf1768a0a --- /dev/null +++ b/mm/vma.h @@ -0,0 +1,364 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * vma.h + * + * Core VMA manipulation API implemented in vma.c. + */ +#ifndef __MM_VMA_H +#define __MM_VMA_H + +/* + * VMA lock generalization + */ +struct vma_prepare { + struct vm_area_struct *vma; + struct vm_area_struct *adj_next; + struct file *file; + struct address_space *mapping; + struct anon_vma *anon_vma; + struct vm_area_struct *insert; + struct vm_area_struct *remove; + struct vm_area_struct *remove2; +}; + +struct unlink_vma_file_batch { + int count; + struct vm_area_struct *vmas[8]; +}; + +#ifdef CONFIG_DEBUG_VM_MAPLE_TREE +void validate_mm(struct mm_struct *mm); +#else +#define validate_mm(mm) do { } while (0) +#endif + +/* Required for expand_downwards(). */ +void anon_vma_interval_tree_pre_update_vma(struct vm_area_struct *vma); + +/* Required for expand_downwards(). */ +void anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma); + +/* Required for do_brk_flags(). */ +void vma_prepare(struct vma_prepare *vp); + +/* Required for do_brk_flags(). */ +void init_vma_prep(struct vma_prepare *vp, + struct vm_area_struct *vma); + +/* Required for do_brk_flags(). */ +void vma_complete(struct vma_prepare *vp, + struct vma_iterator *vmi, struct mm_struct *mm); + +int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, + unsigned long start, unsigned long end, pgoff_t pgoff, + struct vm_area_struct *next); + +int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, + unsigned long start, unsigned long end, pgoff_t pgoff); + +int +do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, + struct mm_struct *mm, unsigned long start, + unsigned long end, struct list_head *uf, bool unlock); + +int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, + unsigned long start, size_t len, struct list_head *uf, + bool unlock); + +void remove_vma(struct vm_area_struct *vma, bool unreachable); + +void unmap_region(struct mm_struct *mm, struct ma_state *mas, + struct vm_area_struct *vma, struct vm_area_struct *prev, + struct vm_area_struct *next, unsigned long start, + unsigned long end, unsigned long tree_end, bool mm_wr_locked); + +/* Required by mmap_region(). */ +bool +can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags, + struct anon_vma *anon_vma, struct file *file, + pgoff_t vm_pgoff, struct vm_userfaultfd_ctx vm_userfaultfd_ctx, + struct anon_vma_name *anon_name); + +/* Required by mmap_region() and do_brk_flags(). 
*/ +bool +can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags, + struct anon_vma *anon_vma, struct file *file, + pgoff_t vm_pgoff, struct vm_userfaultfd_ctx vm_userfaultfd_ctx, + struct anon_vma_name *anon_name); + +struct vm_area_struct *vma_modify(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, unsigned long end, + unsigned long vm_flags, + struct mempolicy *policy, + struct vm_userfaultfd_ctx uffd_ctx, + struct anon_vma_name *anon_name); + +/* We are about to modify the VMA's flags. */ +static inline struct vm_area_struct +*vma_modify_flags(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, unsigned long end, + unsigned long new_flags) +{ + return vma_modify(vmi, prev, vma, start, end, new_flags, + vma_policy(vma), vma->vm_userfaultfd_ctx, + anon_vma_name(vma)); +} + +/* We are about to modify the VMA's flags and/or anon_name. */ +static inline struct vm_area_struct +*vma_modify_flags_name(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, + unsigned long end, + unsigned long new_flags, + struct anon_vma_name *new_name) +{ + return vma_modify(vmi, prev, vma, start, end, new_flags, + vma_policy(vma), vma->vm_userfaultfd_ctx, new_name); +} + +/* We are about to modify the VMA's memory policy. */ +static inline struct vm_area_struct +*vma_modify_policy(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, unsigned long end, + struct mempolicy *new_pol) +{ + return vma_modify(vmi, prev, vma, start, end, vma->vm_flags, + new_pol, vma->vm_userfaultfd_ctx, anon_vma_name(vma)); +} + +/* We are about to modify the VMA's flags and/or uffd context. 
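 *
 * Like the wrappers above, this funnels into vma_modify(); an mprotect-style
 * caller typically does something along these lines (illustrative sketch,
 * surrounding variables assumed):
 *
 *	vma = vma_modify_flags(vmi, prev, vma, start, end, newflags);
 *	if (IS_ERR(vma))
 *		return PTR_ERR(vma);
 *	... vma now spans exactly [start, end) and can be updated in place ...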
*/ +static inline struct vm_area_struct +*vma_modify_flags_uffd(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, unsigned long end, + unsigned long new_flags, + struct vm_userfaultfd_ctx new_ctx) +{ + return vma_modify(vmi, prev, vma, start, end, new_flags, + vma_policy(vma), new_ctx, anon_vma_name(vma)); +} + +struct vm_area_struct +*vma_merge_new_vma(struct vma_iterator *vmi, struct vm_area_struct *prev, + struct vm_area_struct *vma, unsigned long start, + unsigned long end, pgoff_t pgoff); + +struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi, + struct vm_area_struct *vma, + unsigned long delta); + +void unlink_file_vma_batch_init(struct unlink_vma_file_batch *vb); + +void unlink_file_vma_batch_final(struct unlink_vma_file_batch *vb); + +void unlink_file_vma_batch_add(struct unlink_vma_file_batch *vb, + struct vm_area_struct *vma); + +void unlink_file_vma(struct vm_area_struct *vma); + +void vma_link_file(struct vm_area_struct *vma); + +int vma_link(struct mm_struct *mm, struct vm_area_struct *vma); + +struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, + unsigned long addr, unsigned long len, pgoff_t pgoff, + bool *need_rmap_locks); + +struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma); + +bool vma_needs_dirty_tracking(struct vm_area_struct *vma); +bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); + +int mm_take_all_locks(struct mm_struct *mm); +void mm_drop_all_locks(struct mm_struct *mm); +unsigned long count_vma_pages_range(struct mm_struct *mm, + unsigned long addr, unsigned long end); + +static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) +{ + /* + * We want to check manually if we can change individual PTEs writable + * if we can't do that automatically for all PTEs in a mapping. For + * private mappings, that's always the case when we have write + * permissions as we properly have to handle COW. + */ + if (vma->vm_flags & VM_SHARED) + return vma_wants_writenotify(vma, vma->vm_page_prot); + return !!(vma->vm_flags & VM_WRITE); +} + +#ifdef CONFIG_MMU +static inline pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags) +{ + return pgprot_modify(oldprot, vm_get_page_prot(vm_flags)); +} +#endif + +static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi, + unsigned long min) +{ + return mas_prev(&vmi->mas, min); +} + +static inline int vma_iter_store_gfp(struct vma_iterator *vmi, + struct vm_area_struct *vma, gfp_t gfp) +{ + if (vmi->mas.status != ma_start && + ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start))) + vma_iter_invalidate(vmi); + + __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1); + mas_store_gfp(&vmi->mas, vma, gfp); + if (unlikely(mas_is_err(&vmi->mas))) + return -ENOMEM; + + return 0; +} + + +/* + * These three helpers classifies VMAs for virtual memory accounting. + */ + +/* + * Executable code area - executable, not writable, not stack + */ +static inline bool is_exec_mapping(vm_flags_t flags) +{ + return (flags & (VM_EXEC | VM_WRITE | VM_STACK)) == VM_EXEC; +} + +/* + * Stack area (including shadow stacks) + * + * VM_GROWSUP / VM_GROWSDOWN VMAs are always private anonymous: + * do_mmap() forbids all other combinations. 
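 *
 * Together with is_exec_mapping() above and is_data_mapping() below, this
 * classifier feeds per-mm accounting of the form (an illustrative sketch of
 * the consumer, vm_stat_account()-style):
 *
 *	if (is_exec_mapping(flags))
 *		mm->exec_vm += npages;
 *	else if (is_stack_mapping(flags))
 *		mm->stack_vm += npages;
 *	else if (is_data_mapping(flags))
 *		mm->data_vm += npages;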
+ */ +static inline bool is_stack_mapping(vm_flags_t flags) +{ + return ((flags & VM_STACK) == VM_STACK) || (flags & VM_SHADOW_STACK); +} + +/* + * Data area - private, writable, not stack + */ +static inline bool is_data_mapping(vm_flags_t flags) +{ + return (flags & (VM_WRITE | VM_SHARED | VM_STACK)) == VM_WRITE; +} + + +static inline void vma_iter_config(struct vma_iterator *vmi, + unsigned long index, unsigned long last) +{ + __mas_set_range(&vmi->mas, index, last - 1); +} + +static inline void vma_iter_reset(struct vma_iterator *vmi) +{ + mas_reset(&vmi->mas); +} + +static inline +struct vm_area_struct *vma_iter_prev_range_limit(struct vma_iterator *vmi, unsigned long min) +{ + return mas_prev_range(&vmi->mas, min); +} + +static inline +struct vm_area_struct *vma_iter_next_range_limit(struct vma_iterator *vmi, unsigned long max) +{ + return mas_next_range(&vmi->mas, max); +} + +static inline int vma_iter_area_lowest(struct vma_iterator *vmi, unsigned long min, + unsigned long max, unsigned long size) +{ + return mas_empty_area(&vmi->mas, min, max - 1, size); +} + +static inline int vma_iter_area_highest(struct vma_iterator *vmi, unsigned long min, + unsigned long max, unsigned long size) +{ + return mas_empty_area_rev(&vmi->mas, min, max - 1, size); +} + +/* + * VMA Iterator functions shared between nommu and mmap + */ +static inline int vma_iter_prealloc(struct vma_iterator *vmi, + struct vm_area_struct *vma) +{ + return mas_preallocate(&vmi->mas, vma, GFP_KERNEL); +} + +static inline void vma_iter_clear(struct vma_iterator *vmi) +{ + mas_store_prealloc(&vmi->mas, NULL); +} + +static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi) +{ + return mas_walk(&vmi->mas); +} + +/* Store a VMA with preallocated memory */ +static inline void vma_iter_store(struct vma_iterator *vmi, + struct vm_area_struct *vma) +{ + +#if defined(CONFIG_DEBUG_VM_MAPLE_TREE) + if (MAS_WARN_ON(&vmi->mas, vmi->mas.status != ma_start && + vmi->mas.index > vma->vm_start)) { + pr_warn("%lx > %lx\n store vma %lx-%lx\n into slot %lx-%lx\n", + vmi->mas.index, vma->vm_start, vma->vm_start, + vma->vm_end, vmi->mas.index, vmi->mas.last); + } + if (MAS_WARN_ON(&vmi->mas, vmi->mas.status != ma_start && + vmi->mas.last < vma->vm_start)) { + pr_warn("%lx < %lx\nstore vma %lx-%lx\ninto slot %lx-%lx\n", + vmi->mas.last, vma->vm_start, vma->vm_start, vma->vm_end, + vmi->mas.index, vmi->mas.last); + } +#endif + + if (vmi->mas.status != ma_start && + ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start))) + vma_iter_invalidate(vmi); + + __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1); + mas_store_prealloc(&vmi->mas, vma); +} + +static inline unsigned long vma_iter_addr(struct vma_iterator *vmi) +{ + return vmi->mas.index; +} + +static inline unsigned long vma_iter_end(struct vma_iterator *vmi) +{ + return vmi->mas.last + 1; +} + +static inline int vma_iter_bulk_alloc(struct vma_iterator *vmi, + unsigned long count) +{ + return mas_expected_entries(&vmi->mas, count); +} + +static inline +struct vm_area_struct *vma_iter_prev_range(struct vma_iterator *vmi) +{ + return mas_prev_range(&vmi->mas, 0); +} + +#endif /* __MM_VMA_H */ diff --git a/mm/vma_internal.h b/mm/vma_internal.h new file mode 100644 index 000000000000..14c24d5cb582 --- /dev/null +++ b/mm/vma_internal.h @@ -0,0 +1,50 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * vma_internal.h + * + * Headers required by vma.c, which can be substituted accordingly when testing + * VMA functionality. 
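 * A later patch in this series adds a userland build of vma.c under
 * tools/testing for unit testing; that build is expected to substitute its
 * own version of this header, stubbing out the kernel-only dependencies.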
+ */ + +#ifndef __MM_VMA_INTERNAL_H +#define __MM_VMA_INTERNAL_H + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include "internal.h" + +#endif /* __MM_VMA_INTERNAL_H */ From 802443a44dfff8536f05da2ddd44a293367d2d99 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 29 Jul 2024 12:50:39 +0100 Subject: [PATCH 046/416] MAINTAINERS: add entry for new VMA files The vma files contain logic split from mmap.c for the most part and are all relevant to VMA logic, so maintain the same reviewers for both. Link: https://lkml.kernel.org/r/bf2581cce2b4d210deabb5376c6aa0ad6facf1ff.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Vlastimil Babka Acked-by: Liam R. Howlett Cc: Alexander Viro Cc: Brendan Higgins Cc: Christian Brauner Cc: David Gow Cc: Eric W. Biederman Cc: Jan Kara Cc: Kees Cook Cc: Matthew Wilcox (Oracle) Cc: Rae Moar Cc: SeongJae Park Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Pengfei Xu Signed-off-by: Andrew Morton --- MAINTAINERS | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index fe83ba7194ea..4a7b6e25bc77 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -24411,6 +24411,19 @@ F: include/uapi/linux/vsockmon.h F: net/vmw_vsock/ F: tools/testing/vsock/ +VMA +M: Andrew Morton +R: Liam R. Howlett +R: Vlastimil Babka +R: Lorenzo Stoakes +L: linux-mm@kvack.org +S: Maintained +W: https://www.linux-mm.org +T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm +F: mm/vma.c +F: mm/vma.h +F: mm/vma_internal.h + VMALLOC M: Andrew Morton R: Uladzislau Rezki From 74579d8dab476b66cd28715e73832ab777f20984 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 29 Jul 2024 12:50:40 +0100 Subject: [PATCH 047/416] tools: separate out shared radix-tree components The core components contained within the radix-tree tests which provide shims for kernel headers and access to the maple tree are useful for testing other things, so separate them out and make the radix tree tests dependent on the shared components. This lays the groundwork for us to add VMA tests of the newly introduced vma.c file. Link: https://lkml.kernel.org/r/1ee720c265808168e0d75608e687607d77c36719.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Vlastimil Babka Reviewed-by: Liam R. Howlett Cc: Alexander Viro Cc: Brendan Higgins Cc: Christian Brauner Cc: David Gow Cc: Eric W. 
Biederman Cc: Jan Kara Cc: Kees Cook Cc: Matthew Wilcox (Oracle) Cc: Rae Moar Cc: SeongJae Park Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Pengfei Xu Signed-off-by: Andrew Morton --- tools/testing/radix-tree/.gitignore | 1 + tools/testing/radix-tree/Makefile | 72 ++++--------------- tools/testing/radix-tree/xarray.c | 10 +-- .../generated => shared}/autoconf.h | 0 tools/testing/{radix-tree => shared}/linux.c | 0 .../{radix-tree => shared}/linux/bug.h | 0 .../{radix-tree => shared}/linux/cpu.h | 0 .../{radix-tree => shared}/linux/idr.h | 0 .../{radix-tree => shared}/linux/init.h | 0 .../{radix-tree => shared}/linux/kconfig.h | 0 .../{radix-tree => shared}/linux/kernel.h | 0 .../{radix-tree => shared}/linux/kmemleak.h | 0 .../{radix-tree => shared}/linux/local_lock.h | 0 .../{radix-tree => shared}/linux/lockdep.h | 0 .../{radix-tree => shared}/linux/maple_tree.h | 0 .../{radix-tree => shared}/linux/percpu.h | 0 .../{radix-tree => shared}/linux/preempt.h | 0 .../{radix-tree => shared}/linux/radix-tree.h | 0 .../{radix-tree => shared}/linux/rcupdate.h | 0 .../{radix-tree => shared}/linux/xarray.h | 0 tools/testing/shared/maple-shared.h | 9 +++ tools/testing/shared/maple-shim.c | 7 ++ tools/testing/shared/shared.h | 33 +++++++++ tools/testing/shared/shared.mk | 72 +++++++++++++++++++ .../trace/events/maple_tree.h | 0 tools/testing/shared/xarray-shared.c | 5 ++ tools/testing/shared/xarray-shared.h | 4 ++ 27 files changed, 144 insertions(+), 69 deletions(-) rename tools/testing/{radix-tree/generated => shared}/autoconf.h (100%) rename tools/testing/{radix-tree => shared}/linux.c (100%) rename tools/testing/{radix-tree => shared}/linux/bug.h (100%) rename tools/testing/{radix-tree => shared}/linux/cpu.h (100%) rename tools/testing/{radix-tree => shared}/linux/idr.h (100%) rename tools/testing/{radix-tree => shared}/linux/init.h (100%) rename tools/testing/{radix-tree => shared}/linux/kconfig.h (100%) rename tools/testing/{radix-tree => shared}/linux/kernel.h (100%) rename tools/testing/{radix-tree => shared}/linux/kmemleak.h (100%) rename tools/testing/{radix-tree => shared}/linux/local_lock.h (100%) rename tools/testing/{radix-tree => shared}/linux/lockdep.h (100%) rename tools/testing/{radix-tree => shared}/linux/maple_tree.h (100%) rename tools/testing/{radix-tree => shared}/linux/percpu.h (100%) rename tools/testing/{radix-tree => shared}/linux/preempt.h (100%) rename tools/testing/{radix-tree => shared}/linux/radix-tree.h (100%) rename tools/testing/{radix-tree => shared}/linux/rcupdate.h (100%) rename tools/testing/{radix-tree => shared}/linux/xarray.h (100%) create mode 100644 tools/testing/shared/maple-shared.h create mode 100644 tools/testing/shared/maple-shim.c create mode 100644 tools/testing/shared/shared.h create mode 100644 tools/testing/shared/shared.mk rename tools/testing/{radix-tree => shared}/trace/events/maple_tree.h (100%) create mode 100644 tools/testing/shared/xarray-shared.c create mode 100644 tools/testing/shared/xarray-shared.h diff --git a/tools/testing/radix-tree/.gitignore b/tools/testing/radix-tree/.gitignore index 49bccb90c35b..ce167a761981 100644 --- a/tools/testing/radix-tree/.gitignore +++ b/tools/testing/radix-tree/.gitignore @@ -1,4 +1,5 @@ # SPDX-License-Identifier: GPL-2.0-only +generated/autoconf.h generated/bit-length.h generated/map-shift.h idr.c diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile index d1acd7d58850..8b3591a51e1f 100644 --- a/tools/testing/radix-tree/Makefile +++ b/tools/testing/radix-tree/Makefile @@ -1,77 
+1,29 @@ # SPDX-License-Identifier: GPL-2.0 -CFLAGS += -I. -I../../include -I../../../lib -g -Og -Wall \ - -D_LGPL_SOURCE -fsanitize=address -fsanitize=undefined -LDFLAGS += -fsanitize=address -fsanitize=undefined -LDLIBS+= -lpthread -lurcu +.PHONY: clean + TARGETS = main idr-test multiorder xarray maple -LIBS := slab.o find_bit.o bitmap.o hweight.o vsprintf.o -CORE_OFILES := xarray.o radix-tree.o idr.o linux.o test.o maple.o $(LIBS) -OFILES = main.o $(CORE_OFILES) regression1.o regression2.o regression3.o \ - regression4.o tag_check.o multiorder.o idr-test.o iteration_check.o \ - iteration_check_2.o benchmark.o - -ifndef SHIFT - SHIFT=3 -endif - -ifeq ($(BUILD), 32) - CFLAGS += -m32 - LDFLAGS += -m32 -LONG_BIT := 32 -endif - -ifndef LONG_BIT -LONG_BIT := $(shell getconf LONG_BIT) -endif +CORE_OFILES = $(SHARED_OFILES) xarray.o maple.o test.o +OFILES = main.o $(CORE_OFILES) regression1.o regression2.o \ + regression3.o regression4.o tag_check.o multiorder.o idr-test.o \ + iteration_check.o iteration_check_2.o benchmark.o targets: generated/map-shift.h generated/bit-length.h $(TARGETS) +include ../shared/shared.mk + main: $(OFILES) idr-test.o: ../../../lib/test_ida.c idr-test: idr-test.o $(CORE_OFILES) -xarray: $(CORE_OFILES) +xarray: $(CORE_OFILES) xarray.o -maple: $(CORE_OFILES) +maple: $(CORE_OFILES) maple.o multiorder: multiorder.o $(CORE_OFILES) clean: - $(RM) $(TARGETS) *.o radix-tree.c idr.c generated/map-shift.h generated/bit-length.h + $(RM) $(TARGETS) *.o radix-tree.c idr.c generated/* -vpath %.c ../../lib - -$(OFILES): Makefile *.h */*.h generated/map-shift.h generated/bit-length.h \ - ../../include/linux/*.h \ - ../../include/asm/*.h \ - ../../../include/linux/xarray.h \ - ../../../include/linux/maple_tree.h \ - ../../../include/linux/radix-tree.h \ - ../../../lib/radix-tree.h \ - ../../../include/linux/idr.h - -radix-tree.c: ../../../lib/radix-tree.c - sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@ - -idr.c: ../../../lib/idr.c - sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@ - -xarray.o: ../../../lib/xarray.c ../../../lib/test_xarray.c - -maple.o: ../../../lib/maple_tree.c ../../../lib/test_maple_tree.c - -generated/map-shift.h: - @if ! grep -qws $(SHIFT) generated/map-shift.h; then \ - echo "#define XA_CHUNK_SHIFT $(SHIFT)" > \ - generated/map-shift.h; \ - fi - -generated/bit-length.h: FORCE - @if ! 
grep -qws CONFIG_$(LONG_BIT)BIT generated/bit-length.h; then \ - echo "Generating $@"; \ - echo "#define CONFIG_$(LONG_BIT)BIT 1" > $@; \ - fi - -FORCE: ; +$(OFILES): $(SHARED_DEPS) *.h diff --git a/tools/testing/radix-tree/xarray.c b/tools/testing/radix-tree/xarray.c index d0e53bff1eb6..253208a8541b 100644 --- a/tools/testing/radix-tree/xarray.c +++ b/tools/testing/radix-tree/xarray.c @@ -4,17 +4,9 @@ * Copyright (c) 2018 Matthew Wilcox */ -#define XA_DEBUG +#include "xarray-shared.h" #include "test.h" -#define module_init(x) -#define module_exit(x) -#define MODULE_AUTHOR(x) -#define MODULE_DESCRIPTION(X) -#define MODULE_LICENSE(x) -#define dump_stack() assert(0) - -#include "../../../lib/xarray.c" #undef XA_DEBUG #include "../../../lib/test_xarray.c" diff --git a/tools/testing/radix-tree/generated/autoconf.h b/tools/testing/shared/autoconf.h similarity index 100% rename from tools/testing/radix-tree/generated/autoconf.h rename to tools/testing/shared/autoconf.h diff --git a/tools/testing/radix-tree/linux.c b/tools/testing/shared/linux.c similarity index 100% rename from tools/testing/radix-tree/linux.c rename to tools/testing/shared/linux.c diff --git a/tools/testing/radix-tree/linux/bug.h b/tools/testing/shared/linux/bug.h similarity index 100% rename from tools/testing/radix-tree/linux/bug.h rename to tools/testing/shared/linux/bug.h diff --git a/tools/testing/radix-tree/linux/cpu.h b/tools/testing/shared/linux/cpu.h similarity index 100% rename from tools/testing/radix-tree/linux/cpu.h rename to tools/testing/shared/linux/cpu.h diff --git a/tools/testing/radix-tree/linux/idr.h b/tools/testing/shared/linux/idr.h similarity index 100% rename from tools/testing/radix-tree/linux/idr.h rename to tools/testing/shared/linux/idr.h diff --git a/tools/testing/radix-tree/linux/init.h b/tools/testing/shared/linux/init.h similarity index 100% rename from tools/testing/radix-tree/linux/init.h rename to tools/testing/shared/linux/init.h diff --git a/tools/testing/radix-tree/linux/kconfig.h b/tools/testing/shared/linux/kconfig.h similarity index 100% rename from tools/testing/radix-tree/linux/kconfig.h rename to tools/testing/shared/linux/kconfig.h diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/shared/linux/kernel.h similarity index 100% rename from tools/testing/radix-tree/linux/kernel.h rename to tools/testing/shared/linux/kernel.h diff --git a/tools/testing/radix-tree/linux/kmemleak.h b/tools/testing/shared/linux/kmemleak.h similarity index 100% rename from tools/testing/radix-tree/linux/kmemleak.h rename to tools/testing/shared/linux/kmemleak.h diff --git a/tools/testing/radix-tree/linux/local_lock.h b/tools/testing/shared/linux/local_lock.h similarity index 100% rename from tools/testing/radix-tree/linux/local_lock.h rename to tools/testing/shared/linux/local_lock.h diff --git a/tools/testing/radix-tree/linux/lockdep.h b/tools/testing/shared/linux/lockdep.h similarity index 100% rename from tools/testing/radix-tree/linux/lockdep.h rename to tools/testing/shared/linux/lockdep.h diff --git a/tools/testing/radix-tree/linux/maple_tree.h b/tools/testing/shared/linux/maple_tree.h similarity index 100% rename from tools/testing/radix-tree/linux/maple_tree.h rename to tools/testing/shared/linux/maple_tree.h diff --git a/tools/testing/radix-tree/linux/percpu.h b/tools/testing/shared/linux/percpu.h similarity index 100% rename from tools/testing/radix-tree/linux/percpu.h rename to tools/testing/shared/linux/percpu.h diff --git a/tools/testing/radix-tree/linux/preempt.h 
b/tools/testing/shared/linux/preempt.h similarity index 100% rename from tools/testing/radix-tree/linux/preempt.h rename to tools/testing/shared/linux/preempt.h diff --git a/tools/testing/radix-tree/linux/radix-tree.h b/tools/testing/shared/linux/radix-tree.h similarity index 100% rename from tools/testing/radix-tree/linux/radix-tree.h rename to tools/testing/shared/linux/radix-tree.h diff --git a/tools/testing/radix-tree/linux/rcupdate.h b/tools/testing/shared/linux/rcupdate.h similarity index 100% rename from tools/testing/radix-tree/linux/rcupdate.h rename to tools/testing/shared/linux/rcupdate.h diff --git a/tools/testing/radix-tree/linux/xarray.h b/tools/testing/shared/linux/xarray.h similarity index 100% rename from tools/testing/radix-tree/linux/xarray.h rename to tools/testing/shared/linux/xarray.h diff --git a/tools/testing/shared/maple-shared.h b/tools/testing/shared/maple-shared.h new file mode 100644 index 000000000000..3d847edd149d --- /dev/null +++ b/tools/testing/shared/maple-shared.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ + +#define CONFIG_DEBUG_MAPLE_TREE +#define CONFIG_MAPLE_SEARCH +#define MAPLE_32BIT (MAPLE_NODE_SLOTS > 31) +#include "shared.h" +#include +#include +#include "linux/init.h" diff --git a/tools/testing/shared/maple-shim.c b/tools/testing/shared/maple-shim.c new file mode 100644 index 000000000000..640df76f483e --- /dev/null +++ b/tools/testing/shared/maple-shim.c @@ -0,0 +1,7 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +/* Very simple shim around the maple tree. */ + +#include "maple-shared.h" + +#include "../../../lib/maple_tree.c" diff --git a/tools/testing/shared/shared.h b/tools/testing/shared/shared.h new file mode 100644 index 000000000000..f08f683812ad --- /dev/null +++ b/tools/testing/shared/shared.h @@ -0,0 +1,33 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#include +#include +#include +#include + +#include +#include + +#ifndef module_init +#define module_init(x) +#endif + +#ifndef module_exit +#define module_exit(x) +#endif + +#ifndef MODULE_AUTHOR +#define MODULE_AUTHOR(x) +#endif + +#ifndef MODULE_LICENSE +#define MODULE_LICENSE(x) +#endif + +#ifndef MODULE_DESCRIPTION +#define MODULE_DESCRIPTION(x) +#endif + +#ifndef dump_stack +#define dump_stack() assert(0) +#endif diff --git a/tools/testing/shared/shared.mk b/tools/testing/shared/shared.mk new file mode 100644 index 000000000000..a05f0588513a --- /dev/null +++ b/tools/testing/shared/shared.mk @@ -0,0 +1,72 @@ +# SPDX-License-Identifier: GPL-2.0 + +CFLAGS += -I../shared -I. 
-I../../include -I../../../lib -g -Og -Wall \ + -D_LGPL_SOURCE -fsanitize=address -fsanitize=undefined +LDFLAGS += -fsanitize=address -fsanitize=undefined +LDLIBS += -lpthread -lurcu +LIBS := slab.o find_bit.o bitmap.o hweight.o vsprintf.o +SHARED_OFILES = xarray-shared.o radix-tree.o idr.o linux.o $(LIBS) + +SHARED_DEPS = Makefile ../shared/shared.mk ../shared/*.h generated/map-shift.h \ + generated/bit-length.h generated/autoconf.h \ + ../../include/linux/*.h \ + ../../include/asm/*.h \ + ../../../include/linux/xarray.h \ + ../../../include/linux/maple_tree.h \ + ../../../include/linux/radix-tree.h \ + ../../../lib/radix-tree.h \ + ../../../include/linux/idr.h + +ifndef SHIFT + SHIFT=3 +endif + +ifeq ($(BUILD), 32) + CFLAGS += -m32 + LDFLAGS += -m32 +LONG_BIT := 32 +endif + +ifndef LONG_BIT +LONG_BIT := $(shell getconf LONG_BIT) +endif + +%.o: ../shared/%.c + $(CC) -c $(CFLAGS) $< -o $@ + +vpath %.c ../../lib + +$(SHARED_OFILES): $(SHARED_DEPS) + +radix-tree.c: ../../../lib/radix-tree.c + sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@ + +idr.c: ../../../lib/idr.c + sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@ + +xarray-shared.o: ../shared/xarray-shared.c ../../../lib/xarray.c \ + ../../../lib/test_xarray.c + +maple-shared.o: ../shared/maple-shared.c ../../../lib/maple_tree.c \ + ../../../lib/test_maple_tree.c + +generated/autoconf.h: + @mkdir -p generated + cp ../shared/autoconf.h generated/autoconf.h + +generated/map-shift.h: + @mkdir -p generated + @if ! grep -qws $(SHIFT) generated/map-shift.h; then \ + echo "Generating $@"; \ + echo "#define XA_CHUNK_SHIFT $(SHIFT)" > \ + generated/map-shift.h; \ + fi + +generated/bit-length.h: FORCE + @mkdir -p generated + @if ! grep -qws CONFIG_$(LONG_BIT)BIT generated/bit-length.h; then \ + echo "Generating $@"; \ + echo "#define CONFIG_$(LONG_BIT)BIT 1" > $@; \ + fi + +FORCE: ; diff --git a/tools/testing/radix-tree/trace/events/maple_tree.h b/tools/testing/shared/trace/events/maple_tree.h similarity index 100% rename from tools/testing/radix-tree/trace/events/maple_tree.h rename to tools/testing/shared/trace/events/maple_tree.h diff --git a/tools/testing/shared/xarray-shared.c b/tools/testing/shared/xarray-shared.c new file mode 100644 index 000000000000..e90901958dcd --- /dev/null +++ b/tools/testing/shared/xarray-shared.c @@ -0,0 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include "xarray-shared.h" + +#include "../../../lib/xarray.c" diff --git a/tools/testing/shared/xarray-shared.h b/tools/testing/shared/xarray-shared.h new file mode 100644 index 000000000000..ac2d16ff53ae --- /dev/null +++ b/tools/testing/shared/xarray-shared.h @@ -0,0 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ + +#define XA_DEBUG +#include "shared.h" From 9325b8b5a1cba1fbabe6e8217e248c5f90fd0e13 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Mon, 29 Jul 2024 12:50:41 +0100 Subject: [PATCH 048/416] tools: add skeleton code for userland testing of VMA logic Establish a new userland VMA unit testing implementation under tools/testing which utilises existing logic providing maple tree support in userland utilising the now-shared code previously exclusive to radix tree testing. This provides fundamental VMA operations whose API is defined in mm/vma.h, while stubbing out superfluous functionality. This exists as a proof-of-concept, with the test implementation functional and sufficient to allow userland compilation of vma.c, but containing only cursory tests to demonstrate basic functionality. 
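A new check in this harness is just a bool-returning function built from the ASSERT_*() macros and registered with the TEST() macro in main(). The sketch below is illustrative only (the test name is hypothetical); it uses nothing beyond the helpers this patch introduces (alloc_vma(), vma_link(), and the vma_lookup() shim in vma_internal.h):

	static bool test_simple_lookup(void)
	{
		unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
		struct mm_struct mm = {};
		struct vm_area_struct *vma = alloc_vma(&mm, 0x1000, 0x2000, 1, flags);

		ASSERT_FALSE(vma_link(&mm, vma));

		/* vma_lookup() is stubbed as an mtree_load() in vma_internal.h. */
		ASSERT_EQ(vma_lookup(&mm, 0x1000), vma);
		ASSERT_EQ(vma_lookup(&mm, 0x3000), NULL);

		vm_area_free(vma);
		mtree_destroy(&mm.mm_mt);

		return true;
	}

Registering it is then a one-liner in main(): TEST(simple_lookup);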
Link: https://lkml.kernel.org/r/533ffa2eec771cbe6b387dd049a7f128a53eb616.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Tested-by: SeongJae Park Acked-by: Vlastimil Babka Reviewed-by: Liam R. Howlett Cc: Alexander Viro Cc: Brendan Higgins Cc: Christian Brauner Cc: David Gow Cc: Eric W. Biederman Cc: Jan Kara Cc: Kees Cook Cc: Matthew Wilcox (Oracle) Cc: Rae Moar Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Pengfei Xu Signed-off-by: Andrew Morton --- MAINTAINERS | 1 + tools/testing/vma/.gitignore | 7 + tools/testing/vma/Makefile | 16 + tools/testing/vma/linux/atomic.h | 12 + tools/testing/vma/linux/mmzone.h | 38 ++ tools/testing/vma/vma.c | 207 ++++++++ tools/testing/vma/vma_internal.h | 882 +++++++++++++++++++++++++++++++ 7 files changed, 1163 insertions(+) create mode 100644 tools/testing/vma/.gitignore create mode 100644 tools/testing/vma/Makefile create mode 100644 tools/testing/vma/linux/atomic.h create mode 100644 tools/testing/vma/linux/mmzone.h create mode 100644 tools/testing/vma/vma.c create mode 100644 tools/testing/vma/vma_internal.h diff --git a/MAINTAINERS b/MAINTAINERS index 4a7b6e25bc77..41b15083e4a1 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -24423,6 +24423,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm F: mm/vma.c F: mm/vma.h F: mm/vma_internal.h +F: tools/testing/vma/ VMALLOC M: Andrew Morton diff --git a/tools/testing/vma/.gitignore b/tools/testing/vma/.gitignore new file mode 100644 index 000000000000..b003258eba79 --- /dev/null +++ b/tools/testing/vma/.gitignore @@ -0,0 +1,7 @@ +# SPDX-License-Identifier: GPL-2.0-only +generated/bit-length.h +generated/map-shift.h +generated/autoconf.h +idr.c +radix-tree.c +vma diff --git a/tools/testing/vma/Makefile b/tools/testing/vma/Makefile new file mode 100644 index 000000000000..bfc905d222cf --- /dev/null +++ b/tools/testing/vma/Makefile @@ -0,0 +1,16 @@ +# SPDX-License-Identifier: GPL-2.0-or-later + +.PHONY: default + +default: vma + +include ../shared/shared.mk + +OFILES = $(SHARED_OFILES) vma.o maple-shim.o +TARGETS = vma + +vma: $(OFILES) vma_internal.h ../../../mm/vma.c ../../../mm/vma.h + $(CC) $(CFLAGS) -o $@ $(OFILES) $(LDLIBS) + +clean: + $(RM) $(TARGETS) *.o radix-tree.c idr.c generated/map-shift.h generated/bit-length.h generated/autoconf.h diff --git a/tools/testing/vma/linux/atomic.h b/tools/testing/vma/linux/atomic.h new file mode 100644 index 000000000000..e01f66f98982 --- /dev/null +++ b/tools/testing/vma/linux/atomic.h @@ -0,0 +1,12 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ + +#ifndef _LINUX_ATOMIC_H +#define _LINUX_ATOMIC_H + +#define atomic_t int32_t +#define atomic_inc(x) uatomic_inc(x) +#define atomic_read(x) uatomic_read(x) +#define atomic_set(x, y) do {} while (0) +#define U8_MAX UCHAR_MAX + +#endif /* _LINUX_ATOMIC_H */ diff --git a/tools/testing/vma/linux/mmzone.h b/tools/testing/vma/linux/mmzone.h new file mode 100644 index 000000000000..33cd1517f7a3 --- /dev/null +++ b/tools/testing/vma/linux/mmzone.h @@ -0,0 +1,38 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ + +#ifndef _LINUX_MMZONE_H +#define _LINUX_MMZONE_H + +#include + +struct pglist_data *first_online_pgdat(void); +struct pglist_data *next_online_pgdat(struct pglist_data *pgdat); + +#define for_each_online_pgdat(pgdat) \ + for (pgdat = first_online_pgdat(); \ + pgdat; \ + pgdat = next_online_pgdat(pgdat)) + +enum zone_type { + __MAX_NR_ZONES +}; + +#define MAX_NR_ZONES __MAX_NR_ZONES +#define MAX_PAGE_ORDER 10 +#define MAX_ORDER_NR_PAGES (1 << MAX_PAGE_ORDER) + +#define pageblock_order 
MAX_PAGE_ORDER +#define pageblock_nr_pages BIT(pageblock_order) +#define pageblock_align(pfn) ALIGN((pfn), pageblock_nr_pages) +#define pageblock_start_pfn(pfn) ALIGN_DOWN((pfn), pageblock_nr_pages) + +struct zone { + atomic_long_t managed_pages; +}; + +typedef struct pglist_data { + struct zone node_zones[MAX_NR_ZONES]; + +} pg_data_t; + +#endif /* _LINUX_MMZONE_H */ diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c new file mode 100644 index 000000000000..48e033c60d87 --- /dev/null +++ b/tools/testing/vma/vma.c @@ -0,0 +1,207 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include + +#include "maple-shared.h" +#include "vma_internal.h" + +/* + * Directly import the VMA implementation here. Our vma_internal.h wrapper + * provides userland-equivalent functionality for everything vma.c uses. + */ +#include "../../../mm/vma.c" + +const struct vm_operations_struct vma_dummy_vm_ops; + +#define ASSERT_TRUE(_expr) \ + do { \ + if (!(_expr)) { \ + fprintf(stderr, \ + "Assert FAILED at %s:%d:%s(): %s is FALSE.\n", \ + __FILE__, __LINE__, __FUNCTION__, #_expr); \ + return false; \ + } \ + } while (0) +#define ASSERT_FALSE(_expr) ASSERT_TRUE(!(_expr)) +#define ASSERT_EQ(_val1, _val2) ASSERT_TRUE((_val1) == (_val2)) +#define ASSERT_NE(_val1, _val2) ASSERT_TRUE((_val1) != (_val2)) + +static struct vm_area_struct *alloc_vma(struct mm_struct *mm, + unsigned long start, + unsigned long end, + pgoff_t pgoff, + vm_flags_t flags) +{ + struct vm_area_struct *ret = vm_area_alloc(mm); + + if (ret == NULL) + return NULL; + + ret->vm_start = start; + ret->vm_end = end; + ret->vm_pgoff = pgoff; + ret->__vm_flags = flags; + + return ret; +} + +static bool test_simple_merge(void) +{ + struct vm_area_struct *vma; + unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + struct mm_struct mm = {}; + struct vm_area_struct *vma_left = alloc_vma(&mm, 0, 0x1000, 0, flags); + struct vm_area_struct *vma_middle = alloc_vma(&mm, 0x1000, 0x2000, 1, flags); + struct vm_area_struct *vma_right = alloc_vma(&mm, 0x2000, 0x3000, 2, flags); + VMA_ITERATOR(vmi, &mm, 0x1000); + + ASSERT_FALSE(vma_link(&mm, vma_left)); + ASSERT_FALSE(vma_link(&mm, vma_middle)); + ASSERT_FALSE(vma_link(&mm, vma_right)); + + vma = vma_merge_new_vma(&vmi, vma_left, vma_middle, 0x1000, + 0x2000, 1); + ASSERT_NE(vma, NULL); + + ASSERT_EQ(vma->vm_start, 0); + ASSERT_EQ(vma->vm_end, 0x3000); + ASSERT_EQ(vma->vm_pgoff, 0); + ASSERT_EQ(vma->vm_flags, flags); + + vm_area_free(vma); + mtree_destroy(&mm.mm_mt); + + return true; +} + +static bool test_simple_modify(void) +{ + struct vm_area_struct *vma; + unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + struct mm_struct mm = {}; + struct vm_area_struct *init_vma = alloc_vma(&mm, 0, 0x3000, 0, flags); + VMA_ITERATOR(vmi, &mm, 0x1000); + + ASSERT_FALSE(vma_link(&mm, init_vma)); + + /* + * The flags will not be changed, the vma_modify_flags() function + * performs the merge/split only. + */ + vma = vma_modify_flags(&vmi, init_vma, init_vma, + 0x1000, 0x2000, VM_READ | VM_MAYREAD); + ASSERT_NE(vma, NULL); + /* We modify the provided VMA, and on split allocate new VMAs. */ + ASSERT_EQ(vma, init_vma); + + ASSERT_EQ(vma->vm_start, 0x1000); + ASSERT_EQ(vma->vm_end, 0x2000); + ASSERT_EQ(vma->vm_pgoff, 1); + + /* + * Now walk through the three split VMAs and make sure they are as + * expected. 
+ */ + + vma_iter_set(&vmi, 0); + vma = vma_iter_load(&vmi); + + ASSERT_EQ(vma->vm_start, 0); + ASSERT_EQ(vma->vm_end, 0x1000); + ASSERT_EQ(vma->vm_pgoff, 0); + + vm_area_free(vma); + vma_iter_clear(&vmi); + + vma = vma_next(&vmi); + + ASSERT_EQ(vma->vm_start, 0x1000); + ASSERT_EQ(vma->vm_end, 0x2000); + ASSERT_EQ(vma->vm_pgoff, 1); + + vm_area_free(vma); + vma_iter_clear(&vmi); + + vma = vma_next(&vmi); + + ASSERT_EQ(vma->vm_start, 0x2000); + ASSERT_EQ(vma->vm_end, 0x3000); + ASSERT_EQ(vma->vm_pgoff, 2); + + vm_area_free(vma); + mtree_destroy(&mm.mm_mt); + + return true; +} + +static bool test_simple_expand(void) +{ + unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + struct mm_struct mm = {}; + struct vm_area_struct *vma = alloc_vma(&mm, 0, 0x1000, 0, flags); + VMA_ITERATOR(vmi, &mm, 0); + + ASSERT_FALSE(vma_link(&mm, vma)); + + ASSERT_FALSE(vma_expand(&vmi, vma, 0, 0x3000, 0, NULL)); + + ASSERT_EQ(vma->vm_start, 0); + ASSERT_EQ(vma->vm_end, 0x3000); + ASSERT_EQ(vma->vm_pgoff, 0); + + vm_area_free(vma); + mtree_destroy(&mm.mm_mt); + + return true; +} + +static bool test_simple_shrink(void) +{ + unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + struct mm_struct mm = {}; + struct vm_area_struct *vma = alloc_vma(&mm, 0, 0x3000, 0, flags); + VMA_ITERATOR(vmi, &mm, 0); + + ASSERT_FALSE(vma_link(&mm, vma)); + + ASSERT_FALSE(vma_shrink(&vmi, vma, 0, 0x1000, 0)); + + ASSERT_EQ(vma->vm_start, 0); + ASSERT_EQ(vma->vm_end, 0x1000); + ASSERT_EQ(vma->vm_pgoff, 0); + + vm_area_free(vma); + mtree_destroy(&mm.mm_mt); + + return true; +} + +int main(void) +{ + int num_tests = 0, num_fail = 0; + + maple_tree_init(); + +#define TEST(name) \ + do { \ + num_tests++; \ + if (!test_##name()) { \ + num_fail++; \ + fprintf(stderr, "Test " #name " FAILED\n"); \ + } \ + } while (0) + + TEST(simple_merge); + TEST(simple_modify); + TEST(simple_expand); + TEST(simple_shrink); + +#undef TEST + + printf("%d tests run, %d passed, %d failed.\n", + num_tests, num_tests - num_fail, num_fail); + + return num_fail == 0 ? EXIT_SUCCESS : EXIT_FAILURE; +} diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h new file mode 100644 index 000000000000..093560e5b2ac --- /dev/null +++ b/tools/testing/vma/vma_internal.h @@ -0,0 +1,882 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ +/* + * vma_internal.h + * + * Header providing userland wrappers and shims for the functionality provided + * by mm/vma_internal.h. + * + * We make the header guard the same as mm/vma_internal.h, so if this shim + * header is included, it precludes the inclusion of the kernel one. 
+ */ + +#ifndef __MM_VMA_INTERNAL_H +#define __MM_VMA_INTERNAL_H + +#define __private +#define __bitwise +#define __randomize_layout + +#define CONFIG_MMU +#define CONFIG_PER_VMA_LOCK + +#include + +#include +#include +#include +#include +#include + +#define VM_WARN_ON(_expr) (WARN_ON(_expr)) +#define VM_WARN_ON_ONCE(_expr) (WARN_ON_ONCE(_expr)) +#define VM_BUG_ON(_expr) (BUG_ON(_expr)) +#define VM_BUG_ON_VMA(_expr, _vma) (BUG_ON(_expr)) + +#define VM_NONE 0x00000000 +#define VM_READ 0x00000001 +#define VM_WRITE 0x00000002 +#define VM_EXEC 0x00000004 +#define VM_SHARED 0x00000008 +#define VM_MAYREAD 0x00000010 +#define VM_MAYWRITE 0x00000020 +#define VM_GROWSDOWN 0x00000100 +#define VM_PFNMAP 0x00000400 +#define VM_LOCKED 0x00002000 +#define VM_IO 0x00004000 +#define VM_DONTEXPAND 0x00040000 +#define VM_ACCOUNT 0x00100000 +#define VM_MIXEDMAP 0x10000000 +#define VM_STACK VM_GROWSDOWN +#define VM_SHADOW_STACK VM_NONE +#define VM_SOFTDIRTY 0 + +#define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC) +#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP) + +#define FIRST_USER_ADDRESS 0UL +#define USER_PGTABLES_CEILING 0UL + +#define vma_policy(vma) NULL + +#define down_write_nest_lock(sem, nest_lock) + +#define pgprot_val(x) ((x).pgprot) +#define __pgprot(x) ((pgprot_t) { (x) } ) + +#define for_each_vma(__vmi, __vma) \ + while (((__vma) = vma_next(&(__vmi))) != NULL) + +/* The MM code likes to work with exclusive end addresses */ +#define for_each_vma_range(__vmi, __vma, __end) \ + while (((__vma) = vma_find(&(__vmi), (__end))) != NULL) + +#define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK) + +#define PHYS_PFN(x) ((unsigned long)((x) >> PAGE_SHIFT)) + +#define test_and_set_bit(nr, addr) __test_and_set_bit(nr, addr) +#define test_and_clear_bit(nr, addr) __test_and_clear_bit(nr, addr) + +#define TASK_SIZE ((1ul << 47)-PAGE_SIZE) + +#define AS_MM_ALL_LOCKS 2 + +#define current NULL + +/* We hardcode this for now. */ +#define sysctl_max_map_count 0x1000000UL + +#define pgoff_t unsigned long +typedef unsigned long pgprotval_t; +typedef struct pgprot { pgprotval_t pgprot; } pgprot_t; +typedef unsigned long vm_flags_t; +typedef __bitwise unsigned int vm_fault_t; + +typedef struct refcount_struct { + atomic_t refs; +} refcount_t; + +struct kref { + refcount_t refcount; +}; + +struct anon_vma { + struct anon_vma *root; + struct rb_root_cached rb_root; +}; + +struct anon_vma_chain { + struct anon_vma *anon_vma; + struct list_head same_vma; +}; + +struct anon_vma_name { + struct kref kref; + /* The name needs to be at the end because it is dynamically sized. 
*/ + char name[]; +}; + +struct vma_iterator { + struct ma_state mas; +}; + +#define VMA_ITERATOR(name, __mm, __addr) \ + struct vma_iterator name = { \ + .mas = { \ + .tree = &(__mm)->mm_mt, \ + .index = __addr, \ + .node = NULL, \ + .status = ma_start, \ + }, \ + } + +struct address_space { + struct rb_root_cached i_mmap; + unsigned long flags; + atomic_t i_mmap_writable; +}; + +struct vm_userfaultfd_ctx {}; +struct mempolicy {}; +struct mmu_gather {}; +struct mutex {}; +#define DEFINE_MUTEX(mutexname) \ + struct mutex mutexname = {} + +struct mm_struct { + struct maple_tree mm_mt; + int map_count; /* number of VMAs */ + unsigned long total_vm; /* Total pages mapped */ + unsigned long locked_vm; /* Pages that have PG_mlocked set */ + unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */ + unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */ + unsigned long stack_vm; /* VM_STACK */ +}; + +struct vma_lock { + struct rw_semaphore lock; +}; + + +struct file { + struct address_space *f_mapping; +}; + +struct vm_area_struct { + /* The first cache line has the info for VMA tree walking. */ + + union { + struct { + /* VMA covers [vm_start; vm_end) addresses within mm */ + unsigned long vm_start; + unsigned long vm_end; + }; +#ifdef CONFIG_PER_VMA_LOCK + struct rcu_head vm_rcu; /* Used for deferred freeing. */ +#endif + }; + + struct mm_struct *vm_mm; /* The address space we belong to. */ + pgprot_t vm_page_prot; /* Access permissions of this VMA. */ + + /* + * Flags, see mm.h. + * To modify use vm_flags_{init|reset|set|clear|mod} functions. + */ + union { + const vm_flags_t vm_flags; + vm_flags_t __private __vm_flags; + }; + +#ifdef CONFIG_PER_VMA_LOCK + /* Flag to indicate areas detached from the mm->mm_mt tree */ + bool detached; + + /* + * Can only be written (using WRITE_ONCE()) while holding both: + * - mmap_lock (in write mode) + * - vm_lock->lock (in write mode) + * Can be read reliably while holding one of: + * - mmap_lock (in read or write mode) + * - vm_lock->lock (in read or write mode) + * Can be read unreliably (using READ_ONCE()) for pessimistic bailout + * while holding nothing (except RCU to keep the VMA struct allocated). + * + * This sequence counter is explicitly allowed to overflow; sequence + * counter reuse can only lead to occasional unnecessary use of the + * slowpath. + */ + int vm_lock_seq; + struct vma_lock *vm_lock; +#endif + + /* + * For areas with an address space and backing store, + * linkage into the address_space->i_mmap interval tree. + * + */ + struct { + struct rb_node rb; + unsigned long rb_subtree_last; + } shared; + + /* + * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma + * list, after a COW of one of the file pages. A MAP_SHARED vma + * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack + * or brk vma (with NULL file) can only be in an anon_vma list. + */ + struct list_head anon_vma_chain; /* Serialized by mmap_lock & + * page_table_lock */ + struct anon_vma *anon_vma; /* Serialized by page_table_lock */ + + /* Function pointers to deal with this struct. */ + const struct vm_operations_struct *vm_ops; + + /* Information about our backing store: */ + unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE + units */ + struct file * vm_file; /* File we map to (can be NULL). 
*/ + void * vm_private_data; /* was vm_pte (shared mem) */ + +#ifdef CONFIG_ANON_VMA_NAME + /* + * For private and shared anonymous mappings, a pointer to a null + * terminated string containing the name given to the vma, or NULL if + * unnamed. Serialized by mmap_lock. Use anon_vma_name to access. + */ + struct anon_vma_name *anon_name; +#endif +#ifdef CONFIG_SWAP + atomic_long_t swap_readahead_info; +#endif +#ifndef CONFIG_MMU + struct vm_region *vm_region; /* NOMMU mapping region */ +#endif +#ifdef CONFIG_NUMA + struct mempolicy *vm_policy; /* NUMA policy for the VMA */ +#endif +#ifdef CONFIG_NUMA_BALANCING + struct vma_numab_state *numab_state; /* NUMA Balancing state */ +#endif + struct vm_userfaultfd_ctx vm_userfaultfd_ctx; +} __randomize_layout; + +struct vm_fault {}; + +struct vm_operations_struct { + void (*open)(struct vm_area_struct * area); + /** + * @close: Called when the VMA is being removed from the MM. + * Context: User context. May sleep. Caller holds mmap_lock. + */ + void (*close)(struct vm_area_struct * area); + /* Called any time before splitting to check if it's allowed */ + int (*may_split)(struct vm_area_struct *area, unsigned long addr); + int (*mremap)(struct vm_area_struct *area); + /* + * Called by mprotect() to make driver-specific permission + * checks before mprotect() is finalised. The VMA must not + * be modified. Returns 0 if mprotect() can proceed. + */ + int (*mprotect)(struct vm_area_struct *vma, unsigned long start, + unsigned long end, unsigned long newflags); + vm_fault_t (*fault)(struct vm_fault *vmf); + vm_fault_t (*huge_fault)(struct vm_fault *vmf, unsigned int order); + vm_fault_t (*map_pages)(struct vm_fault *vmf, + pgoff_t start_pgoff, pgoff_t end_pgoff); + unsigned long (*pagesize)(struct vm_area_struct * area); + + /* notification that a previously read-only page is about to become + * writable, if an error is returned it will cause a SIGBUS */ + vm_fault_t (*page_mkwrite)(struct vm_fault *vmf); + + /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */ + vm_fault_t (*pfn_mkwrite)(struct vm_fault *vmf); + + /* called by access_process_vm when get_user_pages() fails, typically + * for use by special VMAs. See also generic_access_phys() for a generic + * implementation useful for any iomem mapping. + */ + int (*access)(struct vm_area_struct *vma, unsigned long addr, + void *buf, int len, int write); + + /* Called by the /proc/PID/maps code to ask the vma whether it + * has a special name. Returning non-NULL will also cause this + * vma to be dumped unconditionally. */ + const char *(*name)(struct vm_area_struct *vma); + +#ifdef CONFIG_NUMA + /* + * set_policy() op must add a reference to any non-NULL @new mempolicy + * to hold the policy upon return. Caller should pass NULL @new to + * remove a policy and fall back to surrounding context--i.e. do not + * install a MPOL_DEFAULT policy, nor the task or system default + * mempolicy. + */ + int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new); + + /* + * get_policy() op must add reference [mpol_get()] to any policy at + * (vma,addr) marked as MPOL_SHARED. The shared policy infrastructure + * in mm/mempolicy.c will do this automatically. + * get_policy() must NOT add a ref if the policy at (vma,addr) is not + * marked as MPOL_SHARED. vma policies are protected by the mmap_lock. + * If no [shared/vma] mempolicy exists at the addr, get_policy() op + * must return NULL--i.e., do not "fallback" to task or system default + * policy. 
+ */ + struct mempolicy *(*get_policy)(struct vm_area_struct *vma, + unsigned long addr, pgoff_t *ilx); +#endif + /* + * Called by vm_normal_page() for special PTEs to find the + * page for @addr. This is useful if the default behavior + * (using pte_page()) would not find the correct page. + */ + struct page *(*find_special_page)(struct vm_area_struct *vma, + unsigned long addr); +}; + +static inline void vma_iter_invalidate(struct vma_iterator *vmi) +{ + mas_pause(&vmi->mas); +} + +static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) +{ + return __pgprot(pgprot_val(oldprot) | pgprot_val(newprot)); +} + +static inline pgprot_t vm_get_page_prot(unsigned long vm_flags) +{ + return __pgprot(vm_flags); +} + +static inline bool is_shared_maywrite(vm_flags_t vm_flags) +{ + return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == + (VM_SHARED | VM_MAYWRITE); +} + +static inline bool vma_is_shared_maywrite(struct vm_area_struct *vma) +{ + return is_shared_maywrite(vma->vm_flags); +} + +static inline struct vm_area_struct *vma_next(struct vma_iterator *vmi) +{ + /* + * Uses mas_find() to get the first VMA when the iterator starts. + * Calling mas_next() could skip the first entry. + */ + return mas_find(&vmi->mas, ULONG_MAX); +} + +static inline bool vma_lock_alloc(struct vm_area_struct *vma) +{ + vma->vm_lock = calloc(1, sizeof(struct vma_lock)); + + if (!vma->vm_lock) + return false; + + init_rwsem(&vma->vm_lock->lock); + vma->vm_lock_seq = -1; + + return true; +} + +static inline void vma_assert_write_locked(struct vm_area_struct *); +static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached) +{ + /* When detaching vma should be write-locked */ + if (detached) + vma_assert_write_locked(vma); + vma->detached = detached; +} + +extern const struct vm_operations_struct vma_dummy_vm_ops; + +static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) +{ + memset(vma, 0, sizeof(*vma)); + vma->vm_mm = mm; + vma->vm_ops = &vma_dummy_vm_ops; + INIT_LIST_HEAD(&vma->anon_vma_chain); + vma_mark_detached(vma, false); +} + +static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm) +{ + struct vm_area_struct *vma = calloc(1, sizeof(struct vm_area_struct)); + + if (!vma) + return NULL; + + vma_init(vma, mm); + if (!vma_lock_alloc(vma)) { + free(vma); + return NULL; + } + + return vma; +} + +static inline struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) +{ + struct vm_area_struct *new = calloc(1, sizeof(struct vm_area_struct)); + + if (!new) + return NULL; + + memcpy(new, orig, sizeof(*new)); + if (!vma_lock_alloc(new)) { + free(new); + return NULL; + } + INIT_LIST_HEAD(&new->anon_vma_chain); + + return new; +} + +/* + * These are defined in vma.h, but sadly vm_stat_account() is referenced by + * kernel/fork.c, so we have to these broadly available there, and temporarily + * define them here to resolve the dependency cycle. 
+ */ + +#define is_exec_mapping(flags) \ + ((flags & (VM_EXEC | VM_WRITE | VM_STACK)) == VM_EXEC) + +#define is_stack_mapping(flags) \ + (((flags & VM_STACK) == VM_STACK) || (flags & VM_SHADOW_STACK)) + +#define is_data_mapping(flags) \ + ((flags & (VM_WRITE | VM_SHARED | VM_STACK)) == VM_WRITE) + +static inline void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, + long npages) +{ + WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm)+npages); + + if (is_exec_mapping(flags)) + mm->exec_vm += npages; + else if (is_stack_mapping(flags)) + mm->stack_vm += npages; + else if (is_data_mapping(flags)) + mm->data_vm += npages; +} + +#undef is_exec_mapping +#undef is_stack_mapping +#undef is_data_mapping + +/* Currently stubbed but we may later wish to un-stub. */ +static inline void vm_acct_memory(long pages); +static inline void vm_unacct_memory(long pages) +{ + vm_acct_memory(-pages); +} + +static inline void mapping_allow_writable(struct address_space *mapping) +{ + atomic_inc(&mapping->i_mmap_writable); +} + +static inline void vma_set_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end, + pgoff_t pgoff) +{ + vma->vm_start = start; + vma->vm_end = end; + vma->vm_pgoff = pgoff; +} + +static inline +struct vm_area_struct *vma_find(struct vma_iterator *vmi, unsigned long max) +{ + return mas_find(&vmi->mas, max - 1); +} + +static inline int vma_iter_clear_gfp(struct vma_iterator *vmi, + unsigned long start, unsigned long end, gfp_t gfp) +{ + __mas_set_range(&vmi->mas, start, end - 1); + mas_store_gfp(&vmi->mas, NULL, gfp); + if (unlikely(mas_is_err(&vmi->mas))) + return -ENOMEM; + + return 0; +} + +static inline void mmap_assert_locked(struct mm_struct *); +static inline struct vm_area_struct *find_vma_intersection(struct mm_struct *mm, + unsigned long start_addr, + unsigned long end_addr) +{ + unsigned long index = start_addr; + + mmap_assert_locked(mm); + return mt_find(&mm->mm_mt, &index, end_addr - 1); +} + +static inline +struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr) +{ + return mtree_load(&mm->mm_mt, addr); +} + +static inline struct vm_area_struct *vma_prev(struct vma_iterator *vmi) +{ + return mas_prev(&vmi->mas, 0); +} + +static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr) +{ + mas_set(&vmi->mas, addr); +} + +static inline bool vma_is_anonymous(struct vm_area_struct *vma) +{ + return !vma->vm_ops; +} + +/* Defined in vma.h, so temporarily define here to avoid circular dependency. */ +#define vma_iter_load(vmi) \ + mas_walk(&(vmi)->mas) + +static inline struct vm_area_struct * +find_vma_prev(struct mm_struct *mm, unsigned long addr, + struct vm_area_struct **pprev) +{ + struct vm_area_struct *vma; + VMA_ITERATOR(vmi, mm, addr); + + vma = vma_iter_load(&vmi); + *pprev = vma_prev(&vmi); + if (!vma) + vma = vma_next(&vmi); + return vma; +} + +#undef vma_iter_load + +static inline void vma_iter_init(struct vma_iterator *vmi, + struct mm_struct *mm, unsigned long addr) +{ + mas_init(&vmi->mas, &mm->mm_mt, addr); +} + +/* Stubbed functions. 
*/ + +static inline struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma) +{ + return NULL; +} + +static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma, + struct vm_userfaultfd_ctx vm_ctx) +{ + return true; +} + +static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1, + struct anon_vma_name *anon_name2) +{ + return true; +} + +static inline void might_sleep(void) +{ +} + +static inline unsigned long vma_pages(struct vm_area_struct *vma) +{ + return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT; +} + +static inline void fput(struct file *) +{ +} + +static inline void mpol_put(struct mempolicy *) +{ +} + +static inline void vma_lock_free(struct vm_area_struct *vma) +{ + free(vma->vm_lock); +} + +static inline void __vm_area_free(struct vm_area_struct *vma) +{ + vma_lock_free(vma); + free(vma); +} + +static inline void vm_area_free(struct vm_area_struct *vma) +{ + __vm_area_free(vma); +} + +static inline void lru_add_drain(void) +{ +} + +static inline void tlb_gather_mmu(struct mmu_gather *, struct mm_struct *) +{ +} + +static inline void update_hiwater_rss(struct mm_struct *) +{ +} + +static inline void update_hiwater_vm(struct mm_struct *) +{ +} + +static inline void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas, + struct vm_area_struct *vma, unsigned long start_addr, + unsigned long end_addr, unsigned long tree_end, + bool mm_wr_locked) +{ + (void)tlb; + (void)mas; + (void)vma; + (void)start_addr; + (void)end_addr; + (void)tree_end; + (void)mm_wr_locked; +} + +static inline void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, + struct vm_area_struct *vma, unsigned long floor, + unsigned long ceiling, bool mm_wr_locked) +{ + (void)tlb; + (void)mas; + (void)vma; + (void)floor; + (void)ceiling; + (void)mm_wr_locked; +} + +static inline void mapping_unmap_writable(struct address_space *) +{ +} + +static inline void flush_dcache_mmap_lock(struct address_space *) +{ +} + +static inline void tlb_finish_mmu(struct mmu_gather *) +{ +} + +static inline void get_file(struct file *) +{ +} + +static inline int vma_dup_policy(struct vm_area_struct *, struct vm_area_struct *) +{ + return 0; +} + +static inline int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *) +{ + return 0; +} + +static inline void vma_start_write(struct vm_area_struct *) +{ +} + +static inline void vma_adjust_trans_huge(struct vm_area_struct *vma, + unsigned long start, + unsigned long end, + long adjust_next) +{ + (void)vma; + (void)start; + (void)end; + (void)adjust_next; +} + +static inline void vma_iter_free(struct vma_iterator *vmi) +{ + mas_destroy(&vmi->mas); +} + +static inline void vm_acct_memory(long pages) +{ +} + +static inline void vma_interval_tree_insert(struct vm_area_struct *, + struct rb_root_cached *) +{ +} + +static inline void vma_interval_tree_remove(struct vm_area_struct *, + struct rb_root_cached *) +{ +} + +static inline void flush_dcache_mmap_unlock(struct address_space *) +{ +} + +static inline void anon_vma_interval_tree_insert(struct anon_vma_chain*, + struct rb_root_cached *) +{ +} + +static inline void anon_vma_interval_tree_remove(struct anon_vma_chain*, + struct rb_root_cached *) +{ +} + +static inline void uprobe_mmap(struct vm_area_struct *) +{ +} + +static inline void uprobe_munmap(struct vm_area_struct *vma, + unsigned long start, unsigned long end) +{ + (void)vma; + (void)start; + (void)end; +} + +static inline void i_mmap_lock_write(struct address_space *) +{ +} + +static inline void anon_vma_lock_write(struct 
anon_vma *) +{ +} + +static inline void vma_assert_write_locked(struct vm_area_struct *) +{ +} + +static inline void unlink_anon_vmas(struct vm_area_struct *) +{ +} + +static inline void anon_vma_unlock_write(struct anon_vma *) +{ +} + +static inline void i_mmap_unlock_write(struct address_space *) +{ +} + +static inline void anon_vma_merge(struct vm_area_struct *, + struct vm_area_struct *) +{ +} + +static inline int userfaultfd_unmap_prep(struct vm_area_struct *vma, + unsigned long start, + unsigned long end, + struct list_head *unmaps) +{ + (void)vma; + (void)start; + (void)end; + (void)unmaps; + + return 0; +} + +static inline void mmap_write_downgrade(struct mm_struct *) +{ +} + +static inline void mmap_read_unlock(struct mm_struct *) +{ +} + +static inline void mmap_write_unlock(struct mm_struct *) +{ +} + +static inline bool can_modify_mm(struct mm_struct *mm, + unsigned long start, + unsigned long end) +{ + (void)mm; + (void)start; + (void)end; + + return true; +} + +static inline void arch_unmap(struct mm_struct *mm, + unsigned long start, + unsigned long end) +{ + (void)mm; + (void)start; + (void)end; +} + +static inline void mmap_assert_locked(struct mm_struct *) +{ +} + +static inline bool mpol_equal(struct mempolicy *, struct mempolicy *) +{ + return true; +} + +static inline void khugepaged_enter_vma(struct vm_area_struct *vma, + unsigned long vm_flags) +{ + (void)vma; + (void)vm_flags; +} + +static inline bool mapping_can_writeback(struct address_space *) +{ + return true; +} + +static inline bool is_vm_hugetlb_page(struct vm_area_struct *) +{ + return false; +} + +static inline bool vma_soft_dirty_enabled(struct vm_area_struct *) +{ + return false; +} + +static inline bool userfaultfd_wp(struct vm_area_struct *) +{ + return false; +} + +static inline void mmap_assert_write_locked(struct mm_struct *) +{ +} + +static inline void mutex_lock(struct mutex *) +{ +} + +static inline void mutex_unlock(struct mutex *) +{ +} + +static inline bool mutex_is_locked(struct mutex *) +{ + return true; +} + +static inline bool signal_pending(void *) +{ + return false; +} + +#endif /* __MM_VMA_INTERNAL_H */ From 29943248af0a8ac3edb808564baa417b40e35521 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Mon, 29 Jul 2024 14:47:17 +0530 Subject: [PATCH 049/416] mm: improve code consistency with zonelist_* helper functions Replace direct access to zoneref->zone, zoneref->zone_idx, or zone_to_nid(zoneref->zone) with the corresponding zonelist_* helper functions for consistency. No functional change. 
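For reference, the three helpers being substituted are thin wrappers around the fields that were previously accessed directly; roughly, as defined in include/linux/mmzone.h:

	static inline struct zone *zonelist_zone(struct zoneref *zoneref)
	{
		return zoneref->zone;
	}

	static inline int zonelist_zone_idx(struct zoneref *zoneref)
	{
		return zoneref->zone_idx;
	}

	static inline int zonelist_node_idx(struct zoneref *zoneref)
	{
		return zone_to_nid(zoneref->zone);
	}

Each hunk below is therefore a mechanical one-for-one substitution.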
Link: https://lkml.kernel.org/r/20240729091717.464-1-shivankg@amd.com Co-developed-by: Shivank Garg Signed-off-by: Shivank Garg Signed-off-by: Wei Yang Acked-by: David Hildenbrand Cc: Mike Rapoport (IBM) Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 4 ++-- include/trace/events/oom.h | 4 ++-- mm/mempolicy.c | 4 ++-- mm/mmzone.c | 2 +- mm/page_alloc.c | 22 +++++++++++----------- 5 files changed, 18 insertions(+), 18 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 1dc6248feb83..dbb69f96a152 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1688,7 +1688,7 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist, zone = zonelist_zone(z)) #define for_next_zone_zonelist_nodemask(zone, z, highidx, nodemask) \ - for (zone = z->zone; \ + for (zone = zonelist_zone(z); \ zone; \ z = next_zones_zonelist(++z, highidx, nodemask), \ zone = zonelist_zone(z)) @@ -1724,7 +1724,7 @@ static inline bool movable_only_nodes(nodemask_t *nodes) nid = first_node(*nodes); zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK]; z = first_zones_zonelist(zonelist, ZONE_NORMAL, nodes); - return (!z->zone) ? true : false; + return (!zonelist_zone(z)) ? true : false; } diff --git a/include/trace/events/oom.h b/include/trace/events/oom.h index a42be4c8563b..9f0a5d1482c4 100644 --- a/include/trace/events/oom.h +++ b/include/trace/events/oom.h @@ -55,8 +55,8 @@ TRACE_EVENT(reclaim_retry_zone, ), TP_fast_assign( - __entry->node = zone_to_nid(zoneref->zone); - __entry->zone_idx = zoneref->zone_idx; + __entry->node = zonelist_node_idx(zoneref); + __entry->zone_idx = zonelist_zone_idx(zoneref); __entry->order = order; __entry->reclaimable = reclaimable; __entry->available = available; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index b858e22b259d..b3b5f376471f 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1951,7 +1951,7 @@ unsigned int mempolicy_slab_node(void) zonelist = &NODE_DATA(node)->node_zonelists[ZONELIST_FALLBACK]; z = first_zones_zonelist(zonelist, highest_zoneidx, &policy->nodes); - return z->zone ? zone_to_nid(z->zone) : node; + return zonelist_zone(z) ? zonelist_node_idx(z) : node; } case MPOL_LOCAL: return node; @@ -2809,7 +2809,7 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf, node_zonelist(thisnid, GFP_HIGHUSER), gfp_zone(GFP_HIGHUSER), &pol->nodes); - polnid = zone_to_nid(z->zone); + polnid = zonelist_node_idx(z); break; default: diff --git a/mm/mmzone.c b/mm/mmzone.c index c01896eca736..f9baa8882fbf 100644 --- a/mm/mmzone.c +++ b/mm/mmzone.c @@ -66,7 +66,7 @@ struct zoneref *__next_zones_zonelist(struct zoneref *z, z++; else while (zonelist_zone_idx(z) > highest_zoneidx || - (z->zone && !zref_in_nodemask(z, nodes))) + (zonelist_zone(z) && !zref_in_nodemask(z, nodes))) z++; return z; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c565de8f48e9..60742d057b05 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3350,7 +3350,7 @@ retry: } if (no_fallback && nr_online_nodes > 1 && - zone != ac->preferred_zoneref->zone) { + zone != zonelist_zone(ac->preferred_zoneref)) { int local_nid; /* @@ -3358,7 +3358,7 @@ retry: * fragmenting fallbacks. Locality is more important * than fragmentation avoidance. 
*/ - local_nid = zone_to_nid(ac->preferred_zoneref->zone); + local_nid = zonelist_node_idx(ac->preferred_zoneref); if (zone_to_nid(zone) != local_nid) { alloc_flags &= ~ALLOC_NOFRAGMENT; goto retry; @@ -3411,7 +3411,7 @@ check_alloc_wmark: goto try_this_zone; if (!node_reclaim_enabled() || - !zone_allows_reclaim(ac->preferred_zoneref->zone, zone)) + !zone_allows_reclaim(zonelist_zone(ac->preferred_zoneref), zone)) continue; ret = node_reclaim(zone->zone_pgdat, gfp_mask, order); @@ -3433,7 +3433,7 @@ check_alloc_wmark: } try_this_zone: - page = rmqueue(ac->preferred_zoneref->zone, zone, order, + page = rmqueue(zonelist_zone(ac->preferred_zoneref), zone, order, gfp_mask, alloc_flags, ac->migratetype); if (page) { prep_new_page(page, order, gfp_mask, alloc_flags); @@ -4202,7 +4202,7 @@ restart: */ ac->preferred_zoneref = first_zones_zonelist(ac->zonelist, ac->highest_zoneidx, ac->nodemask); - if (!ac->preferred_zoneref->zone) + if (!zonelist_zone(ac->preferred_zoneref)) goto nopage; /* @@ -4214,7 +4214,7 @@ restart: struct zoneref *z = first_zones_zonelist(ac->zonelist, ac->highest_zoneidx, &cpuset_current_mems_allowed); - if (!z->zone) + if (!zonelist_zone(z)) goto nopage; } @@ -4571,8 +4571,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid, continue; } - if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone && - zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) { + if (nr_online_nodes > 1 && zone != zonelist_zone(ac.preferred_zoneref) && + zone_to_nid(zone) != zonelist_node_idx(ac.preferred_zoneref)) { goto failed; } @@ -4631,7 +4631,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid, pcp_trylock_finish(UP_flags); __count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account); - zone_statistics(ac.preferred_zoneref->zone, zone, nr_account); + zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_account); out: return nr_populated; @@ -4689,7 +4689,7 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, * Forbid the first pass from falling back to types that fragment * memory until all local zones are considered. */ - alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp); + alloc_flags |= alloc_flags_nofragment(zonelist_zone(ac.preferred_zoneref), gfp); /* First allocation attempt */ page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac); @@ -5294,7 +5294,7 @@ int local_memory_node(int node) z = first_zones_zonelist(node_zonelist(node, GFP_KERNEL), gfp_zone(GFP_KERNEL), NULL); - return zone_to_nid(z->zone); + return zonelist_node_idx(z); } #endif From 5c0532500f102ae8f80be122223eb692ca1d6721 Mon Sep 17 00:00:00 2001 From: Hao Ge Date: Mon, 29 Jul 2024 16:04:31 +0800 Subject: [PATCH 050/416] mm/cma: change the addition of totalcma_pages in the cma_init_reserved_mem Replace the unnecessary division calculation with cma->count when update the value of totalcma_pages. 
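The division is redundant because cma_init_reserved_mem() has already derived the page count from the byte size earlier in the same function. A condensed sketch of the relevant lines (the cma->count assignment is assumed from the surrounding code and is not visible in the hunk below):

	cma->count = size >> PAGE_SHIFT;	/* page count computed once here */
	cma->order_per_bit = order_per_bit;
	*res_cma = cma;
	cma_area_count++;
	totalcma_pages += cma->count;		/* previously: size / PAGE_SIZE */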
Link: https://lkml.kernel.org/r/20240729080431.70916-1-hao.ge@linux.dev Signed-off-by: Hao Ge Reviewed-by: David Hildenbrand Signed-off-by: Andrew Morton --- mm/cma.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/cma.c b/mm/cma.c index 3e9724716bad..95d6950e177b 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -202,7 +202,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, cma->order_per_bit = order_per_bit; *res_cma = cma; cma_area_count++; - totalcma_pages += (size / PAGE_SIZE); + totalcma_pages += cma->count; return 0; } From 7e60dcb22252d4415b66f7fb33e6d713e19036e0 Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Tue, 30 Jul 2024 14:34:17 +0200 Subject: [PATCH 051/416] mm/z3fold: add __percpu annotation to *unbuddied pointer in struct z3fold_pool Compiling z3fold.c results in several sparse warnings: z3fold.c:797:21: warning: incorrect type in initializer (different address spaces) z3fold.c:797:21: expected void const [noderef] __percpu *__vpp_verify z3fold.c:797:21: got struct list_head * z3fold.c:852:37: warning: incorrect type in initializer (different address spaces) z3fold.c:852:37: expected void const [noderef] __percpu *__vpp_verify z3fold.c:852:37: got struct list_head * z3fold.c:924:25: warning: incorrect type in assignment (different address spaces) z3fold.c:924:25: expected struct list_head *unbuddied z3fold.c:924:25: got void [noderef] __percpu *_res z3fold.c:930:33: warning: incorrect type in initializer (different address spaces) z3fold.c:930:33: expected void const [noderef] __percpu *__vpp_verify z3fold.c:930:33: got struct list_head * z3fold.c:949:25: warning: incorrect type in argument 1 (different address spaces) z3fold.c:949:25: expected void [noderef] __percpu *__pdata z3fold.c:949:25: got struct list_head *unbuddied z3fold.c:979:25: warning: incorrect type in argument 1 (different address spaces) z3fold.c:979:25: expected void [noderef] __percpu *__pdata z3fold.c:979:25: got struct list_head *unbuddied Add __percpu annotation to *unbuddied pointer to fix these warnings. Link: https://lkml.kernel.org/r/20240730123445.5875-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak Reviewed-by: Miaohe Lin Cc: Vitaly Wool Signed-off-by: Andrew Morton --- mm/z3fold.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/z3fold.c b/mm/z3fold.c index 2ebfed32871b..379d24b4fef9 100644 --- a/mm/z3fold.c +++ b/mm/z3fold.c @@ -144,7 +144,7 @@ struct z3fold_pool { const char *name; spinlock_t lock; spinlock_t stale_lock; - struct list_head *unbuddied; + struct list_head __percpu *unbuddied; struct list_head stale; atomic64_t pages_nr; struct kmem_cache *c_handle; From 9f101bef408a3f70c44b6e4de44d3d4e2655ed10 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Tue, 30 Jul 2024 19:13:39 +1200 Subject: [PATCH 052/416] mm: swap: add nr argument in swapcache_prepare and swapcache_clear to support large folios Right now, swapcache_prepare() and swapcache_clear() supports one entry only, to support large folios, we need to handle multiple swap entries. To optimize stack usage, we iterate twice in __swap_duplicate(): the first time to verify that all entries are valid, and the second time to apply the modifications to the entries. Currently, we're using nr=1 for the existing users. 
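The two-pass structure keeps the error paths simple: nothing is written until every entry in the batch has been validated. A simplified outline of __swap_duplicate(entry, usage, nr) follows; entry_is_invalid() and adjusted_count() are placeholder names condensing the checks spelled out in the full diff further down, not real kernel helpers:

	ci = lock_cluster_or_swap_info(p, offset);

	for (i = 0; i < nr; i++) {			/* pass 1: validate only */
		count = p->swap_map[offset + i];
		if (entry_is_invalid(count, usage)) {	/* placeholder helper */
			err = -ENOENT /* or -EEXIST / -EINVAL */;
			goto unlock_out;
		}
	}

	for (i = 0; i < nr; i++) {			/* pass 2: apply the updates */
		count = adjusted_count(p, offset + i, usage);	/* placeholder helper */
		WRITE_ONCE(p->swap_map[offset + i], count);
	}

unlock_out:
	unlock_cluster_or_swap_info(p, ci);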
[v-songbaohua@oppo.com: clarify swap_count_continued and improve readability for __swap_duplicate] Link: https://lkml.kernel.org/r/20240802071817.47081-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240730071339.107447-2-21cnbao@gmail.com Signed-off-by: Barry Song Reviewed-by: Baolin Wang Acked-by: David Hildenbrand Tested-by: Baolin Wang Cc: Chris Li Cc: Gao Xiang Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Johannes Weiner Cc: Kairui Song Cc: Kalesh Singh Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Minchan Kim Cc: Nhat Pham Cc: Ryan Roberts Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Yang Shi Cc: Yosry Ahmed Signed-off-by: Andrew Morton --- include/linux/swap.h | 4 +- mm/memory.c | 6 +-- mm/swap.h | 5 ++- mm/swap_state.c | 2 +- mm/swapfile.c | 97 +++++++++++++++++++++++++------------------- 5 files changed, 64 insertions(+), 50 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index ba7ea95d1c57..5b920fa2315b 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -480,7 +480,7 @@ extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order); extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern void swap_shmem_alloc(swp_entry_t); extern int swap_duplicate(swp_entry_t); -extern int swapcache_prepare(swp_entry_t); +extern int swapcache_prepare(swp_entry_t entry, int nr); extern void swap_free_nr(swp_entry_t entry, int nr_pages); extern void swapcache_free_entries(swp_entry_t *entries, int n); extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); @@ -554,7 +554,7 @@ static inline int swap_duplicate(swp_entry_t swp) return 0; } -static inline int swapcache_prepare(swp_entry_t swp) +static inline int swapcache_prepare(swp_entry_t swp, int nr) { return 0; } diff --git a/mm/memory.c b/mm/memory.c index 71236ae6fd14..cacf03bbdd7a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4081,7 +4081,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * reusing the same entry. It's undetectable as * pte_same() returns true due to entry reuse. 
*/ - if (swapcache_prepare(entry)) { + if (swapcache_prepare(entry, 1)) { /* Relax a bit to prevent rapid repeated page faults */ schedule_timeout_uninterruptible(1); goto out; @@ -4387,7 +4387,7 @@ unlock: out: /* Clear the swap cache pin for direct swapin after PTL unlock */ if (need_clear_cache) - swapcache_clear(si, entry); + swapcache_clear(si, entry, 1); if (si) put_swap_device(si); return ret; @@ -4403,7 +4403,7 @@ out_release: folio_put(swapcache); } if (need_clear_cache) - swapcache_clear(si, entry); + swapcache_clear(si, entry, 1); if (si) put_swap_device(si); return ret; diff --git a/mm/swap.h b/mm/swap.h index baa1fa946b34..7c6330561d84 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -59,7 +59,7 @@ void __delete_from_swap_cache(struct folio *folio, void delete_from_swap_cache(struct folio *folio); void clear_shadow_from_swap_cache(int type, unsigned long begin, unsigned long end); -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry); +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr); struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_area_struct *vma, unsigned long addr); struct folio *filemap_get_incore_folio(struct address_space *mapping, @@ -120,7 +120,7 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc) return 0; } -static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) +static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr) { } @@ -172,4 +172,5 @@ static inline unsigned int folio_swap_flags(struct folio *folio) return 0; } #endif /* CONFIG_SWAP */ + #endif /* _MM_SWAP_H */ diff --git a/mm/swap_state.c b/mm/swap_state.c index a1726e49a5eb..b06f2a054f5a 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -477,7 +477,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, /* * Swap entry may have been freed since our caller observed it. */ - err = swapcache_prepare(entry); + err = swapcache_prepare(entry, 1); if (!err) break; diff --git a/mm/swapfile.c b/mm/swapfile.c index 5f73a8553371..0259f36de715 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3363,7 +3363,7 @@ void si_swapinfo(struct sysinfo *val) } /* - * Verify that a swap entry is valid and increment its swap map count. + * Verify that nr swap entries are valid and increment their swap map counts. * * Returns error code in following case. * - success -> 0 @@ -3373,60 +3373,73 @@ void si_swapinfo(struct sysinfo *val) * - swap-cache reference is requested but the entry is not used. -> ENOENT * - swap-mapped reference requested but needs continued swap count. -> ENOMEM */ -static int __swap_duplicate(swp_entry_t entry, unsigned char usage) +static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) { struct swap_info_struct *p; struct swap_cluster_info *ci; unsigned long offset; unsigned char count; unsigned char has_cache; - int err; + int err, i; p = swp_swap_info(entry); offset = swp_offset(entry); + VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); + VM_WARN_ON(usage == 1 && nr > 1); ci = lock_cluster_or_swap_info(p, offset); - count = p->swap_map[offset]; + err = 0; + for (i = 0; i < nr; i++) { + count = p->swap_map[offset + i]; - /* - * swapin_readahead() doesn't check if a swap entry is valid, so the - * swap entry could be SWAP_MAP_BAD. Check here with lock held. 
- */ - if (unlikely(swap_count(count) == SWAP_MAP_BAD)) { - err = -ENOENT; - goto unlock_out; + /* + * swapin_readahead() doesn't check if a swap entry is valid, so the + * swap entry could be SWAP_MAP_BAD. Check here with lock held. + */ + if (unlikely(swap_count(count) == SWAP_MAP_BAD)) { + err = -ENOENT; + goto unlock_out; + } + + has_cache = count & SWAP_HAS_CACHE; + count &= ~SWAP_HAS_CACHE; + + if (!count && !has_cache) { + err = -ENOENT; + } else if (usage == SWAP_HAS_CACHE) { + if (has_cache) + err = -EEXIST; + } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) { + err = -EINVAL; + } + + if (err) + goto unlock_out; } - has_cache = count & SWAP_HAS_CACHE; - count &= ~SWAP_HAS_CACHE; - err = 0; + for (i = 0; i < nr; i++) { + count = p->swap_map[offset + i]; + has_cache = count & SWAP_HAS_CACHE; + count &= ~SWAP_HAS_CACHE; - if (usage == SWAP_HAS_CACHE) { - - /* set SWAP_HAS_CACHE if there is no cache and entry is used */ - if (!has_cache && count) + if (usage == SWAP_HAS_CACHE) has_cache = SWAP_HAS_CACHE; - else if (has_cache) /* someone else added cache */ - err = -EEXIST; - else /* no users remaining */ - err = -ENOENT; - - } else if (count || has_cache) { - - if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) + else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) count += usage; - else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) - err = -EINVAL; - else if (swap_count_continued(p, offset, count)) + else if (swap_count_continued(p, offset + i, count)) count = COUNT_CONTINUED; - else + else { + /* + * Don't need to rollback changes, because if + * usage == 1, there must be nr == 1. + */ err = -ENOMEM; - } else - err = -ENOENT; /* unused swap entry */ + goto unlock_out; + } - if (!err) - WRITE_ONCE(p->swap_map[offset], count | has_cache); + WRITE_ONCE(p->swap_map[offset + i], count | has_cache); + } unlock_out: unlock_cluster_or_swap_info(p, ci); @@ -3439,7 +3452,7 @@ unlock_out: */ void swap_shmem_alloc(swp_entry_t entry) { - __swap_duplicate(entry, SWAP_MAP_SHMEM); + __swap_duplicate(entry, SWAP_MAP_SHMEM, 1); } /* @@ -3453,29 +3466,29 @@ int swap_duplicate(swp_entry_t entry) { int err = 0; - while (!err && __swap_duplicate(entry, 1) == -ENOMEM) + while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM) err = add_swap_count_continuation(entry, GFP_ATOMIC); return err; } /* - * @entry: swap entry for which we allocate swap cache. + * @entry: first swap entry from which we allocate nr swap cache. * - * Called when allocating swap cache for existing swap entry, + * Called when allocating swap cache for existing swap entries, * This can return error codes. Returns 0 at success. * -EEXIST means there is a swap cache. * Note: return code is different from swap_duplicate(). 
*/ -int swapcache_prepare(swp_entry_t entry) +int swapcache_prepare(swp_entry_t entry, int nr) { - return __swap_duplicate(entry, SWAP_HAS_CACHE); + return __swap_duplicate(entry, SWAP_HAS_CACHE, nr); } -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr) { unsigned long offset = swp_offset(entry); - cluster_swap_free_nr(si, offset, 1, SWAP_HAS_CACHE); + cluster_swap_free_nr(si, offset, nr, SWAP_HAS_CACHE); } struct swap_info_struct *swp_swap_info(swp_entry_t entry) From f732e242841a9775b71744fa6b9bf6f25215b11f Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 26 Jul 2024 01:01:57 +0000 Subject: [PATCH 053/416] mm/memory_hotplug: get rid of __ref After commit 73db3abdca58 ("init/modpost: conditionally check section mismatch to __meminit*"), we can get rid of __ref annotations. Link: https://lkml.kernel.org/r/20240726010157.6177-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Acked-by: David Hildenbrand Cc: Masahiro Yamada Cc: Oscar Salvador Signed-off-by: Andrew Morton --- mm/memory_hotplug.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 66267c26ca1b..df291f2e509d 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -366,7 +366,7 @@ struct page *pfn_to_online_page(unsigned long pfn) } EXPORT_SYMBOL_GPL(pfn_to_online_page); -int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages, +int __add_pages(int nid, unsigned long pfn, unsigned long nr_pages, struct mhp_params *params) { const unsigned long end_pfn = pfn + nr_pages; @@ -524,7 +524,7 @@ static void update_pgdat_span(struct pglist_data *pgdat) pgdat->node_spanned_pages = node_end_pfn - node_start_pfn; } -void __ref remove_pfn_range_from_zone(struct zone *zone, +void remove_pfn_range_from_zone(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages) { @@ -629,7 +629,7 @@ int restore_online_page_callback(online_page_callback_t callback) EXPORT_SYMBOL_GPL(restore_online_page_callback); /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */ -void __ref generic_online_page(struct page *page, unsigned int order) +void generic_online_page(struct page *page, unsigned int order) { __free_pages_core(page, order, MEMINIT_HOTPLUG); } @@ -741,7 +741,7 @@ static inline void section_taint_zone_device(unsigned long pfn) * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related * zone stats (e.g., nr_isolate_pageblock) are touched. */ -void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, +void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages, struct vmem_altmap *altmap, int migratetype) { @@ -1143,7 +1143,7 @@ void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages) /* * Must be called with mem_hotplug_lock in write mode. 
*/ -int __ref online_pages(unsigned long pfn, unsigned long nr_pages, +int online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *zone, struct memory_group *group) { unsigned long flags; @@ -1233,7 +1233,7 @@ failed_addition: } /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */ -static pg_data_t __ref *hotadd_init_pgdat(int nid) +static pg_data_t *hotadd_init_pgdat(int nid) { struct pglist_data *pgdat; @@ -1386,7 +1386,7 @@ bool mhp_supports_memmap_on_memory(void) } EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory); -static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size) +static void remove_memory_blocks_and_altmaps(u64 start, u64 size) { unsigned long memblock_size = memory_block_size_bytes(); u64 cur_start; @@ -1473,7 +1473,7 @@ out: * * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */ -int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) +int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) { struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) }; enum memblock_flags memblock_flags = MEMBLOCK_NONE; @@ -1580,7 +1580,7 @@ error_mem_hotplug_end: } /* requires device_hotplug_lock, see add_memory_resource() */ -int __ref __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags) +int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags) { struct resource *res; int ret; @@ -1939,7 +1939,7 @@ static int count_system_ram_pages_cb(unsigned long start_pfn, /* * Must be called with mem_hotplug_lock in write mode. */ -int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages, +int offline_pages(unsigned long start_pfn, unsigned long nr_pages, struct zone *zone, struct memory_group *group) { const unsigned long end_pfn = start_pfn + nr_pages; @@ -2240,7 +2240,7 @@ static int memory_blocks_have_altmaps(u64 start, u64 size) return 1; } -static int __ref try_remove_memory(u64 start, u64 size) +static int try_remove_memory(u64 start, u64 size) { int rc, nid = NUMA_NO_NODE; From 94ccd21e9a5f41585bd16cc84dab2afa6cc21149 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 31 Jul 2024 16:20:00 +0200 Subject: [PATCH 054/416] mm/hugetlb: remove hugetlb_follow_page_mask() leftover We removed hugetlb_follow_page_mask() in commit 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code") but forgot to cleanup some leftovers. While at it, simplify the hugetlb comment, it's overly detailed and rather confusing. Stating that we may end up in there during coredumping is sufficient to explain the PF_DUMPCORE usage. Link: https://lkml.kernel.org/r/20240731142000.625044-1-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Peter Xu Cc: Muchun Song Cc: Alexander Viro Cc: Christian Brauner Cc: Jan Kara Signed-off-by: Andrew Morton --- fs/userfaultfd.c | 11 ++--------- include/linux/hugetlb.h | 3 --- 2 files changed, 2 insertions(+), 12 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index b3ed7207df7e..68cdd89c97a3 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -371,15 +371,8 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason) unsigned int blocking_state; /* - * We don't do userfault handling for the final child pid update. - * - * We also don't do userfault handling during - * coredumping. 
hugetlbfs has the special - * hugetlb_follow_page_mask() to skip missing pages in the - * FOLL_DUMP case, anon memory also checks for FOLL_DUMP with - * the no_page_table() helper in follow_page_mask(), but the - * shmem_vm_ops->fault method is invoked even during - * coredumping and it ends up here. + * We don't do userfault handling for the final child pid update + * and when coredumping (faults triggered by get_dump_page()). */ if (current->flags & (PF_EXITING|PF_DUMPCORE)) goto out; diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 9b7bcfce6920..3100a52ceb73 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -127,9 +127,6 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma, unsigned long len); int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *, struct vm_area_struct *); -struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma, - unsigned long address, unsigned int flags, - unsigned int *page_mask); void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long, struct page *, zap_flags_t); From 6654d28995d2d1d3e65c653f471a635cd21c99ea Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 10 Jul 2024 23:43:50 +0200 Subject: [PATCH 055/416] mm/rmap: cleanup partially-mapped handling in __folio_remove_rmap() Let's simplify and reduce code indentation. In the RMAP_LEVEL_PTE case, we already check for nr when computing partially_mapped. For RMAP_LEVEL_PMD, it's a bit more confusing. Likely, we don't need the "nr" check, but we could have "nr < nr_pmdmapped" also if we stumbled into the "/* Raced ahead of another remove and an add? */" case. So let's simply move the nr check in there. Note that partially_mapped is always false for small folios. No functional change intended. Link: https://lkml.kernel.org/r/20240710214350.147864-1-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Reviewed-by: Yosry Ahmed Signed-off-by: Andrew Morton --- mm/rmap.c | 23 ++++++++++------------- 1 file changed, 10 insertions(+), 13 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 2630bde38640..901950200957 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1568,22 +1568,19 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, } } - partially_mapped = nr < nr_pmdmapped; + partially_mapped = nr && nr < nr_pmdmapped; break; } - if (nr) { - /* - * Queue anon large folio for deferred split if at least one - * page of the folio is unmapped and at least one page - * is still mapped. - * - * Check partially_mapped first to ensure it is a large folio. - */ - if (folio_test_anon(folio) && partially_mapped && - list_empty(&folio->_deferred_list)) - deferred_split_folio(folio); - } + /* + * Queue anon large folio for deferred split if at least one page of + * the folio is unmapped and at least one page is still mapped. + * + * Check partially_mapped first to ensure it is a large folio. 
+ */ + if (partially_mapped && folio_test_anon(folio) && + list_empty(&folio->_deferred_list)) + deferred_split_folio(folio); __folio_mod_stat(folio, -nr, -nr_pmdmapped); /* From 17d5f38b33b66210c5a46af639c47fddfe1c5b4d Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 31 Jul 2024 18:07:58 +0200 Subject: [PATCH 056/416] mm: clarify folio_likely_mapped_shared() documentation for KSM folios For KSM folios, the function actually does what it is supposed to do: even having multiple mappings inside the same MM is considered "sharing", as there is no real relationship between these KSM page mappings -- in contrast to mapping the same file range twice and having the same pagecache page mapped twice. Link: https://lkml.kernel.org/r/20240731160758.808925-1-david@redhat.com Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton --- include/linux/mm.h | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index bd219ac9c026..2f6c08b53e4f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2132,14 +2132,19 @@ static inline size_t folio_size(const struct folio *folio) * MM ("mapped shared"), or if the folio is only mapped into a single MM * ("mapped exclusively"). * + * For KSM folios, this function also returns "mapped shared" when a folio is + * mapped multiple times into the same MM, because the individual page mappings + * are independent. + * * As precise information is not easily available for all folios, this function * estimates the number of MMs ("sharers") that are currently mapping a folio * using the number of times the first page of the folio is currently mapped * into page tables. * - * For small anonymous folios (except KSM folios) and anonymous hugetlb folios, - * the return value will be exactly correct, because they can only be mapped - * at most once into an MM, and they cannot be partially mapped. + * For small anonymous folios and anonymous hugetlb folios, the return + * value will be exactly correct: non-KSM folios can only be mapped at most once + * into an MM, and they cannot be partially mapped. KSM folios are + * considered shared even if mapped multiple times into the same MM. + * * For other folios, the result can be fuzzy: * #. For partially-mappable large folios (THP), the return value can wrongly @@ -2148,9 +2153,6 @@ static inline size_t folio_size(const struct folio *folio) * #. For pagecache folios (including hugetlb), the return value can wrongly * indicate "mapped shared" (false positive) when two VMAs in the same MM * cover the same file range. - * #. For (small) KSM folios, the return value can wrongly indicate "mapped - * shared" (false positive), when the folio is mapped multiple times into - * the same MM. * * Further, this function only considers current page table mappings that * are tracked using the folio mapcount(s). From 1d3440305e07c166c3752edb5e212bf324cf3252 Mon Sep 17 00:00:00 2001 From: Zhaoyu Liu Date: Wed, 31 Jul 2024 21:31:01 +0800 Subject: [PATCH 057/416] mm: swap: allocate folio only first time in __read_swap_cache_async() When reading a shared swap page, filemap_get_folio() should check whether SWAP_HAS_CACHE has been marked; if the swap cache is not ready yet, a folio is re-allocated. Keep the newly allocated folio around so we avoid allocating it again on the next iteration.
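For illustration, a condensed sketch of the resulting retry loop (hedged; error paths trimmed, names taken from the hunks below) -- the folio is allocated at most once and reused across -EEXIST retries:

	struct folio *new_folio = NULL;

	for (;;) {
		/* a racer may already have installed the folio in the swap cache */
		...
		/* allocate only on the first pass, reuse new_folio afterwards */
		if (!new_folio) {
			new_folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx,
						     numa_node_id());
			if (!new_folio)
				goto put_and_return;
		}

		err = swapcache_prepare(entry, 1);
		if (!err)
			break;			/* we own SWAP_HAS_CACHE */
		else if (err != -EEXIST)
			goto put_and_return;	/* entry was freed */
		/* -EEXIST: loop and retry, keeping new_folio for the next pass */
	}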
Link: https://lkml.kernel.org/r/20240731133101.GA2096752@bytedance Signed-off-by: Zhaoyu Liu Cc: Domenico Cerasuolo Cc: Kairui Song Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- mm/swap_state.c | 58 ++++++++++++++++++++++++++----------------------- 1 file changed, 31 insertions(+), 27 deletions(-) diff --git a/mm/swap_state.c b/mm/swap_state.c index b06f2a054f5a..40c681ae9c3c 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -435,6 +435,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, { struct swap_info_struct *si; struct folio *folio; + struct folio *new_folio = NULL; + struct folio *result = NULL; void *shadow = NULL; *new_page_allocated = false; @@ -463,16 +465,19 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, * else swap_off will be aborted if we return NULL. */ if (!swap_swapcount(si, entry) && swap_slot_cache_enabled) - goto fail_put_swap; + goto put_and_return; /* - * Get a new folio to read into from swap. Allocate it now, - * before marking swap_map SWAP_HAS_CACHE, when -EEXIST will - * cause any racers to loop around until we add it to cache. + * Get a new folio to read into from swap. Allocate it now if + * new_folio not exist, before marking swap_map SWAP_HAS_CACHE, + * when -EEXIST will cause any racers to loop around until we + * add it to cache. */ - folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id()); - if (!folio) - goto fail_put_swap; + if (!new_folio) { + new_folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id()); + if (!new_folio) + goto put_and_return; + } /* * Swap entry may have been freed since our caller observed it. @@ -480,10 +485,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, err = swapcache_prepare(entry, 1); if (!err) break; - - folio_put(folio); - if (err != -EEXIST) - goto fail_put_swap; + else if (err != -EEXIST) + goto put_and_return; /* * Protect against a recursive call to __read_swap_cache_async() @@ -494,7 +497,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, * __read_swap_cache_async() in the writeback path. */ if (skip_if_exists) - goto fail_put_swap; + goto put_and_return; /* * We might race against __delete_from_swap_cache(), and @@ -509,36 +512,37 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, /* * The swap entry is ours to swap in. Prepare the new folio. */ + __folio_set_locked(new_folio); + __folio_set_swapbacked(new_folio); - __folio_set_locked(folio); - __folio_set_swapbacked(folio); - - if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry)) + if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry)) goto fail_unlock; /* May fail (-ENOMEM) if XArray node allocation failed. 
*/ - if (add_to_swap_cache(folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) + if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) goto fail_unlock; mem_cgroup_swapin_uncharge_swap(entry); if (shadow) - workingset_refault(folio, shadow); + workingset_refault(new_folio, shadow); - /* Caller will initiate read into locked folio */ - folio_add_lru(folio); + /* Caller will initiate read into locked new_folio */ + folio_add_lru(new_folio); *new_page_allocated = true; + folio = new_folio; got_folio: - put_swap_device(si); - return folio; + result = folio; + goto put_and_return; fail_unlock: - put_swap_folio(folio, entry); - folio_unlock(folio); - folio_put(folio); -fail_put_swap: + put_swap_folio(new_folio, entry); + folio_unlock(new_folio); +put_and_return: put_swap_device(si); - return NULL; + if (!(*new_page_allocated) && new_folio) + folio_put(new_folio); + return result; } /* From c5519e0a9bfb6365445de0521a6fe7000b375ecf Mon Sep 17 00:00:00 2001 From: Takero Funaki Date: Wed, 31 Jul 2024 00:49:09 +0000 Subject: [PATCH 058/416] mm: zswap: fix global shrinker memcg iteration Patch series "mm: zswap: fixes for global shrinker", v5. This series addresses issues in the zswap global shrinker that could not shrink stored pages. With this series, the shrinker continues to shrink pages until it reaches the accept threshold more reliably, gives much higher writeback when the zswap pool limit is hit. This patch (of 2): This patch fixes an issue where the zswap global shrinker stopped iterating through the memcg tree. The problem was that shrink_worker() would restart iterating memcg tree from the tree root, considering an offline memcg as a failure, and abort shrinking after encountering the same offline memcg 16 times even if there is only one offline memcg. After this change, an offline memcg in the tree is no longer considered a failure. This allows the shrinker to continue shrinking the other online memcgs regardless of whether an offline memcg exists, gives higher zswap writeback activity. To avoid holding refcount of offline memcg encountered during the memcg tree walking, shrink_worker() must continue iterating to release the offline memcg to ensure the next memcg stored in the cursor is online. The offline memcg cleaner has also been changed to avoid the same issue. When the next memcg of the offlined memcg is also offline, the refcount stored in the iteration cursor was held until the next shrink_worker() run. The cleaner must release the offline memcg recursively. [yosryahmed@google.com: make critical section more obvious, unify comments] Link: https://lkml.kernel.org/r/CAJD7tkaScz+SbB90Q1d5mMD70UfM2a-J2zhXDT9sePR7Qap45Q@mail.gmail.com Link: https://lkml.kernel.org/r/20240731004918.33182-1-flintglass@gmail.com Link: https://lkml.kernel.org/r/20240731004918.33182-2-flintglass@gmail.com Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware") Signed-off-by: Takero Funaki Signed-off-by: Yosry Ahmed Acked-by: Yosry Ahmed Reviewed-by: Chengming Zhou Reviewed-by: Nhat Pham Cc: Chengming Zhou Cc: Johannes Weiner Signed-off-by: Andrew Morton --- mm/zswap.c | 86 +++++++++++++++++++++++++++++++++--------------------- 1 file changed, 52 insertions(+), 34 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index adeaf9c97fde..bd97f9824d8a 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -765,12 +765,25 @@ void zswap_folio_swapin(struct folio *folio) } } +/* + * This function should be called when a memcg is being offlined. 
+ * + * Since the global shrinker shrink_worker() may hold a reference + * of the memcg, we must check and release the reference in + * zswap_next_shrink. + * + * shrink_worker() must handle the case where this function releases + * the reference of memcg being shrunk. + */ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) { /* lock out zswap shrinker walking memcg tree */ spin_lock(&zswap_shrink_lock); - if (zswap_next_shrink == memcg) - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL); + if (zswap_next_shrink == memcg) { + do { + zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL); + } while (zswap_next_shrink && !mem_cgroup_online(zswap_next_shrink)); + } spin_unlock(&zswap_shrink_lock); } @@ -1304,44 +1317,49 @@ static void shrink_worker(struct work_struct *w) /* Reclaim down to the accept threshold */ thr = zswap_accept_thr_pages(); - /* global reclaim will select cgroup in a round-robin fashion. */ + /* + * Global reclaim will select cgroup in a round-robin fashion. + * + * We save iteration cursor memcg into zswap_next_shrink, + * which can be modified by the offline memcg cleaner + * zswap_memcg_offline_cleanup(). + * + * Since the offline cleaner is called only once, we cannot leave an + * offline memcg reference in zswap_next_shrink. + * We can rely on the cleaner only if we get online memcg under lock. + * + * If we get an offline memcg, we cannot determine if the cleaner has + * already been called or will be called later. We must put back the + * reference before returning from this function. Otherwise, the + * offline memcg left in zswap_next_shrink will hold the reference + * until the next run of shrink_worker(). + */ do { - spin_lock(&zswap_shrink_lock); - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL); - memcg = zswap_next_shrink; - /* - * We need to retry if we have gone through a full round trip, or if we - * got an offline memcg (or else we risk undoing the effect of the - * zswap memcg offlining cleanup callback). This is not catastrophic - * per se, but it will keep the now offlined memcg hostage for a while. + * Start shrinking from the next memcg after zswap_next_shrink. + * When the offline cleaner has already advanced the cursor, + * advancing the cursor here overlooks one memcg, but this + * should be negligibly rare. * - * Note that if we got an online memcg, we will keep the extra - * reference in case the original reference obtained by mem_cgroup_iter - * is dropped by the zswap memcg offlining callback, ensuring that the - * memcg is not killed when we are reclaiming. + * If we get an online memcg, keep the extra reference in case + * the original one obtained by mem_cgroup_iter() is dropped by + * zswap_memcg_offline_cleanup() while we are shrinking the + * memcg. 
 */ - if (!memcg) { - spin_unlock(&zswap_shrink_lock); - if (++failures == MAX_RECLAIM_RETRIES) - break; - - goto resched; - } - - if (!mem_cgroup_tryget_online(memcg)) { - /* drop the reference from mem_cgroup_iter() */ - mem_cgroup_iter_break(NULL, memcg); - zswap_next_shrink = NULL; - spin_unlock(&zswap_shrink_lock); - - if (++failures == MAX_RECLAIM_RETRIES) - break; - - goto resched; - } + spin_lock(&zswap_shrink_lock); + do { + memcg = mem_cgroup_iter(NULL, zswap_next_shrink, NULL); + zswap_next_shrink = memcg; + } while (memcg && !mem_cgroup_tryget_online(memcg)); spin_unlock(&zswap_shrink_lock); + if (!memcg) { + if (++failures == MAX_RECLAIM_RETRIES) + break; + + goto resched; + } + ret = shrink_memcg(memcg); /* drop the extra reference */ mem_cgroup_put(memcg); From 81920438a6dce78e6eac17ee34f29bf6c32074b1 Mon Sep 17 00:00:00 2001 From: Takero Funaki Date: Wed, 31 Jul 2024 00:49:10 +0000 Subject: [PATCH 059/416] mm: zswap: fix global shrinker error handling logic This patch fixes the zswap global shrinker, which did not shrink the zpool as expected. The issue addressed is that shrink_worker() did not distinguish between unexpected errors and expected errors, such as failed writeback from an empty memcg. The shrinker would stop shrinking after iterating through the memcg tree 16 times, even if there was only one empty memcg. With this patch, the shrinker no longer considers encountering an empty memcg, encountering a memcg with writeback disabled, or reaching the end of a memcg tree walk as a failure, as long as there are memcgs that are candidates for writeback. Systems with one or more empty memcgs will now observe significantly higher zswap writeback activity after the zswap pool limit is hit. To avoid an infinite loop when there are no writeback candidates, this patch tracks writeback attempts during memcg tree walks and limits retries if no writeback candidates are found. To handle the empty memcg case, the helper function shrink_memcg() is modified to check if the memcg is empty and then return -ENOENT. Link: https://lkml.kernel.org/r/20240731004918.33182-3-flintglass@gmail.com Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware") Signed-off-by: Takero Funaki Reviewed-by: Chengming Zhou Reviewed-by: Nhat Pham Cc: Johannes Weiner Cc: Yosry Ahmed Signed-off-by: Andrew Morton --- mm/zswap.c | 40 +++++++++++++++++++++++++++++++++------- 1 file changed, 33 insertions(+), 7 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index bd97f9824d8a..71b75ff1f3fb 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1287,10 +1287,10 @@ static struct shrinker *zswap_alloc_shrinker(void) static int shrink_memcg(struct mem_cgroup *memcg) { - int nid, shrunk = 0; + int nid, shrunk = 0, scanned = 0; if (!mem_cgroup_zswap_writeback_enabled(memcg)) - return -EINVAL; + return -ENOENT; /* * Skip zombies because their LRUs are reparented and we would be @@ -1304,21 +1304,34 @@ static int shrink_memcg(struct mem_cgroup *memcg) shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg, &shrink_memcg_cb, NULL, &nr_to_walk); + scanned += 1 - nr_to_walk; } + + if (!scanned) + return -ENOENT; + return shrunk ? 0 : -EAGAIN; } static void shrink_worker(struct work_struct *w) { struct mem_cgroup *memcg; - int ret, failures = 0; + int ret, failures = 0, attempts = 0; unsigned long thr; /* Reclaim down to the accept threshold */ thr = zswap_accept_thr_pages(); /* - * Global reclaim will select cgroup in a round-robin fashion.
+ * Global reclaim will select cgroup in a round-robin fashion from all + * online memcgs, but memcgs that have no pages in zswap and + * writeback-disabled memcgs (memory.zswap.writeback=0) are not + * candidates for shrinking. + * + * Shrinking will be aborted if we encounter the following + * MAX_RECLAIM_RETRIES times: + * - No writeback-candidate memcgs found in a memcg tree walk. + * - Shrinking a writeback-candidate memcg failed. * * We save iteration cursor memcg into zswap_next_shrink, * which can be modified by the offline memcg cleaner @@ -1354,9 +1367,14 @@ static void shrink_worker(struct work_struct *w) spin_unlock(&zswap_shrink_lock); if (!memcg) { - if (++failures == MAX_RECLAIM_RETRIES) + /* + * Continue shrinking without incrementing failures if + * we found candidate memcgs in the last tree walk. + */ + if (!attempts && ++failures == MAX_RECLAIM_RETRIES) break; + attempts = 0; goto resched; } @@ -1364,8 +1382,16 @@ static void shrink_worker(struct work_struct *w) /* drop the extra reference */ mem_cgroup_put(memcg); - if (ret == -EINVAL) - break; + /* + * There are no writeback-candidate pages in the memcg. + * This is not an issue as long as we can find another memcg + * with pages in zswap. Skip this without incrementing attempts + * and failures. + */ + if (ret == -ENOENT) + continue; + ++attempts; + if (ret && ++failures == MAX_RECLAIM_RETRIES) break; resched: From 6d192303e82c7f7119020418a4c3029bebf9a0e6 Mon Sep 17 00:00:00 2001 From: Kaiyang Zhao Date: Thu, 1 Aug 2024 18:04:56 +0000 Subject: [PATCH 060/416] mm: consider CMA pages in watermark check for NUMA balancing target node Currently in migrate_balanced_pgdat(), ALLOC_CMA flag is not passed when checking watermark on the migration target node. This does not match the gfp in alloc_misplaced_dst_folio() which allows allocation from CMA. This causes promotion failures when there are a lot of available CMA memory in the system. Therefore, we change the alloc_flags passed to zone_watermark_ok() in migrate_balanced_pgdat(). Link: https://lkml.kernel.org/r/20240801180456.25927-1-kaiyang2@cs.cmu.edu Signed-off-by: Kaiyang Zhao Acked-by: Johannes Weiner Reviewed-by: Baolin Wang Signed-off-by: Andrew Morton --- mm/migrate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index 8578a930cad1..aa482c954cb0 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2526,7 +2526,7 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat, if (!zone_watermark_ok(zone, 0, high_wmark_pages(zone) + nr_migrate_pages, - ZONE_MOVABLE, 0)) + ZONE_MOVABLE, ALLOC_CMA)) continue; return true; } From 03790c51a475c46647079e67e2460a149938bfd6 Mon Sep 17 00:00:00 2001 From: Kaiyang Zhao Date: Thu, 1 Aug 2024 23:25:47 +0000 Subject: [PATCH 061/416] mm: create promo_wmark_pages and clean up open-coded sites Patch series "mm: print the promo watermark in zoneinfo", v2. This patch (of 2): Define promo_wmark_pages and convert current call sites of wmark_pages with fixed WMARK_PROMO to using it instead. 
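For illustration, a hedged sketch of the conversion at a typical call site (mirroring the hunks below); the helper keeps the existing semantics of wmark_pages(zone, WMARK_PROMO):

	/* before */
	if (zone_watermark_ok(zone, 0,
			      wmark_pages(zone, WMARK_PROMO) + enough_wmark,
			      ZONE_MOVABLE, 0))
		...
	/* after */
	if (zone_watermark_ok(zone, 0,
			      promo_wmark_pages(zone) + enough_wmark,
			      ZONE_MOVABLE, 0))
		...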
Link: https://lkml.kernel.org/r/20240801232548.36604-1-kaiyang2@cs.cmu.edu Link: https://lkml.kernel.org/r/20240801232548.36604-2-kaiyang2@cs.cmu.edu Signed-off-by: Kaiyang Zhao Cc: Johannes Weiner Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 1 + kernel/sched/fair.c | 2 +- mm/vmscan.c | 2 +- 3 files changed, 3 insertions(+), 2 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index dbb69f96a152..9f3589f7f05c 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -669,6 +669,7 @@ enum zone_watermarks { #define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost) #define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost) #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost) +#define promo_wmark_pages(z) (z->_watermark[WMARK_PROMO] + z->watermark_boost) #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost) /* diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 416e29b56cc4..20ea6d085305 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1742,7 +1742,7 @@ static bool pgdat_free_space_enough(struct pglist_data *pgdat) continue; if (zone_watermark_ok(zone, 0, - wmark_pages(zone, WMARK_PROMO) + enough_wmark, + promo_wmark_pages(zone) + enough_wmark, ZONE_MOVABLE, 0)) return true; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 852d97c134df..6b70e80ca0ee 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -6669,7 +6669,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx) continue; if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) - mark = wmark_pages(zone, WMARK_PROMO); + mark = promo_wmark_pages(zone); else mark = high_wmark_pages(zone); if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx)) From 528afe6b9605dee42798c78deadba62e02e89d0f Mon Sep 17 00:00:00 2001 From: Kaiyang Zhao Date: Thu, 1 Aug 2024 23:25:48 +0000 Subject: [PATCH 062/416] mm: print the promo watermark in zoneinfo Print the promo watermark in zoneinfo just like other watermarks. This helps users check and verify all the watermarks are appropriate. Link: https://lkml.kernel.org/r/20240801232548.36604-3-kaiyang2@cs.cmu.edu Signed-off-by: Kaiyang Zhao Cc: Johannes Weiner Signed-off-by: Andrew Morton --- mm/vmstat.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/vmstat.c b/mm/vmstat.c index 2dc97baf7d62..aea58e9fce60 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1743,6 +1743,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, "\n min %lu" "\n low %lu" "\n high %lu" + "\n promo %lu" "\n spanned %lu" "\n present %lu" "\n managed %lu" @@ -1752,6 +1753,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, min_wmark_pages(zone), low_wmark_pages(zone), high_wmark_pages(zone), + promo_wmark_pages(zone), zone->spanned_pages, zone->present_pages, zone_managed_pages(zone), From 620943d7ee69070df8844235d58843af48ac70e2 Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Thu, 1 Aug 2024 16:50:05 -0700 Subject: [PATCH 063/416] include/linux/mmzone.h: clean up watermark accessors - we have a helper wmark_pages(). Teach min_wmark_pages(), low_wmark_pages(), high_wmark_pages() and promo_wmark_pages() to use it instead of open-coding its implementation. - there's no reason to implement all these things as macros. Redo them in C. 
Acked-by: Johannes Weiner Cc: Kaiyang Zhao Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 32 ++++++++++++++++++++++++++------ 1 file changed, 26 insertions(+), 6 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 9f3589f7f05c..17506e4a2835 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -666,12 +666,6 @@ enum zone_watermarks { #define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1)) #define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP) -#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost) -#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost) -#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost) -#define promo_wmark_pages(z) (z->_watermark[WMARK_PROMO] + z->watermark_boost) -#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost) - /* * Flags used in pcp->flags field. * @@ -1017,6 +1011,32 @@ enum zone_flags { ZONE_BELOW_HIGH, /* zone is below high watermark. */ }; +static inline unsigned long wmark_pages(const struct zone *z, + enum zone_watermarks w) +{ + return z->_watermark[w] + z->watermark_boost; +} + +static inline unsigned long min_wmark_pages(const struct zone *z) +{ + return wmark_pages(z, WMARK_MIN); +} + +static inline unsigned long low_wmark_pages(const struct zone *z) +{ + return wmark_pages(z, WMARK_LOW); +} + +static inline unsigned long high_wmark_pages(const struct zone *z) +{ + return wmark_pages(z, WMARK_HIGH); +} + +static inline unsigned long promo_wmark_pages(const struct zone *z) +{ + return wmark_pages(z, WMARK_PROMO); +} + static inline unsigned long zone_managed_pages(struct zone *zone) { return (unsigned long)atomic_long_read(&zone->managed_pages); From 3523a37e657c7cc97723a54e9dfacca6c003e4c1 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 2 Aug 2024 17:55:14 +0200 Subject: [PATCH 064/416] mm: provide vm_normal_(page|folio)_pmd() with CONFIG_PGTABLE_HAS_HUGE_LEAVES Patch series "mm: replace follow_page() by folio_walk". Looking into a way of moving the last folio_likely_mapped_shared() call in add_folio_for_migration() under the PTL, I found myself removing follow_page(). This paves the way for cleaning up all the FOLL_, follow_* terminology to just be called "GUP" nowadays. The new page table walker will lookup a mapped folio and return to the caller with the PTL held, such that the folio cannot get unmapped concurrently. Callers can then conditionally decide whether they really want to take a short-term folio reference or whether the can simply unlock the PTL and be done with it. folio_walk is similar to page_vma_mapped_walk(), except that we don't know the folio we want to walk to and that we are only walking to exactly one PTE/PMD/PUD. folio_walk provides access to the pte/pmd/pud (and the referenced folio page because things like KSM need that), however, as part of this series no page table modifications are performed by users. We might be able to convert some other walk_page_range() users that really only walk to one address, such as DAMON with damon_mkold_ops/damon_young_ops. It might make sense to extend folio_walk in the future to optionally fault in a folio (if applicable), such that we can replace some get_user_pages() users that really only want to lookup a single page/folio under PTL without unconditionally grabbing a folio reference. 
I have plans to extend the approach to a range walker that will try batching various page table entries (not just folio pages) to be a better replacement for walk_page_range() -- and users will be able to opt in which type of page table entries they want to process -- but that will require more work and more thought. KSM seems to work just fine (ksm_functional_tests selftests) and move_pages seems to work (migration selftest). I tested the leaf implementation extensively using various hugetlb sizes (64K, 2M, 32M, 1G) on arm64 using move_pages and did some more testing on x86-64. Cross-compiled on a bunch of architectures. This patch (of 11): We want to make use of vm_normal_page_pmd() in generic page table walking code where we might walk hugetlb folios that are mapped by PMDs even without CONFIG_TRANSPARENT_HUGEPAGE. So let's expose vm_normal_page_pmd() + vm_normal_folio_pmd() with CONFIG_PGTABLE_HAS_HUGE_LEAVES. Link: https://lkml.kernel.org/r/20240802155524.517137-1-david@redhat.com Link: https://lkml.kernel.org/r/20240802155524.517137-2-david@redhat.com Signed-off-by: David Hildenbrand Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Claudio Imbrenda Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Janosch Frank Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Sven Schnelle Cc: Vasily Gorbik Cc: Ryan Roberts Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory.c b/mm/memory.c index cacf03bbdd7a..661168a46b43 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -666,7 +666,7 @@ struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr, return NULL; } -#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t pmd) { From aa39ca6940f1a0540f2984051b3089972f42959b Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 2 Aug 2024 17:55:15 +0200 Subject: [PATCH 065/416] mm/pagewalk: introduce folio_walk_start() + folio_walk_end() We want to get rid of follow_page(), and have a more reasonable way to just look up a folio mapped at a certain address, perform some checks while still under PTL, and then only conditionally grab a folio reference if really required. Further, we might want to get rid of some walk_page_range*() users that really only want to temporarily look up a single folio at a single address. So let's add a new page table walker that does exactly that, similarly to GUP also being able to walk hugetlb VMAs. Add folio_walk_end() as a macro for now: the compiler is not easy to please with the pte_unmap()->kunmap_local(). Note that one difference between follow_page() and get_user_pages(1) is that follow_page() will not trigger faults to get something mapped. So folio_walk is at least currently not a replacement for get_user_pages(1), but could likely be extended/reused to achieve something similar in the future.
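As a rough usage sketch (hedged; this mirrors how the conversions later in this series use the new helpers), a caller looks up the folio under the PTL, optionally takes a short-term reference, and then ends the walk:

	struct folio_walk fw;
	struct folio *folio;

	mmap_read_lock(mm);			/* mmap lock held in read mode */
	vma = vma_lookup(mm, addr);
	if (vma) {
		folio = folio_walk_start(&fw, vma, addr, 0);
		if (folio) {
			/* PTL is held: the entry cannot change under us */
			if (folio_test_anon(folio))
				folio_get(folio);	/* optional short-term reference */
			folio_walk_end(&fw, vma);
		}
	}
	mmap_read_unlock(mm);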
Link: https://lkml.kernel.org/r/20240802155524.517137-3-david@redhat.com Signed-off-by: David Hildenbrand Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Claudio Imbrenda Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Janosch Frank Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Sven Schnelle Cc: Vasily Gorbik Cc: Ryan Roberts Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/pagewalk.h | 58 +++++++++++ mm/pagewalk.c | 202 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 260 insertions(+) diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h index 27cd1e59ccf7..f5eb5a32aeed 100644 --- a/include/linux/pagewalk.h +++ b/include/linux/pagewalk.h @@ -130,4 +130,62 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index, pgoff_t nr, const struct mm_walk_ops *ops, void *private); +typedef int __bitwise folio_walk_flags_t; + +/* + * Walk migration entries as well. Careful: a large folio might get split + * concurrently. + */ +#define FW_MIGRATION ((__force folio_walk_flags_t)BIT(0)) + +/* Walk shared zeropages (small + huge) as well. */ +#define FW_ZEROPAGE ((__force folio_walk_flags_t)BIT(1)) + +enum folio_walk_level { + FW_LEVEL_PTE, + FW_LEVEL_PMD, + FW_LEVEL_PUD, +}; + +/** + * struct folio_walk - folio_walk_start() / folio_walk_end() data + * @page: exact folio page referenced (if applicable) + * @level: page table level identifying the entry type + * @pte: pointer to the page table entry (FW_LEVEL_PTE). + * @pmd: pointer to the page table entry (FW_LEVEL_PMD). + * @pud: pointer to the page table entry (FW_LEVEL_PUD). + * @ptl: pointer to the page table lock. + * + * (see folio_walk_start() documentation for more details) + */ +struct folio_walk { + /* public */ + struct page *page; + enum folio_walk_level level; + union { + pte_t *ptep; + pud_t *pudp; + pmd_t *pmdp; + }; + union { + pte_t pte; + pud_t pud; + pmd_t pmd; + }; + /* private */ + struct vm_area_struct *vma; + spinlock_t *ptl; +}; + +struct folio *folio_walk_start(struct folio_walk *fw, + struct vm_area_struct *vma, unsigned long addr, + folio_walk_flags_t flags); + +#define folio_walk_end(__fw, __vma) do { \ + spin_unlock((__fw)->ptl); \ + if (likely((__fw)->level == FW_LEVEL_PTE)) \ + pte_unmap((__fw)->ptep); \ + vma_pgtable_walk_end(__vma); \ +} while (0) + #endif /* _LINUX_PAGEWALK_H */ diff --git a/mm/pagewalk.c b/mm/pagewalk.c index ae2f08ce991b..cd79fb3b89e5 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -3,6 +3,8 @@ #include #include #include +#include +#include /* * We want to know the real level where a entry is located ignoring any @@ -654,3 +656,203 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index, return err; } + +/** + * folio_walk_start - walk the page tables to a folio + * @fw: filled with information on success. + * @vma: the VMA. + * @addr: the virtual address to use for the page table walk. + * @flags: flags modifying which folios to walk to. + * + * Walk the page tables using @addr in a given @vma to a mapped folio and + * return the folio, making sure that the page table entry referenced by + * @addr cannot change until folio_walk_end() was called. + * + * As default, this function returns only folios that are not special (e.g., not + * the zeropage) and never returns folios that are supposed to be ignored by the + * VM as documented by vm_normal_page(). If requested, zeropages will be + * returned as well. + * + * As default, this function only considers present page table entries. 
+ * If requested, it will also consider migration entries. + * + * If this function returns NULL it might either indicate "there is nothing" or + * "there is nothing suitable". + * + * On success, @fw is filled and the function returns the folio while the PTL + * is still held and folio_walk_end() must be called to clean up, + * releasing any held locks. The returned folio must *not* be used after the + * call to folio_walk_end(), unless a short-term folio reference is taken before + * that call. + * + * @fw->page will correspond to the page that is effectively referenced by + * @addr. However, for migration entries and shared zeropages @fw->page is + * set to NULL. Note that large folios might be mapped by multiple page table + * entries, and this function will always only lookup a single entry as + * specified by @addr, which might or might not cover more than a single page of + * the returned folio. + * + * This function must *not* be used as a naive replacement for + * get_user_pages() / pin_user_pages(), especially not to perform DMA or + * to carelessly modify page content. This function may *only* be used to grab + * short-term folio references, never to grab long-term folio references. + * + * Using the page table entry pointers in @fw for reading or modifying the + * entry should be avoided where possible: however, there might be valid + * use cases. + * + * WARNING: Modifying page table entries in hugetlb VMAs requires a lot of care. + * For example, PMD page table sharing might require prior unsharing. Also, + * logical hugetlb entries might span multiple physical page table entries, + * which *must* be modified in a single operation (set_huge_pte_at(), + * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might + * not correspond to the first physical entry of a logical hugetlb entry. + * + * The mmap lock must be held in read mode. + * + * Return: folio pointer on success, otherwise NULL. + */ +struct folio *folio_walk_start(struct folio_walk *fw, + struct vm_area_struct *vma, unsigned long addr, + folio_walk_flags_t flags) +{ + unsigned long entry_size; + bool expose_page = true; + struct page *page; + pud_t *pudp, pud; + pmd_t *pmdp, pmd; + pte_t *ptep, pte; + spinlock_t *ptl; + pgd_t *pgdp; + p4d_t *p4dp; + + mmap_assert_locked(vma->vm_mm); + vma_pgtable_walk_begin(vma); + + if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end)) + goto not_found; + + pgdp = pgd_offset(vma->vm_mm, addr); + if (pgd_none_or_clear_bad(pgdp)) + goto not_found; + + p4dp = p4d_offset(pgdp, addr); + if (p4d_none_or_clear_bad(p4dp)) + goto not_found; + + pudp = pud_offset(p4dp, addr); + pud = pudp_get(pudp); + if (pud_none(pud)) + goto not_found; + if (IS_ENABLED(CONFIG_PGTABLE_HAS_HUGE_LEAVES) && pud_leaf(pud)) { + ptl = pud_lock(vma->vm_mm, pudp); + pud = pudp_get(pudp); + + entry_size = PUD_SIZE; + fw->level = FW_LEVEL_PUD; + fw->pudp = pudp; + fw->pud = pud; + + if (!pud_present(pud) || pud_devmap(pud)) { + spin_unlock(ptl); + goto not_found; + } else if (!pud_leaf(pud)) { + spin_unlock(ptl); + goto pmd_table; + } + /* + * TODO: vm_normal_page_pud() will be handy once we want to + * support PUD mappings in VM_PFNMAP|VM_MIXEDMAP VMAs. 
+ */ + page = pud_page(pud); + goto found; + } + +pmd_table: + VM_WARN_ON_ONCE(pud_leaf(*pudp)); + pmdp = pmd_offset(pudp, addr); + pmd = pmdp_get_lockless(pmdp); + if (pmd_none(pmd)) + goto not_found; + if (IS_ENABLED(CONFIG_PGTABLE_HAS_HUGE_LEAVES) && pmd_leaf(pmd)) { + ptl = pmd_lock(vma->vm_mm, pmdp); + pmd = pmdp_get(pmdp); + + entry_size = PMD_SIZE; + fw->level = FW_LEVEL_PMD; + fw->pmdp = pmdp; + fw->pmd = pmd; + + if (pmd_none(pmd)) { + spin_unlock(ptl); + goto not_found; + } else if (!pmd_leaf(pmd)) { + spin_unlock(ptl); + goto pte_table; + } else if (pmd_present(pmd)) { + page = vm_normal_page_pmd(vma, addr, pmd); + if (page) { + goto found; + } else if ((flags & FW_ZEROPAGE) && + is_huge_zero_pmd(pmd)) { + page = pfn_to_page(pmd_pfn(pmd)); + expose_page = false; + goto found; + } + } else if ((flags & FW_MIGRATION) && + is_pmd_migration_entry(pmd)) { + swp_entry_t entry = pmd_to_swp_entry(pmd); + + page = pfn_swap_entry_to_page(entry); + expose_page = false; + goto found; + } + spin_unlock(ptl); + goto not_found; + } + +pte_table: + VM_WARN_ON_ONCE(pmd_leaf(pmdp_get_lockless(pmdp))); + ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl); + if (!ptep) + goto not_found; + pte = ptep_get(ptep); + + entry_size = PAGE_SIZE; + fw->level = FW_LEVEL_PTE; + fw->ptep = ptep; + fw->pte = pte; + + if (pte_present(pte)) { + page = vm_normal_page(vma, addr, pte); + if (page) + goto found; + if ((flags & FW_ZEROPAGE) && + is_zero_pfn(pte_pfn(pte))) { + page = pfn_to_page(pte_pfn(pte)); + expose_page = false; + goto found; + } + } else if (!pte_none(pte)) { + swp_entry_t entry = pte_to_swp_entry(pte); + + if ((flags & FW_MIGRATION) && + is_migration_entry(entry)) { + page = pfn_swap_entry_to_page(entry); + expose_page = false; + goto found; + } + } + pte_unmap_unlock(ptep, ptl); +not_found: + vma_pgtable_walk_end(vma); + return NULL; +found: + if (expose_page) + /* Note: Offset from the mapped page, not the folio start. */ + fw->page = nth_page(page, (addr & (entry_size - 1)) >> PAGE_SHIFT); + else + fw->page = NULL; + fw->ptl = ptl; + return page_folio(page); +} From 46d6a9b4450b4f5ebf6e62d03f45800b70221c4f Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 2 Aug 2024 17:55:16 +0200 Subject: [PATCH 066/416] mm/migrate: convert do_pages_stat_array() from follow_page() to folio_walk Let's use folio_walk instead, so we can avoid taking a folio reference just to read the nid and get rid of another follow_page()/FOLL_DUMP user. Use FW_ZEROPAGE so we can return "-EFAULT" for it as documented. The possible return values for follow_page() were confusing, especially with FOLL_DUMP set. We'll handle it like documented in the man page: * -EFAULT: This is a zero page or the memory area is not mapped by the process. * -ENOENT: The page is not present. We'll keep setting -ENOENT for ZONE_DEVICE. Maybe not the right thing to do, but it likely doesn't really matter (just like for weird devmap, whereby we fake "not present"). Note that the other errors (-EACCESS, -EBUSY, -EIO, -EINVAL, -ENOMEM) so far only applied when actually moving pages, not when only querying stats. We'll effectively drop the "secretmem" check we had in follow_page(), but that shouldn't really matter here, we're not accessing folio/page content after all. 
Link: https://lkml.kernel.org/r/20240802155524.517137-4-david@redhat.com Signed-off-by: David Hildenbrand Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Claudio Imbrenda Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Janosch Frank Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Sven Schnelle Cc: Vasily Gorbik Cc: Ryan Roberts Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/migrate.c | 31 +++++++++++++++---------------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index aa482c954cb0..b5365a434ba9 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -50,6 +50,7 @@ #include #include #include +#include #include @@ -2331,28 +2332,26 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages, for (i = 0; i < nr_pages; i++) { unsigned long addr = (unsigned long)(*pages); struct vm_area_struct *vma; - struct page *page; + struct folio_walk fw; + struct folio *folio; int err = -EFAULT; vma = vma_lookup(mm, addr); if (!vma) goto set_status; - /* FOLL_DUMP to ignore special (like zero) pages */ - page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP); - - err = PTR_ERR(page); - if (IS_ERR(page)) - goto set_status; - - err = -ENOENT; - if (!page) - goto set_status; - - if (!is_zone_device_page(page)) - err = page_to_nid(page); - - put_page(page); + folio = folio_walk_start(&fw, vma, addr, FW_ZEROPAGE); + if (folio) { + if (is_zero_folio(folio) || is_huge_zero_folio(folio)) + err = -EFAULT; + else if (folio_is_zone_device(folio)) + err = -ENOENT; + else + err = folio_nid(folio); + folio_walk_end(&fw, vma); + } else { + err = -ENOENT; + } set_status: *status = err; From 7dff875c9436a9df2f93cf59a32630761565af99 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 2 Aug 2024 17:55:17 +0200 Subject: [PATCH 067/416] mm/migrate: convert add_page_for_migration() from follow_page() to folio_walk Let's use folio_walk instead, so we can avoid taking a folio reference when we won't even be trying to migrate the folio and to get rid of another follow_page()/FOLL_DUMP user. Use FW_ZEROPAGE so we can return "-EFAULT" for it as documented. We now perform the folio_likely_mapped_shared() check under PTL, which is what we want: relying on the mapcount and friends after dropping the PTL does not make too much sense, as the page can get unmapped concurrently from this process. Further, we perform the folio isolation under PTL, similar to how we handle it for MADV_PAGEOUT. The possible return values for follow_page() were confusing, especially with FOLL_DUMP set. We'll handle it as documented in the man page: * -EFAULT: This is a zero page or the memory area is not mapped by the process. * -ENOENT: The page is not present. We'll keep setting -ENOENT for ZONE_DEVICE. Maybe not the right thing to do, but it likely doesn't really matter (just like for weird devmap, whereby we fake "not present"). The other errors are left as is, and match the documentation in the man page. While at it, rename add_page_for_migration() to add_folio_for_migration(). We'll lose the "secretmem" check, but that shouldn't really matter because these folios cannot ever be migrated. Should vma_migratable() refuse these VMAs? Maybe.
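To recap the per-page status values after this conversion (a hedged summary of the behaviour described above; compare move_pages(2)):

	/*
	 *  -EFAULT  zero(-huge) page, or the address is not covered by a
	 *           migratable VMA
	 *  -ENOENT  no page is present (ZONE_DEVICE is reported this way too)
	 *  -EACCES  the folio is likely mapped by multiple processes and
	 *           MPOL_MF_MOVE_ALL was not specified
	 *  -EBUSY   the folio could not be isolated
	 *   0       already on the target node
	 *   1       queued for migration
	 */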
Link: https://lkml.kernel.org/r/20240802155524.517137-5-david@redhat.com Signed-off-by: David Hildenbrand Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Claudio Imbrenda Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Janosch Frank Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Sven Schnelle Cc: Vasily Gorbik Cc: Ryan Roberts Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/migrate.c | 104 +++++++++++++++++++++++---------------------------- 1 file changed, 47 insertions(+), 57 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index b5365a434ba9..e1383d9cc944 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2112,76 +2112,66 @@ static int do_move_pages_to_node(struct list_head *pagelist, int node) return err; } -/* - * Resolves the given address to a struct page, isolates it from the LRU and - * puts it to the given pagelist. - * Returns: - * errno - if the page cannot be found/isolated - * 0 - when it doesn't have to be migrated because it is already on the - * target node - * 1 - when it has been queued - */ -static int add_page_for_migration(struct mm_struct *mm, const void __user *p, - int node, struct list_head *pagelist, bool migrate_all) +static int __add_folio_for_migration(struct folio *folio, int node, + struct list_head *pagelist, bool migrate_all) { - struct vm_area_struct *vma; - unsigned long addr; - struct page *page; - struct folio *folio; - int err; + if (is_zero_folio(folio) || is_huge_zero_folio(folio)) + return -EFAULT; - mmap_read_lock(mm); - addr = (unsigned long)untagged_addr_remote(mm, p); - - err = -EFAULT; - vma = vma_lookup(mm, addr); - if (!vma || !vma_migratable(vma)) - goto out; - - /* FOLL_DUMP to ignore special (like zero) pages */ - page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP); - - err = PTR_ERR(page); - if (IS_ERR(page)) - goto out; - - err = -ENOENT; - if (!page) - goto out; - - folio = page_folio(page); if (folio_is_zone_device(folio)) - goto out_putfolio; + return -ENOENT; - err = 0; if (folio_nid(folio) == node) - goto out_putfolio; + return 0; - err = -EACCES; if (folio_likely_mapped_shared(folio) && !migrate_all) - goto out_putfolio; + return -EACCES; - err = -EBUSY; if (folio_test_hugetlb(folio)) { if (isolate_hugetlb(folio, pagelist)) - err = 1; - } else { - if (!folio_isolate_lru(folio)) - goto out_putfolio; - - err = 1; + return 1; + } else if (folio_isolate_lru(folio)) { list_add_tail(&folio->lru, pagelist); node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio), folio_nr_pages(folio)); + return 1; + } + return -EBUSY; +} + +/* + * Resolves the given address to a struct folio, isolates it from the LRU and + * puts it to the given pagelist. 
+ * Returns: + * errno - if the folio cannot be found/isolated + * 0 - when it doesn't have to be migrated because it is already on the + * target node + * 1 - when it has been queued + */ +static int add_folio_for_migration(struct mm_struct *mm, const void __user *p, + int node, struct list_head *pagelist, bool migrate_all) +{ + struct vm_area_struct *vma; + struct folio_walk fw; + struct folio *folio; + unsigned long addr; + int err = -EFAULT; + + mmap_read_lock(mm); + addr = (unsigned long)untagged_addr_remote(mm, p); + + vma = vma_lookup(mm, addr); + if (vma && vma_migratable(vma)) { + folio = folio_walk_start(&fw, vma, addr, FW_ZEROPAGE); + if (folio) { + err = __add_folio_for_migration(folio, node, pagelist, + migrate_all); + folio_walk_end(&fw, vma); + } else { + err = -ENOENT; + } } -out_putfolio: - /* - * Either remove the duplicate refcount from folio_isolate_lru() - * or drop the folio ref if it was not isolated. - */ - folio_put(folio); -out: mmap_read_unlock(mm); return err; } @@ -2275,8 +2265,8 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, * Errors in the page lookup or isolation are not fatal and we simply * report them via status */ - err = add_page_for_migration(mm, p, current_node, &pagelist, - flags & MPOL_MF_MOVE_ALL); + err = add_folio_for_migration(mm, p, current_node, &pagelist, + flags & MPOL_MF_MOVE_ALL); if (err > 0) { /* The page is successfully queued for migration */ From 184e916c628bbf51f85389e67b89ac95bde64188 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 2 Aug 2024 17:55:18 +0200 Subject: [PATCH 068/416] mm/ksm: convert get_mergeable_page() from follow_page() to folio_walk Let's use folio_walk instead, for example avoiding taking temporary folio references if the folio does not even apply and getting rid of one more follow_page() user. Note that zeropages obviously don't apply: old code could just have specified FOLL_DUMP. Anon folios are never secretmem, so we don't care about losing the check in follow_page(). 
Link: https://lkml.kernel.org/r/20240802155524.517137-6-david@redhat.com Signed-off-by: David Hildenbrand Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Claudio Imbrenda Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Janosch Frank Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Sven Schnelle Cc: Vasily Gorbik Cc: Ryan Roberts Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/ksm.c | 26 ++++++++++++++------------ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/mm/ksm.c b/mm/ksm.c index 14d9e53b1ec2..742b005f3f77 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -767,26 +767,28 @@ static struct page *get_mergeable_page(struct ksm_rmap_item *rmap_item) struct mm_struct *mm = rmap_item->mm; unsigned long addr = rmap_item->address; struct vm_area_struct *vma; - struct page *page; + struct page *page = NULL; + struct folio_walk fw; + struct folio *folio; mmap_read_lock(mm); vma = find_mergeable_vma(mm, addr); if (!vma) goto out; - page = follow_page(vma, addr, FOLL_GET); - if (IS_ERR_OR_NULL(page)) - goto out; - if (is_zone_device_page(page)) - goto out_putpage; - if (PageAnon(page)) { + folio = folio_walk_start(&fw, vma, addr, 0); + if (folio) { + if (!folio_is_zone_device(folio) && + folio_test_anon(folio)) { + folio_get(folio); + page = fw.page; + } + folio_walk_end(&fw, vma); + } +out: + if (page) { flush_anon_page(vma, page, addr); flush_dcache_page(page); - } else { -out_putpage: - put_page(page); -out: - page = NULL; } mmap_read_unlock(mm); return page; From b1d3e9bbccb4d295fbbce38b38478ba117e875ae Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 2 Aug 2024 17:55:19 +0200 Subject: [PATCH 069/416] mm/ksm: convert scan_get_next_rmap_item() from follow_page() to folio_walk Let's use folio_walk instead, for example avoiding taking temporary folio references if the folio does obviously not even apply and getting rid of one more follow_page() user. We cannot move all handling under the PTL, so leave the rmap handling (which implies an allocation) out. Note that zeropages obviously don't apply: old code could just have specified FOLL_DUMP. Further, we don't care about losing the secretmem check in follow_page(): these are never anon pages and vma_ksm_compatible() would never consider secretmem vmas (VM_SHARED | VM_MAYSHARE must be set for secretmem, see secretmem_mmap()). 
Link: https://lkml.kernel.org/r/20240802155524.517137-7-david@redhat.com Signed-off-by: David Hildenbrand Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Claudio Imbrenda Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Janosch Frank Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Sven Schnelle Cc: Vasily Gorbik Cc: Ryan Roberts Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/ksm.c | 38 ++++++++++++++++++++++++-------------- 1 file changed, 24 insertions(+), 14 deletions(-) diff --git a/mm/ksm.c b/mm/ksm.c index 742b005f3f77..0f5b2bba4ef0 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -2564,36 +2564,46 @@ next_mm: ksm_scan.address = vma->vm_end; while (ksm_scan.address < vma->vm_end) { + struct page *tmp_page = NULL; + struct folio_walk fw; + struct folio *folio; + if (ksm_test_exit(mm)) break; - *page = follow_page(vma, ksm_scan.address, FOLL_GET); - if (IS_ERR_OR_NULL(*page)) { - ksm_scan.address += PAGE_SIZE; - cond_resched(); - continue; + + folio = folio_walk_start(&fw, vma, ksm_scan.address, 0); + if (folio) { + if (!folio_is_zone_device(folio) && + folio_test_anon(folio)) { + folio_get(folio); + tmp_page = fw.page; + } + folio_walk_end(&fw, vma); } - if (is_zone_device_page(*page)) - goto next_page; - if (PageAnon(*page)) { - flush_anon_page(vma, *page, ksm_scan.address); - flush_dcache_page(*page); + + if (tmp_page) { + flush_anon_page(vma, tmp_page, ksm_scan.address); + flush_dcache_page(tmp_page); rmap_item = get_next_rmap_item(mm_slot, ksm_scan.rmap_list, ksm_scan.address); if (rmap_item) { ksm_scan.rmap_list = &rmap_item->rmap_list; - if (should_skip_rmap_item(*page, rmap_item)) + if (should_skip_rmap_item(tmp_page, rmap_item)) { + folio_put(folio); goto next_page; + } ksm_scan.address += PAGE_SIZE; - } else - put_page(*page); + *page = tmp_page; + } else { + folio_put(folio); + } mmap_read_unlock(mm); return rmap_item; } next_page: - put_page(*page); ksm_scan.address += PAGE_SIZE; cond_resched(); } From 8710f6ed34e7bcf98971d65acfb5eaed5fc469b0 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 2 Aug 2024 17:55:20 +0200 Subject: [PATCH 070/416] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk Let's remove yet another follow_page() user. Note that we have to do the split without holding the PTL, after folio_walk_end(). We don't care about losing the secretmem check in follow_page(). 
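A condensed sketch of the resulting loop body in split_huge_pages_pid() (hedged; taken from the hunks below), showing that the folio reference is taken and the walk ended before splitting:

	folio = folio_walk_start(&fw, vma, addr, 0);
	if (!folio)
		continue;
	...
	if (!folio_trylock(folio))
		goto next;
	folio_get(folio);
	folio_walk_end(&fw, vma);	/* drop the PTL before splitting */

	if (!split_folio_to_order(folio, new_order))
		split++;

	folio_unlock(folio);
	folio_put(folio);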
[david@redhat.com: teach can_split_folio() that we are not holding an additional reference] Link: https://lkml.kernel.org/r/c75d1c6c-8ea6-424f-853c-1ccda6c77ba2@redhat.com Link: https://lkml.kernel.org/r/20240802155524.517137-8-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Zi Yan Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Claudio Imbrenda Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Janosch Frank Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Sven Schnelle Cc: Vasily Gorbik Cc: Ryan Roberts Signed-off-by: Andrew Morton --- include/linux/huge_mm.h | 4 ++-- mm/huge_memory.c | 27 ++++++++++++++++----------- mm/vmscan.c | 2 +- 3 files changed, 19 insertions(+), 14 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index e25d9ebfdf89..ce44caa40eed 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -314,7 +314,7 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add unsigned long len, unsigned long pgoff, unsigned long flags, vm_flags_t vm_flags); -bool can_split_folio(struct folio *folio, int *pextra_pins); +bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins); int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, unsigned int new_order); static inline int split_huge_page(struct page *page) @@ -470,7 +470,7 @@ thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr, } static inline bool -can_split_folio(struct folio *folio, int *pextra_pins) +can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins) { return false; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 29e76d285e4b..666fa675e5b6 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -40,6 +40,7 @@ #include #include #include +#include #include #include @@ -3017,7 +3018,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, } /* Racy check whether the huge page can be split */ -bool can_split_folio(struct folio *folio, int *pextra_pins) +bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins) { int extra_pins; @@ -3029,7 +3030,8 @@ bool can_split_folio(struct folio *folio, int *pextra_pins) extra_pins = folio_nr_pages(folio); if (pextra_pins) *pextra_pins = extra_pins; - return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins - 1; + return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins - + caller_pins; } /* @@ -3197,7 +3199,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, * Racy check if we can split the page, before unmap_folio() will * split PMDs */ - if (!can_split_folio(folio, &extra_pins)) { + if (!can_split_folio(folio, 1, &extra_pins)) { ret = -EAGAIN; goto out_unlock; } @@ -3504,7 +3506,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start, */ for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) { struct vm_area_struct *vma = vma_lookup(mm, addr); - struct page *page; + struct folio_walk fw; struct folio *folio; if (!vma) @@ -3516,13 +3518,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start, continue; } - /* FOLL_DUMP to ignore special (like zero) pages */ - page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP); - - if (IS_ERR_OR_NULL(page)) + folio = folio_walk_start(&fw, vma, addr, 0); + if (!folio) continue; - folio = page_folio(page); if (!is_transparent_hugepage(folio)) goto next; @@ -3536,18 +3535,24 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start, * can be split or not. 
So skip the check here. */ if (!folio_test_private(folio) && - !can_split_folio(folio, NULL)) + !can_split_folio(folio, 0, NULL)) goto next; if (!folio_trylock(folio)) goto next; + folio_get(folio); + folio_walk_end(&fw, vma); if (!split_folio_to_order(folio, new_order)) split++; folio_unlock(folio); -next: folio_put(folio); + + cond_resched(); + continue; +next: + folio_walk_end(&fw, vma); cond_resched(); } mmap_read_unlock(mm); diff --git a/mm/vmscan.c b/mm/vmscan.c index 6b70e80ca0ee..96ce889ea3d0 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1227,7 +1227,7 @@ retry: goto keep_locked; if (folio_test_large(folio)) { /* cannot split folio, skip it */ - if (!can_split_folio(folio, NULL)) + if (!can_split_folio(folio, 1, NULL)) goto activate_locked; /* * Split partially mapped folios right away. From 85a7e5432dba4c750129797feb399a7bff8b241f Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 2 Aug 2024 17:55:21 +0200 Subject: [PATCH 071/416] s390/uv: convert gmap_destroy_page() from follow_page() to folio_walk Let's get rid of another follow_page() user and perform the UV calls under PTL -- which likely should be fine. No need for an additional reference while holding the PTL: uv_destroy_folio() and uv_convert_from_secure_folio() raise the refcount, so any concurrent make_folio_secure() would see an unexpted reference and cannot set PG_arch_1 concurrently. Do we really need a writable PTE? Likely yes, because the "destroy" part is, in comparison to the export, a destructive operation. So we'll keep the writability check for now. We'll lose the secretmem check from follow_page(). Likely we don't care about that here. Link: https://lkml.kernel.org/r/20240802155524.517137-9-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Claudio Imbrenda Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Janosch Frank Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Sven Schnelle Cc: Vasily Gorbik Cc: Ryan Roberts Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/s390/kernel/uv.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c index 35ed2aea8891..9646f773208a 100644 --- a/arch/s390/kernel/uv.c +++ b/arch/s390/kernel/uv.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -462,9 +463,9 @@ EXPORT_SYMBOL_GPL(gmap_convert_to_secure); int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr) { struct vm_area_struct *vma; + struct folio_walk fw; unsigned long uaddr; struct folio *folio; - struct page *page; int rc; rc = -EFAULT; @@ -483,11 +484,15 @@ int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr) goto out; rc = 0; - /* we take an extra reference here */ - page = follow_page(vma, uaddr, FOLL_WRITE | FOLL_GET); - if (IS_ERR_OR_NULL(page)) + folio = folio_walk_start(&fw, vma, uaddr, 0); + if (!folio) goto out; - folio = page_folio(page); + /* + * See gmap_make_secure(): large folios cannot be secure. Small + * folio implies FW_LEVEL_PTE. 
+ */ + if (folio_test_large(folio) || !pte_write(fw.pte)) + goto out_walk_end; rc = uv_destroy_folio(folio); /* * Fault handlers can race; it is possible that two CPUs will fault @@ -500,7 +505,8 @@ int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr) */ if (rc) rc = uv_convert_from_secure_folio(folio); - folio_put(folio); +out_walk_end: + folio_walk_end(&fw, vma); out: mmap_read_unlock(gmap->mm); return rc; From 0b31a3cef4462ef99ef4667291091ecbcfcc8749 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 2 Aug 2024 17:55:22 +0200 Subject: [PATCH 072/416] s390/mm/fault: convert do_secure_storage_access() from follow_page() to folio_walk Let's get rid of another follow_page() user and perform the conversion under PTL: Note that this is also what follow_page_pte() ends up doing. Unfortunately we cannot currently optimize out the additional reference, because arch_make_folio_accessible() must be called with a raised refcount to protect against concurrent conversion to secure. We can just move the arch_make_folio_accessible() under the PTL, like follow_page_pte() would. We'll effectively drop the "writable" check implied by FOLL_WRITE: follow_page_pte() would also not check that when calling arch_make_folio_accessible(), so there is no good reason for doing that here. We'll lose the secretmem check from follow_page() as well, about which we shouldn't really care. Link: https://lkml.kernel.org/r/20240802155524.517137-10-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Claudio Imbrenda Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Janosch Frank Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Sven Schnelle Cc: Vasily Gorbik Cc: Ryan Roberts Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/s390/mm/fault.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index 8e149ef5e89b..ad8b0d6b77ea 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -34,6 +34,7 @@ #include #include #include +#include #include #include #include @@ -492,9 +493,9 @@ void do_secure_storage_access(struct pt_regs *regs) union teid teid = { .val = regs->int_parm_long }; unsigned long addr = get_fault_address(regs); struct vm_area_struct *vma; + struct folio_walk fw; struct mm_struct *mm; struct folio *folio; - struct page *page; struct gmap *gmap; int rc; @@ -536,15 +537,18 @@ void do_secure_storage_access(struct pt_regs *regs) vma = find_vma(mm, addr); if (!vma) return handle_fault_error(regs, SEGV_MAPERR); - page = follow_page(vma, addr, FOLL_WRITE | FOLL_GET); - if (IS_ERR_OR_NULL(page)) { + folio = folio_walk_start(&fw, vma, addr, 0); + if (!folio) { mmap_read_unlock(mm); break; } - folio = page_folio(page); - if (arch_make_folio_accessible(folio)) - send_sig(SIGSEGV, current, 0); + /* arch_make_folio_accessible() needs a raised refcount. */ + folio_get(folio); + rc = arch_make_folio_accessible(folio); folio_put(folio); + folio_walk_end(&fw, vma); + if (rc) + send_sig(SIGSEGV, current, 0); mmap_read_unlock(mm); break; case KERNEL_FAULT: From 7290840de65e04c1fc59dab692468fc9600c6038 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 2 Aug 2024 17:55:23 +0200 Subject: [PATCH 073/416] mm: remove follow_page() All users are gone, let's remove it and any leftovers in comments. We'll leave any FOLL/follow_page_() naming cleanups as future work. 
Link: https://lkml.kernel.org/r/20240802155524.517137-11-david@redhat.com Signed-off-by: David Hildenbrand Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Claudio Imbrenda Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Janosch Frank Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Sven Schnelle Cc: Vasily Gorbik Cc: Ryan Roberts Cc: Zi Yan Signed-off-by: Andrew Morton --- Documentation/mm/transhuge.rst | 6 +++--- include/linux/mm.h | 3 --- mm/filemap.c | 2 +- mm/gup.c | 24 +----------------------- mm/nommu.c | 6 ------ 5 files changed, 5 insertions(+), 36 deletions(-) diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst index 1ba0ad63246c..a2cd8800d527 100644 --- a/Documentation/mm/transhuge.rst +++ b/Documentation/mm/transhuge.rst @@ -31,10 +31,10 @@ Design principles feature that applies to all dynamic high order allocations in the kernel) -get_user_pages and follow_page -============================== +get_user_pages and pin_user_pages +================================= -get_user_pages and follow_page if run on a hugepage, will return the +get_user_pages and pin_user_pages if run on a hugepage, will return the head or tail pages as usual (exactly as they would do on hugetlbfs). Most GUP users will only care about the actual physical address of the page and its temporary pinning to release after the I/O diff --git a/include/linux/mm.h b/include/linux/mm.h index 2f6c08b53e4f..ee8cea73d415 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3527,9 +3527,6 @@ static inline vm_fault_t vmf_fs_error(int err) return VM_FAULT_SIGBUS; } -struct page *follow_page(struct vm_area_struct *vma, unsigned long address, - unsigned int foll_flags); - static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) { if (vm_fault & VM_FAULT_OOM) diff --git a/mm/filemap.c b/mm/filemap.c index d62150418b91..4130be74f6fd 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -112,7 +112,7 @@ * ->swap_lock (try_to_unmap_one) * ->private_lock (try_to_unmap_one) * ->i_pages lock (try_to_unmap_one) - * ->lruvec->lru_lock (follow_page->mark_page_accessed) + * ->lruvec->lru_lock (follow_page_mask->mark_page_accessed) * ->lruvec->lru_lock (check_pte_range->isolate_lru_page) * ->private_lock (folio_remove_rmap_pte->set_page_dirty) * ->i_pages lock (folio_remove_rmap_pte->set_page_dirty) diff --git a/mm/gup.c b/mm/gup.c index 3e8484c893aa..d19884e097fd 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1072,28 +1072,6 @@ static struct page *follow_page_mask(struct vm_area_struct *vma, return page; } -struct page *follow_page(struct vm_area_struct *vma, unsigned long address, - unsigned int foll_flags) -{ - struct follow_page_context ctx = { NULL }; - struct page *page; - - if (vma_is_secretmem(vma)) - return NULL; - - if (WARN_ON_ONCE(foll_flags & FOLL_PIN)) - return NULL; - - /* - * We never set FOLL_HONOR_NUMA_FAULT because callers don't expect - * to fail on PROT_NONE-mapped pages. 
- */ - page = follow_page_mask(vma, address, foll_flags, &ctx); - if (ctx.pgmap) - put_dev_pagemap(ctx.pgmap); - return page; -} - static int get_gate_page(struct mm_struct *mm, unsigned long address, unsigned int gup_flags, struct vm_area_struct **vma, struct page **page) @@ -2519,7 +2497,7 @@ static bool is_valid_gup_args(struct page **pages, int *locked, * These flags not allowed to be specified externally to the gup * interfaces: * - FOLL_TOUCH/FOLL_PIN/FOLL_TRIED/FOLL_FAST_ONLY are internal only - * - FOLL_REMOTE is internal only and used on follow_page() + * - FOLL_REMOTE is internal only, set in (get|pin)_user_pages_remote() * - FOLL_UNLOCKABLE is internal only and used if locked is !NULL */ if (WARN_ON_ONCE(gup_flags & INTERNAL_GUP_FLAGS)) diff --git a/mm/nommu.c b/mm/nommu.c index 40cac1348b40..385b0c15add8 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1578,12 +1578,6 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, return ret; } -struct page *follow_page(struct vm_area_struct *vma, unsigned long address, - unsigned int foll_flags) -{ - return NULL; -} - int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t prot) { From e317a8d8b4f600fc7ec9725e26417030ee594f52 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 2 Aug 2024 17:55:24 +0200 Subject: [PATCH 074/416] mm/ksm: convert break_ksm() from walk_page_range_vma() to folio_walk Let's simplify by reusing folio_walk. Keep the existing behavior by handling migration entries and zeropages. Link: https://lkml.kernel.org/r/20240802155524.517137-12-david@redhat.com Signed-off-by: David Hildenbrand Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Claudio Imbrenda Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Janosch Frank Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Sven Schnelle Cc: Vasily Gorbik Cc: Ryan Roberts Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/ksm.c | 63 ++++++++++++++------------------------------------------ 1 file changed, 16 insertions(+), 47 deletions(-) diff --git a/mm/ksm.c b/mm/ksm.c index 0f5b2bba4ef0..8e53666bc7b0 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -608,47 +608,6 @@ static inline bool ksm_test_exit(struct mm_struct *mm) return atomic_read(&mm->mm_users) == 0; } -static int break_ksm_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, - struct mm_walk *walk) -{ - struct page *page = NULL; - spinlock_t *ptl; - pte_t *pte; - pte_t ptent; - int ret; - - pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); - if (!pte) - return 0; - ptent = ptep_get(pte); - if (pte_present(ptent)) { - page = vm_normal_page(walk->vma, addr, ptent); - } else if (!pte_none(ptent)) { - swp_entry_t entry = pte_to_swp_entry(ptent); - - /* - * As KSM pages remain KSM pages until freed, no need to wait - * here for migration to end. - */ - if (is_migration_entry(entry)) - page = pfn_swap_entry_to_page(entry); - } - /* return 1 if the page is an normal ksm page or KSM-placed zero page */ - ret = (page && PageKsm(page)) || is_ksm_zero_pte(ptent); - pte_unmap_unlock(pte, ptl); - return ret; -} - -static const struct mm_walk_ops break_ksm_ops = { - .pmd_entry = break_ksm_pmd_entry, - .walk_lock = PGWALK_RDLOCK, -}; - -static const struct mm_walk_ops break_ksm_lock_vma_ops = { - .pmd_entry = break_ksm_pmd_entry, - .walk_lock = PGWALK_WRLOCK, -}; - /* * We use break_ksm to break COW on a ksm page by triggering unsharing, * such that the ksm page will get replaced by an exclusive anonymous page. 
@@ -665,16 +624,26 @@ static const struct mm_walk_ops break_ksm_lock_vma_ops = { static int break_ksm(struct vm_area_struct *vma, unsigned long addr, bool lock_vma) { vm_fault_t ret = 0; - const struct mm_walk_ops *ops = lock_vma ? - &break_ksm_lock_vma_ops : &break_ksm_ops; + + if (lock_vma) + vma_start_write(vma); do { - int ksm_page; + bool ksm_page = false; + struct folio_walk fw; + struct folio *folio; cond_resched(); - ksm_page = walk_page_range_vma(vma, addr, addr + 1, ops, NULL); - if (WARN_ON_ONCE(ksm_page < 0)) - return ksm_page; + folio = folio_walk_start(&fw, vma, addr, + FW_MIGRATION | FW_ZEROPAGE); + if (folio) { + /* Small folio implies FW_LEVEL_PTE. */ + if (!folio_test_large(folio) && + (folio_test_ksm(folio) || is_ksm_zero_pte(fw.pte))) + ksm_page = true; + folio_walk_end(&fw, vma); + } + if (!ksm_page) return 0; ret = handle_mm_fault(vma, addr, From a06e79d383cf0ab694d3d012261b9df36e6dfc3e Mon Sep 17 00:00:00 2001 From: Yang Li Date: Fri, 2 Aug 2024 14:02:16 +0800 Subject: [PATCH 075/416] mm: remove duplicated include in vma_internal.h The header files linux/bug.h is included twice in vma_internal.h, so one inclusion of each can be removed. Link: https://lkml.kernel.org/r/20240802060216.24591-1-yang.lee@linux.alibaba.com Signed-off-by: Yang Li Reported-by: Abaci Robot Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9636 Reviewed-by: Lorenzo Stoakes Signed-off-by: Andrew Morton --- mm/vma_internal.h | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/vma_internal.h b/mm/vma_internal.h index 14c24d5cb582..b930ab12a587 100644 --- a/mm/vma_internal.h +++ b/mm/vma_internal.h @@ -12,7 +12,6 @@ #include #include #include -#include #include #include #include From 69b50d4351ed924f29e3d46b159e28f70dfc707f Mon Sep 17 00:00:00 2001 From: David Gow Date: Sat, 3 Aug 2024 15:46:41 +0800 Subject: [PATCH 076/416] mm: only enforce minimum stack gap size if it's sensible The generic mmap_base code tries to leave a gap between the top of the stack and the mmap base address, but enforces a minimum gap size (MIN_GAP) of 128MB, which is too large on some setups. In particular, on arm tasks without ADDR_LIMIT_32BIT, the STACK_TOP value is less than 128MB, so it's impossible to fit such a gap in. Only enforce this minimum if MIN_GAP < MAX_GAP, as we'd prefer to honour MAX_GAP, which is defined proportionally, so scales better and always leaves us with both _some_ stack space and some room for mmap. This fixes the usercopy KUnit test suite on 32-bit arm, as it doesn't set any personality flags so gets the default (in this case 26-bit) task size. 
This test can be run with: ./tools/testing/kunit/kunit.py run --arch arm usercopy --make_options LLVM=1 Link: https://lkml.kernel.org/r/20240803074642.1849623-2-davidgow@google.com Fixes: dba79c3df4a2 ("arm: use generic mmap top-down layout and brk randomization") Signed-off-by: David Gow Reviewed-by: Kees Cook Cc: Alexandre Ghiti Cc: Linus Walleij Cc: Luis Chamberlain Cc: Mark Rutland Cc: Russell King Cc: Signed-off-by: Andrew Morton --- mm/util.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/util.c b/mm/util.c index ac01925a4179..4f1275023eb7 100644 --- a/mm/util.c +++ b/mm/util.c @@ -463,7 +463,7 @@ static unsigned long mmap_base(unsigned long rnd, struct rlimit *rlim_stack) if (gap + pad > gap) gap += pad; - if (gap < MIN_GAP) + if (gap < MIN_GAP && MIN_GAP < MAX_GAP) gap = MIN_GAP; else if (gap > MAX_GAP) gap = MAX_GAP; From e31c38e037621c445bb4393fd77e0a76e6e0899a Mon Sep 17 00:00:00 2001 From: Nhat Pham Date: Mon, 5 Aug 2024 16:22:42 -0700 Subject: [PATCH 077/416] zswap: implement a second chance algorithm for dynamic zswap shrinker Patch series "improving dynamic zswap shrinker protection scheme", v3. When experimenting with the memory-pressure based (i.e "dynamic") zswap shrinker in production, we observed a sharp increase in the number of swapins, which led to performance regression. We were able to trace this regression to the following problems with the shrinker's warm pages protection scheme: 1. The protection decays way too rapidly, and the decaying is coupled with zswap stores, leading to anomalous patterns, in which a small batch of zswap stores effectively erase all the protection in place for the warmer pages in the zswap LRU. This observation has also been corroborated upstream by Takero Funaki (in [1]). 2. We inaccurately track the number of swapped in pages, missing the non-pivot pages that are part of the readahead window, while counting the pages that are found in the zswap pool. To alleviate these two issues, this patch series improve the dynamic zswap shrinker in the following manner: 1. Replace the protection size tracking scheme with a second chance algorithm. This new scheme removes the need for haphazard stats decaying, and automatically adjusts the pace of pages aging with memory pressure, and writeback rate with pool activities: slowing down when the pool is dominated with zswpouts, and speeding up when the pool is dominated with stale entries. 2. Fix the tracking of the number of swapins to take into account non-pivot pages in the readahead window. With these two changes in place, in a kernel-building benchmark without any cold data added, the number of swapins is reduced by 64.12%. This translate to a 10.32% reduction in build time. We also observe a 3% reduction in kernel CPU time. In another benchmark, with cold data added (to gauge the new algorithm's ability to offload cold data), the new second chance scheme outperforms the old protection scheme by around 0.7%, and actually written back around 21% more pages to backing swap device. So the new scheme is just as good, if not even better than the old scheme on this front as well. [1]: https://lore.kernel.org/linux-mm/CAPpodddcGsK=0Xczfuk8usgZ47xeyf4ZjiofdT+ujiyz6V2pFQ@mail.gmail.com/ This patch (of 2): Current zswap shrinker's heuristics to prevent overshrinking is brittle and inaccurate, specifically in the way we decay the protection size (i.e making pages in the zswap LRU eligible for reclaim). We currently decay protection aggressively in zswap_lru_add() calls. 
This leads to the following unfortunate effect: when a new batch of pages enters zswap, the protection size rapidly decays to below 25% of the zswap LRU size, which is way too low. We have observed this effect in production, when experimenting with the zswap shrinker: the rate of shrinking shoots up massively right after a new batch of zswap stores. This is somewhat the opposite of what we want originally - when new pages enter zswap, we want to protect both these new pages AND the pages that are already protected in the zswap LRU. Replace the existing heuristics with a second chance algorithm: 1. When a new zswap entry is stored in the zswap pool, its referenced bit is set. 2. When the zswap shrinker encounters a zswap entry with the referenced bit set, give it a second chance - only flip the referenced bit and rotate it in the LRU. 3. If the shrinker encounters the entry again, this time with its referenced bit unset, then it can reclaim the entry. In this manner, the aging of the pages in the zswap LRUs is decoupled from zswap stores, and picks up the pace with increasing memory pressure (which is what we want). The second chance scheme allows us to modulate the writeback rate based on recent pool activities. Entries that recently entered the pool will be protected, so if the pool is dominated by such entries the writeback rate will reduce proportionally, protecting the workload's workingset. On the other hand, stale entries will be written back quickly, which increases the effective writeback rate. The referenced bit is added in the hole after the `length` field of struct zswap_entry, so there is no extra space overhead for this algorithm. We will still maintain the count of swapins, which is consumed and subtracted from the lru size in zswap_shrinker_count(), to further penalize past overshrinking that led to disk swapins. The idea is that had we considered this many more pages in the LRU active/protected, they would not have been written back and we would not have had to swap them in. To test this new heuristic, I built the kernel under a cgroup with memory.max set to 2G, on a host with 36 cores: With the old shrinker: real: 263.89s user: 4318.11s sys: 673.29s swapins: 227300.5 With the second chance algorithm: real: 244.85s user: 4327.22s sys: 664.39s swapins: 94663 (average over 5 runs) We observe a 1.3% reduction in kernel CPU usage, and around a 7.2% reduction in real time. Note that the number of swapped in pages dropped by 58%. [nphamcs@gmail.com: fix a small mistake in the referenced bit documentation] Link: https://lkml.kernel.org/r/20240806003403.3142387-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20240805232243.2896283-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20240805232243.2896283-2-nphamcs@gmail.com Signed-off-by: Nhat Pham Suggested-by: Johannes Weiner Acked-by: Yosry Ahmed Cc: Chengming Zhou Cc: Shakeel Butt Cc: Takero Funaki Signed-off-by: Andrew Morton --- include/linux/zswap.h | 16 +++---- mm/zswap.c | 108 ++++++++++++++++++++++++------------------ 2 files changed, 70 insertions(+), 54 deletions(-) diff --git a/include/linux/zswap.h b/include/linux/zswap.h index 6cecb4a4f68b..9cd1beef0654 100644 --- a/include/linux/zswap.h +++ b/include/linux/zswap.h @@ -13,17 +13,15 @@ extern atomic_t zswap_stored_pages; struct zswap_lruvec_state { /* - * Number of pages in zswap that should be protected from the shrinker. - * This number is an estimate of the following counts: + * Number of swapped in pages from disk, i.e not found in the zswap pool.
* - * a) Recent page faults. - * b) Recent insertion to the zswap LRU. This includes new zswap stores, - * as well as recent zswap LRU rotations. - * - * These pages are likely to be warm, and might incur IO if the are written - * to swap. + * This is consumed and subtracted from the lru size in + * zswap_shrinker_count() to penalize past overshrinking that led to disk + * swapins. The idea is that had we considered this many more pages in the + * LRU active/protected and not written them back, we would not have had to + * swapped them in. */ - atomic_long_t nr_zswap_protected; + atomic_long_t nr_disk_swapins; }; unsigned long zswap_total_pages(void); diff --git a/mm/zswap.c b/mm/zswap.c index 71b75ff1f3fb..df66ab102d27 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -187,6 +187,10 @@ static struct shrinker *zswap_shrinker; * length - the length in bytes of the compressed page data. Needed during * decompression. For a same value filled page length is 0, and both * pool and lru are invalid and must be ignored. + * referenced - true if the entry recently entered the zswap pool. Unset by the + * writeback logic. The entry is only reclaimed by the writeback + * logic if referenced is unset. See comments in the shrinker + * section for context. * pool - the zswap_pool the entry's data is in * handle - zpool allocation handle that stores the compressed page data * value - value of the same-value filled pages which have same content @@ -196,6 +200,7 @@ static struct shrinker *zswap_shrinker; struct zswap_entry { swp_entry_t swpentry; unsigned int length; + bool referenced; struct zswap_pool *pool; union { unsigned long handle; @@ -700,11 +705,8 @@ static inline int entry_to_nid(struct zswap_entry *entry) static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry) { - atomic_long_t *nr_zswap_protected; - unsigned long lru_size, old, new; int nid = entry_to_nid(entry); struct mem_cgroup *memcg; - struct lruvec *lruvec; /* * Note that it is safe to use rcu_read_lock() here, even in the face of @@ -722,19 +724,6 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry) memcg = mem_cgroup_from_entry(entry); /* will always succeed */ list_lru_add(list_lru, &entry->lru, nid, memcg); - - /* Update the protection area */ - lru_size = list_lru_count_one(list_lru, nid, memcg); - lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid)); - nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected; - old = atomic_long_inc_return(nr_zswap_protected); - /* - * Decay to avoid overflow and adapt to changing workloads. - * This is based on LRU reclaim cost decaying heuristics. - */ - do { - new = old > lru_size / 4 ? 
old / 2 : old; - } while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new)); rcu_read_unlock(); } @@ -752,7 +741,7 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry) void zswap_lruvec_state_init(struct lruvec *lruvec) { - atomic_long_set(&lruvec->zswap_lruvec_state.nr_zswap_protected, 0); + atomic_long_set(&lruvec->zswap_lruvec_state.nr_disk_swapins, 0); } void zswap_folio_swapin(struct folio *folio) @@ -761,7 +750,7 @@ void zswap_folio_swapin(struct folio *folio) if (folio) { lruvec = folio_lruvec(folio); - atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected); + atomic_long_inc(&lruvec->zswap_lruvec_state.nr_disk_swapins); } } @@ -1095,6 +1084,28 @@ static int zswap_writeback_entry(struct zswap_entry *entry, /********************************* * shrinker functions **********************************/ +/* + * The dynamic shrinker is modulated by the following factors: + * + * 1. Each zswap entry has a referenced bit, which the shrinker unsets (giving + * the entry a second chance) before rotating it in the LRU list. If the + * entry is considered again by the shrinker, with its referenced bit unset, + * it is written back. The writeback rate as a result is dynamically + * adjusted by the pool activities - if the pool is dominated by new entries + * (i.e lots of recent zswapouts), these entries will be protected and + * the writeback rate will slow down. On the other hand, if the pool has a + * lot of stagnant entries, these entries will be reclaimed immediately, + * effectively increasing the writeback rate. + * + * 2. Swapins counter: If we observe swapins, it is a sign that we are + * overshrinking and should slow down. We maintain a swapins counter, which + * is consumed and subtract from the number of eligible objects on the LRU + * in zswap_shrinker_count(). + * + * 3. Compression ratio. The better the workload compresses, the less gains we + * can expect from writeback. We scale down the number of objects available + * for reclaim by this ratio. + */ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l, spinlock_t *lock, void *arg) { @@ -1104,6 +1115,16 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o enum lru_status ret = LRU_REMOVED_RETRY; int writeback_result; + /* + * Second chance algorithm: if the entry has its referenced bit set, give it + * a second chance. Only clear the referenced bit and rotate it in the + * zswap's LRU list. + */ + if (entry->referenced) { + entry->referenced = false; + return LRU_ROTATE; + } + /* * As soon as we drop the LRU lock, the entry can be freed by * a concurrent invalidation. This means the following: @@ -1170,8 +1191,7 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc) { - struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid)); - unsigned long shrink_ret, nr_protected, lru_size; + unsigned long shrink_ret; bool encountered_page_in_swapcache = false; if (!zswap_shrinker_enabled || @@ -1180,25 +1200,6 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, return SHRINK_STOP; } - nr_protected = - atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected); - lru_size = list_lru_shrink_count(&zswap_list_lru, sc); - - /* - * Abort if we are shrinking into the protected region. 
- * - * This short-circuiting is necessary because if we have too many multiple - * concurrent reclaimers getting the freeable zswap object counts at the - * same time (before any of them made reasonable progress), the total - * number of reclaimed objects might be more than the number of unprotected - * objects (i.e the reclaimers will reclaim into the protected area of the - * zswap LRU). - */ - if (nr_protected >= lru_size - sc->nr_to_scan) { - sc->nr_scanned = 0; - return SHRINK_STOP; - } - shrink_ret = list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb, &encountered_page_in_swapcache); @@ -1213,7 +1214,10 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker, { struct mem_cgroup *memcg = sc->memcg; struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid)); - unsigned long nr_backing, nr_stored, nr_freeable, nr_protected; + atomic_long_t *nr_disk_swapins = + &lruvec->zswap_lruvec_state.nr_disk_swapins; + unsigned long nr_backing, nr_stored, nr_freeable, nr_disk_swapins_cur, + nr_remain; if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg)) return 0; @@ -1246,14 +1250,27 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker, if (!nr_stored) return 0; - nr_protected = - atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected); nr_freeable = list_lru_shrink_count(&zswap_list_lru, sc); + if (!nr_freeable) + return 0; + /* - * Subtract the lru size by an estimate of the number of pages - * that should be protected. + * Subtract from the lru size the number of pages that are recently swapped + * in from disk. The idea is that had we protect the zswap's LRU by this + * amount of pages, these disk swapins would not have happened. */ - nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0; + nr_disk_swapins_cur = atomic_long_read(nr_disk_swapins); + do { + if (nr_freeable >= nr_disk_swapins_cur) + nr_remain = 0; + else + nr_remain = nr_disk_swapins_cur - nr_freeable; + } while (!atomic_long_try_cmpxchg( + nr_disk_swapins, &nr_disk_swapins_cur, nr_remain)); + + nr_freeable -= nr_disk_swapins_cur - nr_remain; + if (!nr_freeable) + return 0; /* * Scale the number of freeable pages by the memory saving factor. @@ -1506,6 +1523,7 @@ bool zswap_store(struct folio *folio) store_entry: entry->swpentry = swp; entry->objcg = objcg; + entry->referenced = true; old = xa_store(tree, offset, entry, GFP_KERNEL); if (xa_is_err(old)) { From 0e400844724222a67b962b0f12c7964b5e9a236c Mon Sep 17 00:00:00 2001 From: Nhat Pham Date: Mon, 5 Aug 2024 16:22:43 -0700 Subject: [PATCH 078/416] zswap: track swapins from disk more accurately Currently, there are a couple of issues with our disk swapin tracking for dynamic zswap shrinker heuristics: 1. We only increment the swapin counter on pivot pages. This means we are not taking into account pages that also need to be swapped in, but are already taken care of as part of the readahead window. 2. We are also incrementing when the pages are read from the zswap pool, which is inaccurate. This patch rectifies these issues by incrementing the counter whenever we need to perform a non-zswap read. Note that we are slightly overcounting, as a page might be read into memory by the readahead algorithm even though it will not be neeeded by users - however, this is an acceptable inaccuracy, as the readahead logic itself will adapt to these kind of scenarios. 
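As a concrete, illustrative example of how this counter feeds back into the shrinker (the consuming side was added in the previous patch): if zswap_shrinker_count() finds 1000 reclaimable entries on the LRU and 300 disk swapins have been recorded since the last invocation, it reports only 700 freeable objects (before the compression-ratio scaling) and resets the counter to 0; if the recorded swapins exceed the LRU size, it reports 0 freeable objects and carries the remainder over to the next invocation. The numbers here are made up purely for illustration.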
To test this change, I built the kernel under a cgroup with its memory.max set to 2 GB: real: 236.66s user: 4286.06s sys: 652.86s swapins: 81552 For comparison, with just the new second chance algorithm, the build time is as follows: real: 244.85s user: 4327.22s sys: 664.39s swapins: 94663 With neither: real: 263.89s user: 4318.11s sys: 673.29s swapins: 227300.5 (average over 5 runs) With this change, the kernel CPU time reduces by a further 1.7%, and the real time is reduced by another 3.3%, compared to just the second chance algorithm by itself. The swapins count also reduces by another 13.85%. Combining the two changes, we reduce the real time by 10.32%, kernel CPU time by 3%, and number of swapins by 64.12%. To gauge the new scheme's ability to offload cold data, I ran another benchmark, in which the kernel was built under a cgroup with memory.max set to 3 GB, but with 0.5 GB worth of cold data allocated before each build (in a shmem file). Under the old scheme: real: 197.18s user: 4365.08s sys: 289.02s zswpwb: 72115.2 Under the new scheme: real: 195.8s user: 4362.25s sys: 290.14s zswpwb: 87277.8 (average over 5 runs) Notice that we actually observe a 21% increase in the number of written back pages - so the new scheme is just as good, if not better, at offloading pages from the zswap pool when they are cold. Build time reduces by around 0.7% as a result. [nphamcs@gmail.com: squeeze a comment into a single line] Link: https://lkml.kernel.org/r/20240806004518.3183562-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20240805232243.2896283-3-nphamcs@gmail.com Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure") Signed-off-by: Nhat Pham Suggested-by: Johannes Weiner Acked-by: Yosry Ahmed Acked-by: Johannes Weiner Cc: Chengming Zhou Cc: Shakeel Butt Cc: Takero Funaki Signed-off-by: Andrew Morton --- mm/page_io.c | 9 ++++++++- mm/swap_state.c | 8 ++------ 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/mm/page_io.c b/mm/page_io.c index ff8c99ee3af7..aa190e3cb050 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -521,7 +521,13 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug) if (zswap_load(folio)) { folio_unlock(folio); - } else if (data_race(sis->flags & SWP_FS_OPS)) { + goto finish; + } + + /* We have to read from slower devices. Increase zswap protection.
*/ + zswap_folio_swapin(folio); + + if (data_race(sis->flags & SWP_FS_OPS)) { swap_read_folio_fs(folio, plug); } else if (synchronous) { swap_read_folio_bdev_sync(folio, sis); @@ -529,6 +535,7 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug) swap_read_folio_bdev_async(folio, sis); } +finish: if (workingset) { delayacct_thrashing_end(&in_thrashing); psi_memstall_leave(&pflags); diff --git a/mm/swap_state.c b/mm/swap_state.c index 40c681ae9c3c..293ff1afdca4 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -702,10 +702,8 @@ skip: /* The page was likely read above, so no need for plugging here */ folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx, &page_allocated, false); - if (unlikely(page_allocated)) { - zswap_folio_swapin(folio); + if (unlikely(page_allocated)) swap_read_folio(folio, NULL); - } return folio; } @@ -854,10 +852,8 @@ skip: /* The folio was likely read above, so no need for plugging here */ folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx, &page_allocated, false); - if (unlikely(page_allocated)) { - zswap_folio_swapin(folio); + if (unlikely(page_allocated)) swap_read_folio(folio, NULL); - } return folio; } From 17fe833b0de08a78bfb51388e0615969d73ea8ad Mon Sep 17 00:00:00 2001 From: Jann Horn Date: Mon, 5 Aug 2024 14:52:03 +0200 Subject: [PATCH 079/416] mm: fix (harmless) type confusion in lock_vma_under_rcu() There is a (harmless) type confusion in lock_vma_under_rcu(): After vma_start_read(), we have taken the VMA lock but don't know yet whether the VMA has already been detached and scheduled for RCU freeing. At this point, ->vm_start and ->vm_end are accessed. vm_area_struct contains a union such that ->vm_rcu uses the same memory as ->vm_start and ->vm_end; so accessing ->vm_start and ->vm_end of a detached VMA is illegal and leads to type confusion between union members. Fix it by reordering the vma->detached check above the address checks, and document the rules for RCU readers accessing VMAs. This will probably change the number of observed VMA_LOCK_MISS events (since previously, trying to access a detached VMA whose ->vm_rcu has been scheduled would bail out when checking the fault address against the rcu_head members reinterpreted as VMA bounds). Link: https://lkml.kernel.org/r/20240805-fix-vma-lock-type-confusion-v1-1-9f25443a9a71@google.com Fixes: 50ee32537206 ("mm: introduce lock_vma_under_rcu to be used from arch-specific code") Signed-off-by: Jann Horn Acked-by: Suren Baghdasaryan Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- include/linux/mm_types.h | 15 +++++++++++++-- mm/memory.c | 14 ++++++++++---- 2 files changed, 23 insertions(+), 6 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 165c58b12ccc..003619fab20e 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -660,6 +660,9 @@ struct vma_numab_state { * per VM-area/task. A VM area is any part of the process virtual memory * space that has a special rule for the page-fault handlers (ie a shared * library, the executable area etc). + * + * Only explicitly marked struct members may be accessed by RCU readers before + * getting a stable reference. */ struct vm_area_struct { /* The first cache line has the info for VMA tree walking. */ @@ -675,7 +678,11 @@ struct vm_area_struct { #endif }; - struct mm_struct *vm_mm; /* The address space we belong to. */ + /* + * The address space we belong to. + * Unstable RCU readers are allowed to read this. 
+ */ + struct mm_struct *vm_mm; pgprot_t vm_page_prot; /* Access permissions of this VMA. */ /* @@ -688,7 +695,10 @@ struct vm_area_struct { }; #ifdef CONFIG_PER_VMA_LOCK - /* Flag to indicate areas detached from the mm->mm_mt tree */ + /* + * Flag to indicate areas detached from the mm->mm_mt tree. + * Unstable RCU readers are allowed to read this. + */ bool detached; /* @@ -706,6 +716,7 @@ struct vm_area_struct { * slowpath. */ int vm_lock_seq; + /* Unstable RCU readers are allowed to read this. */ struct vma_lock *vm_lock; #endif diff --git a/mm/memory.c b/mm/memory.c index 661168a46b43..46a44dc702fa 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5997,10 +5997,6 @@ retry: if (!vma_start_read(vma)) goto inval; - /* Check since vm_start/vm_end might change before we lock the VMA */ - if (unlikely(address < vma->vm_start || address >= vma->vm_end)) - goto inval_end_read; - /* Check if the VMA got isolated after we found it */ if (vma->detached) { vma_end_read(vma); @@ -6008,6 +6004,16 @@ retry: /* The area was replaced with another one */ goto retry; } + /* + * At this point, we have a stable reference to a VMA: The VMA is + * locked and we know it hasn't already been isolated. + * From here on, we can access the VMA without worrying about which + * fields are accessible for RCU readers. + */ + + /* Check since vm_start/vm_end might change before we lock the VMA */ + if (unlikely(address < vma->vm_start || address >= vma->vm_end)) + goto inval_end_read; rcu_read_unlock(); return vma; From cc0a0f98553528791ae33ba5ee8c118b52ae2028 Mon Sep 17 00:00:00 2001 From: Marco Elver Date: Mon, 5 Aug 2024 14:39:39 +0200 Subject: [PATCH 080/416] kfence: introduce burst mode Introduce burst mode, which can be configured with kfence.burst=$count, where the burst count denotes the additional successive slab allocations to be allocated through KFENCE for each sample interval. The idea is that this can give developers an additional knob to make KFENCE more aggressive when debugging specific issues of systems where either rebooting or recompiling the kernel with KASAN is not possible. Experiment: To assess the effectiveness of the new option, we randomly picked a recent out-of-bounds [1] and use-after-free bug [2], each with a reproducer provided by syzbot, that initially detected these bugs with KASAN. We then tried to reproduce the bugs with KFENCE below. [1] Fixed by: 7c55b78818cf ("jfs: xattr: fix buffer overflow for invalid xattr") https://syzkaller.appspot.com/bug?id=9d1b59d4718239da6f6069d3891863c25f9f24a2 [2] Fixed by: f8ad00f3fb2a ("l2tp: fix possible UAF when cleaning up tunnels") https://syzkaller.appspot.com/bug?id=4f34adc84f4a3b080187c390eeef60611fd450e1 The following KFENCE configs were compared. A pool size of 1023 objects was used for all configurations. Baseline kfence.sample_interval=100 kfence.skip_covered_thresh=75 kfence.burst=0 Aggressive kfence.sample_interval=1 kfence.skip_covered_thresh=10 kfence.burst=0 AggressiveBurst kfence.sample_interval=1 kfence.skip_covered_thresh=10 kfence.burst=1000 Each reproducer was run 10 times (after a fresh reboot), with the following detection counts for each KFENCE config: | Detection Count out of 10 | | OOB [1] | UAF [2] | ------------------+-------------+-------------+ Default | 0/10 | 0/10 | Aggressive | 0/10 | 0/10 | AggressiveBurst | 8/10 | 8/10 | With the Default and even the Aggressive configs the results are unsurprising, given KFENCE has not been designed for deterministic bug detection of small test cases. 
However, when enabling burst mode with relatively large burst count, KFENCE can start to detect heap memory-safety bugs even in simpler test cases with high probability (in the above cases with ~80% probability). Link: https://lkml.kernel.org/r/20240805124203.2692278-1-elver@google.com Signed-off-by: Marco Elver Reviewed-by: Alexander Potapenko Cc: Andrey Konovalov Cc: Dmitry Vyukov Cc: Jann Horn Signed-off-by: Andrew Morton --- Documentation/dev-tools/kfence.rst | 7 +++++++ include/linux/kfence.h | 2 +- mm/kfence/core.c | 14 ++++++++++---- 3 files changed, 18 insertions(+), 5 deletions(-) diff --git a/Documentation/dev-tools/kfence.rst b/Documentation/dev-tools/kfence.rst index 936f6aaa75c8..541899353865 100644 --- a/Documentation/dev-tools/kfence.rst +++ b/Documentation/dev-tools/kfence.rst @@ -53,6 +53,13 @@ configurable via the Kconfig option ``CONFIG_KFENCE_DEFERRABLE``. The KUnit test suite is very likely to fail when using a deferrable timer since it currently causes very unpredictable sample intervals. +By default KFENCE will only sample 1 heap allocation within each sample +interval. *Burst mode* allows to sample successive heap allocations, where the +kernel boot parameter ``kfence.burst`` can be set to a non-zero value which +denotes the *additional* successive allocations within a sample interval; +setting ``kfence.burst=N`` means that ``1 + N`` successive allocations are +attempted through KFENCE for each sample interval. + The KFENCE memory pool is of fixed size, and if the pool is exhausted, no further KFENCE allocations occur. With ``CONFIG_KFENCE_NUM_OBJECTS`` (default 255), the number of available guarded objects can be controlled. Each object diff --git a/include/linux/kfence.h b/include/linux/kfence.h index 88100cc9caba..0ad1ddbb8b99 100644 --- a/include/linux/kfence.h +++ b/include/linux/kfence.h @@ -124,7 +124,7 @@ static __always_inline void *kfence_alloc(struct kmem_cache *s, size_t size, gfp if (!static_branch_likely(&kfence_allocation_key)) return NULL; #endif - if (likely(atomic_read(&kfence_allocation_gate))) + if (likely(atomic_read(&kfence_allocation_gate) > 0)) return NULL; return __kfence_alloc(s, size, flags); } diff --git a/mm/kfence/core.c b/mm/kfence/core.c index c5cb54fc696d..c3ef7eb8d4dc 100644 --- a/mm/kfence/core.c +++ b/mm/kfence/core.c @@ -99,6 +99,10 @@ module_param_cb(sample_interval, &sample_interval_param_ops, &kfence_sample_inte static unsigned long kfence_skip_covered_thresh __read_mostly = 75; module_param_named(skip_covered_thresh, kfence_skip_covered_thresh, ulong, 0644); +/* Allocation burst count: number of excess KFENCE allocations per sample. */ +static unsigned int kfence_burst __read_mostly; +module_param_named(burst, kfence_burst, uint, 0644); + /* If true, use a deferrable timer. */ static bool kfence_deferrable __read_mostly = IS_ENABLED(CONFIG_KFENCE_DEFERRABLE); module_param_named(deferrable, kfence_deferrable, bool, 0444); @@ -827,12 +831,12 @@ static void toggle_allocation_gate(struct work_struct *work) if (!READ_ONCE(kfence_enabled)) return; - atomic_set(&kfence_allocation_gate, 0); + atomic_set(&kfence_allocation_gate, -kfence_burst); #ifdef CONFIG_KFENCE_STATIC_KEYS /* Enable static key, and await allocation to happen. */ static_branch_enable(&kfence_allocation_key); - wait_event_idle(allocation_wait, atomic_read(&kfence_allocation_gate)); + wait_event_idle(allocation_wait, atomic_read(&kfence_allocation_gate) > 0); /* Disable static key and reset timer. 
*/ static_branch_disable(&kfence_allocation_key); @@ -1052,6 +1056,7 @@ void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags) unsigned long stack_entries[KFENCE_STACK_DEPTH]; size_t num_stack_entries; u32 alloc_stack_hash; + int allocation_gate; /* * Perform size check before switching kfence_allocation_gate, so that @@ -1080,14 +1085,15 @@ void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags) if (s->flags & SLAB_SKIP_KFENCE) return NULL; - if (atomic_inc_return(&kfence_allocation_gate) > 1) + allocation_gate = atomic_inc_return(&kfence_allocation_gate); + if (allocation_gate > 1) return NULL; #ifdef CONFIG_KFENCE_STATIC_KEYS /* * waitqueue_active() is fully ordered after the update of * kfence_allocation_gate per atomic_inc_return(). */ - if (waitqueue_active(&allocation_wait)) { + if (allocation_gate == 1 && waitqueue_active(&allocation_wait)) { /* * Calling wake_up() here may deadlock when allocations happen * from within timer code. Use an irq_work to defer it. From 67203f3f2a63d429272f0c80451e5fcc469fdb46 Mon Sep 17 00:00:00 2001 From: Pedro Falcato Date: Wed, 7 Aug 2024 18:33:36 +0100 Subject: [PATCH 081/416] selftests/mm: add mseal test for no-discard madvise Add an mseal test for madvise() operations that aren't considered "discard" (e.g purely advisory ops such as MADV_RANDOM). [pedro.falcato@gmail.com: adjust the mseal test's plan] Link: https://lkml.kernel.org/r/20240807203724.2686144-1-pedro.falcato@gmail.com Link: https://lkml.kernel.org/r/20240807173336.2523757-3-pedro.falcato@gmail.com Signed-off-by: Pedro Falcato Tested-by: Jeff Xu Reviewed-by: Jeff Xu Cc: Kees Cook Cc: Liam R. Howlett Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/mseal_test.c | 36 ++++++++++++++++++++++++- 1 file changed, 35 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c index a818f010de47..7eec3f0152e3 100644 --- a/tools/testing/selftests/mm/mseal_test.c +++ b/tools/testing/selftests/mm/mseal_test.c @@ -1731,6 +1731,38 @@ static void test_seal_discard_ro_anon(bool seal) REPORT_TEST_PASS(); } +static void test_seal_madvise_nodiscard(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + FAIL_TEST_IF_FALSE(ptr != (void *)-1); + + if (seal) { + ret = seal_single_address(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* + * Test a random madvise flag like MADV_RANDOM that does not touch page + * contents (and thus should work for msealed VMAs). RANDOM also happens to + * share bits with other discard-ish flags like REMOVE. 
+ */ + ret = sys_madvise(ptr, size, MADV_RANDOM); + FAIL_TEST_IF_FALSE(!ret); + + ret = sys_munmap(ptr, size); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + REPORT_TEST_PASS(); +} + int main(int argc, char **argv) { bool test_seal = seal_support(); @@ -1743,7 +1775,7 @@ int main(int argc, char **argv) if (!pkey_supported()) ksft_print_msg("PKEY not supported\n"); - ksft_set_plan(80); + ksft_set_plan(82); test_seal_addseal(); test_seal_unmapped_start(); @@ -1822,6 +1854,8 @@ int main(int argc, char **argv) test_seal_mremap_move_fixed_zero(true); test_seal_mremap_move_dontunmap_anyaddr(false); test_seal_mremap_move_dontunmap_anyaddr(true); + test_seal_madvise_nodiscard(false); + test_seal_madvise_nodiscard(true); test_seal_discard_ro_anon(false); test_seal_discard_ro_anon(true); test_seal_discard_ro_anon_on_rw(false); From 43c9074e6f093d304d55c43638732c402be75e2b Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 7 Aug 2024 13:55:15 +0200 Subject: [PATCH 082/416] mm/rmap: minimize folio->_nr_pages_mapped updates when batching PTE (un)mapping It is not immediately obvious, but we can move the folio->_nr_pages_mapped update out of the loop and reduce the number of atomic ops without affecting the stats. The important point to realize is that only removing the last PMD mapping will result in _nr_pages_mapped going below ENTIRELY_MAPPED, not the individual atomic_inc_return_relaxed() calls. Concurrent races with removal of PMD mappings should be handled as expected, just like when we would have such races right now on a single mapcount update. In a simple munmap() microbenchmark [1] on 1 GiB of memory backed by the same PTE-mapped folio size (only mapped by a single process such that they will get completely unmapped), this change results in a speedup (positive is good) per folio size on a x86-64 Intel machine of roughly (a bit of noise expected): * 16 KiB: +10% * 32 KiB: +15% * 64 KiB: +17% * 128 KiB: +21% * 256 KiB: +22% * 512 KiB: +22% * 1024 KiB: +23% * 2048 KiB: +27% [1] https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/pte-mapped-folio-benchmarks.c Link: https://lkml.kernel.org/r/20240807115515.1640951-1-david@redhat.com Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton --- mm/rmap.c | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 901950200957..a6b9cd0b2b18 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1158,7 +1158,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio, { atomic_t *mapped = &folio->_nr_pages_mapped; const int orig_nr_pages = nr_pages; - int first, nr = 0; + int first = 0, nr = 0; __folio_rmap_sanity_checks(folio, page, nr_pages, level); @@ -1170,13 +1170,13 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio, } do { - first = atomic_inc_and_test(&page->_mapcount); - if (first) { - first = atomic_inc_return_relaxed(mapped); - if (first < ENTIRELY_MAPPED) - nr++; - } + first += atomic_inc_and_test(&page->_mapcount); } while (page++, --nr_pages > 0); + + if (first && + atomic_add_return_relaxed(first, mapped) < ENTIRELY_MAPPED) + nr = first; + atomic_add(orig_nr_pages, &folio->_large_mapcount); break; case RMAP_LEVEL_PMD: @@ -1527,7 +1527,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, enum rmap_level level) { atomic_t *mapped = &folio->_nr_pages_mapped; - int last, nr = 0, nr_pmdmapped = 0; + int last = 0, nr = 0, nr_pmdmapped = 0; bool partially_mapped = false; 
__folio_rmap_sanity_checks(folio, page, nr_pages, level); @@ -1541,14 +1541,13 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, atomic_sub(nr_pages, &folio->_large_mapcount); do { - last = atomic_add_negative(-1, &page->_mapcount); - if (last) { - last = atomic_dec_return_relaxed(mapped); - if (last < ENTIRELY_MAPPED) - nr++; - } + last += atomic_add_negative(-1, &page->_mapcount); } while (page++, --nr_pages > 0); + if (last && + atomic_sub_return_relaxed(last, mapped) < ENTIRELY_MAPPED) + nr = last; + partially_mapped = nr && atomic_read(mapped); break; case RMAP_LEVEL_PMD: From 47baed6a132f611228113851a70c73f6e6478d87 Mon Sep 17 00:00:00 2001 From: Jianhui Zhou <912460177@qq.com> Date: Wed, 7 Aug 2024 15:44:48 +0800 Subject: [PATCH 083/416] percpu: remove pcpu_alloc_size() pcpu_alloc_size() was added in 7ac5c53e0073 "mm/percpu.c: introduce pcpu_alloc_size()", which is used to get the allocated memory size in bpf. However, pcpu_alloc_size() is no longer used in "bpf: Use c->unit_size to select target cache during free" because its actuall allocated memory size may change at runtime due to its slab merging mechanism. Therefore, pcpu_alloc_size() can be removed. Link: https://lkml.kernel.org/r/tencent_AD5C50E8D78C07A3CE539BD5F6BF39706507@qq.com Signed-off-by: Jianhui Zhou <912460177@qq.com> Cc: Christoph Lameter Cc: Dennis Zhou Cc: JonasZhou Cc: Tejun Heo Signed-off-by: Andrew Morton --- include/linux/percpu.h | 1 - mm/percpu.c | 31 ------------------------------- 2 files changed, 32 deletions(-) diff --git a/include/linux/percpu.h b/include/linux/percpu.h index 4b2047b78b67..b6321fc49159 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -135,7 +135,6 @@ extern void __init setup_per_cpu_areas(void); extern void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved, gfp_t gfp) __alloc_size(1); -extern size_t pcpu_alloc_size(void __percpu *__pdata); #define __alloc_percpu_gfp(_size, _align, _gfp) \ alloc_hooks(pcpu_alloc_noprof(_size, _align, false, _gfp)) diff --git a/mm/percpu.c b/mm/percpu.c index 20d91af8c033..da21680ff294 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -2216,37 +2216,6 @@ static void pcpu_balance_workfn(struct work_struct *work) mutex_unlock(&pcpu_alloc_mutex); } -/** - * pcpu_alloc_size - the size of the dynamic percpu area - * @ptr: pointer to the dynamic percpu area - * - * Returns the size of the @ptr allocation. This is undefined for statically - * defined percpu variables as there is no corresponding chunk->bound_map. - * - * RETURNS: - * The size of the dynamic percpu area. - * - * CONTEXT: - * Can be called from atomic context. - */ -size_t pcpu_alloc_size(void __percpu *ptr) -{ - struct pcpu_chunk *chunk; - unsigned long bit_off, end; - void *addr; - - if (!ptr) - return 0; - - addr = __pcpu_ptr_to_addr(ptr); - /* No pcpu_lock here: ptr has not been freed, so chunk is still alive */ - chunk = pcpu_chunk_addr_search(addr); - bit_off = (addr - chunk->base_addr) / PCPU_MIN_ALLOC_SIZE; - end = find_next_bit(chunk->bound_map, pcpu_chunk_map_bits(chunk), - bit_off + 1); - return (end - bit_off) * PCPU_MIN_ALLOC_SIZE; -} - /** * free_percpu - free percpu area * @ptr: pointer to area to free From 62e73fd85d7bf63f1dde2bbcc464fe67970f326f Mon Sep 17 00:00:00 2001 From: "qiwu.chen" Date: Wed, 7 Aug 2024 10:56:27 +0800 Subject: [PATCH 084/416] mm: kfence: print the elapsed time for allocated/freed track Print the elapsed time for the allocated or freed track, which can be useful in some debugging scenarios. 
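With this change, the allocation/free header of a KFENCE report looks along the lines of (the values below are made up for illustration):

	allocated by task 1234 on cpu 2 at 53.342539s (12.880429s ago):

where the value in parentheses is the newly printed elapsed time.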
Link: https://lkml.kernel.org/r/20240807025627.37419-1-qiwu.chen@transsion.com Signed-off-by: qiwu.chen Reviewed-by: Marco Elver Cc: chenqiwu Cc: Dmitry Vyukov Signed-off-by: Andrew Morton --- mm/kfence/report.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/mm/kfence/report.c b/mm/kfence/report.c index c509aed326ce..73a6fe42845a 100644 --- a/mm/kfence/report.c +++ b/mm/kfence/report.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include @@ -108,11 +109,14 @@ static void kfence_print_stack(struct seq_file *seq, const struct kfence_metadat const struct kfence_track *track = show_alloc ? &meta->alloc_track : &meta->free_track; u64 ts_sec = track->ts_nsec; unsigned long rem_nsec = do_div(ts_sec, NSEC_PER_SEC); + u64 interval_nsec = local_clock() - meta->alloc_track.ts_nsec; + unsigned long rem_interval_nsec = do_div(interval_nsec, NSEC_PER_SEC); /* Timestamp matches printk timestamp format. */ - seq_con_printf(seq, "%s by task %d on cpu %d at %lu.%06lus:\n", + seq_con_printf(seq, "%s by task %d on cpu %d at %lu.%06lus (%lu.%06lus ago):\n", show_alloc ? "allocated" : "freed", track->pid, - track->cpu, (unsigned long)ts_sec, rem_nsec / 1000); + track->cpu, (unsigned long)ts_sec, rem_nsec / 1000, + (unsigned long)interval_nsec, rem_interval_nsec / 1000); if (track->num_stack_entries) { /* Skip allocation/free internals stack. */ From 420e05d0de18df907ba1c5c077af69e6ae469c9a Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 7 Aug 2024 20:35:25 +0100 Subject: [PATCH 085/416] fs: remove calls to set and clear the folio error flag Nobody checks the folio error flag any more, so we can stop setting and clearing it. Also remove the documentation suggesting to not bother setting the error bit. Link: https://lkml.kernel.org/r/20240807193528.1865100-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- Documentation/filesystems/vfs.rst | 3 +-- mm/filemap.c | 8 -------- mm/migrate.c | 2 -- mm/page_io.c | 4 +--- 4 files changed, 2 insertions(+), 15 deletions(-) diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index 6e903a903f8f..a6022ec59a2d 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -913,8 +913,7 @@ cache in your filesystem. The following members are defined: stop attempting I/O, it can simply return. The caller will remove the remaining pages from the address space, unlock them and decrement the page refcount. Set PageUptodate if the I/O - completes successfully. Setting PageError on any page will be - ignored; simply unlock the page if an I/O error occurs. + completes successfully. ``write_begin`` Called by the generic buffered write code to ask the filesystem diff --git a/mm/filemap.c b/mm/filemap.c index 4130be74f6fd..7665f2398d45 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -530,7 +530,6 @@ static void __filemap_fdatawait_range(struct address_space *mapping, struct folio *folio = fbatch.folios[i]; folio_wait_writeback(folio); - folio_clear_error(folio); } folio_batch_release(&fbatch); cond_resched(); @@ -2342,13 +2341,6 @@ static int filemap_read_folio(struct file *file, filler_t filler, unsigned long pflags; int error; - /* - * A previous I/O error may have been due to temporary failures, - * eg. multipath errors. PG_error will be set again if read_folio - * fails. - */ - folio_clear_error(folio); - /* Start the actual read. The read will unlock the page. 
*/ if (unlikely(workingset)) psi_memstall_enter(&pflags); diff --git a/mm/migrate.c b/mm/migrate.c index e1383d9cc944..66a5f73ebfdf 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -586,8 +586,6 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio) { int cpupid; - if (folio_test_error(folio)) - folio_set_error(newfolio); if (folio_test_referenced(folio)) folio_set_referenced(newfolio); if (folio_test_uptodate(folio)) diff --git a/mm/page_io.c b/mm/page_io.c index aa190e3cb050..a00e2f615118 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -273,9 +273,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret) * memory for allocating transmit buffers. * Mark the page dirty and avoid * folio_rotate_reclaimable but rate-limit the - * messages but do not flag PageError like - * the normal direct-to-bio case as it could - * be temporary. + * messages. */ pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n", ret, swap_dev_pos(page_swap_entry(page))); From 09022bc196d23484a7a5d48cf373f8583e3fcf23 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 7 Aug 2024 20:35:26 +0100 Subject: [PATCH 086/416] mm: remove PG_error The PG_error bit is now unused; delete it and free up a bit in page->flags. Link: https://lkml.kernel.org/r/20240807193528.1865100-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- fs/proc/page.c | 1 - include/linux/page-flags.h | 6 +----- include/trace/events/mmflags.h | 1 - include/uapi/linux/kernel-page-flags.h | 2 +- 4 files changed, 2 insertions(+), 8 deletions(-) diff --git a/fs/proc/page.c b/fs/proc/page.c index b7a5c84b5819..73a0f872d97f 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -182,7 +182,6 @@ u64 stable_page_flags(const struct page *page) #endif u |= kpf_copy_bit(k, KPF_LOCKED, PG_locked); - u |= kpf_copy_bit(k, KPF_ERROR, PG_error); u |= kpf_copy_bit(k, KPF_DIRTY, PG_dirty); u |= kpf_copy_bit(k, KPF_UPTODATE, PG_uptodate); u |= kpf_copy_bit(k, KPF_WRITEBACK, PG_writeback); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 5769fe6e4950..a0a29bd092f8 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -66,8 +66,6 @@ * PG_referenced, PG_reclaim are used for page reclaim for anonymous and * file-backed pagecache (see mm/vmscan.c). * - * PG_error is set to indicate that an I/O error occurred on this page. - * * PG_arch_1 is an architecture specific page state bit. The generic code * guarantees that this bit is cleared for a page when it first is entered into * the page cache. @@ -103,7 +101,6 @@ enum pageflags { PG_waiters, /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */ PG_active, PG_workingset, - PG_error, PG_owner_priv_1, /* Owner use. 
If pagecache, fs may use*/ PG_arch_1, PG_reserved, @@ -183,7 +180,7 @@ enum pageflags { */ /* At least one page in this folio has the hwpoison flag set */ - PG_has_hwpoisoned = PG_error, + PG_has_hwpoisoned = PG_active, PG_large_rmappable = PG_workingset, /* anon or file-backed */ }; @@ -506,7 +503,6 @@ static inline int TestClearPage##uname(struct page *page) { return 0; } __PAGEFLAG(Locked, locked, PF_NO_TAIL) FOLIO_FLAG(waiters, FOLIO_HEAD_PAGE) -PAGEFLAG(Error, error, PF_NO_TAIL) TESTCLEARFLAG(Error, error, PF_NO_TAIL) FOLIO_FLAG(referenced, FOLIO_HEAD_PAGE) FOLIO_TEST_CLEAR_FLAG(referenced, FOLIO_HEAD_PAGE) __FOLIO_SET_FLAG(referenced, FOLIO_HEAD_PAGE) diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index b63d211bd141..b5c4da370a50 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -100,7 +100,6 @@ #define __def_pageflag_names \ DEF_PAGEFLAG_NAME(locked), \ DEF_PAGEFLAG_NAME(waiters), \ - DEF_PAGEFLAG_NAME(error), \ DEF_PAGEFLAG_NAME(referenced), \ DEF_PAGEFLAG_NAME(uptodate), \ DEF_PAGEFLAG_NAME(dirty), \ diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h index 6f2f2720f3ac..ff8032227876 100644 --- a/include/uapi/linux/kernel-page-flags.h +++ b/include/uapi/linux/kernel-page-flags.h @@ -7,7 +7,7 @@ */ #define KPF_LOCKED 0 -#define KPF_ERROR 1 +#define KPF_ERROR 1 /* Now unused */ #define KPF_REFERENCED 2 #define KPF_UPTODATE 3 #define KPF_DIRTY 4 From 94dc8bffd8b7fe83ba8382a3410a2f218dc20cb0 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 7 Aug 2024 20:37:32 +0100 Subject: [PATCH 087/416] mm: return the folio from swapin_readahead The unuse_pte_range() caller only wants the folio while do_swap_page() wants both the page and the folio. Since do_swap_page() already has logic for handling both the folio and the page, move the folio-to-page logic there. This also lets us allocate larger folios in the SWP_SYNCHRONOUS_IO path in future. 
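As a rough sketch of the new calling convention (condensed from the do_swap_page() and unuse_pte_range() hunks below, with error handling elided), a caller that still needs the precise subpage derives it from the returned folio:

	struct folio *folio;
	struct page *page;

	/* swapin_readahead() now hands back the folio itself, or NULL. */
	folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
	if (folio)
		/* Only do_swap_page() still needs the subpage. */
		page = folio_file_page(folio, swp_offset(entry));

unuse_pte_range(), which only ever wanted the folio, simply keeps the return value.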
Link: https://lkml.kernel.org/r/20240807193734.1865400-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- mm/memory.c | 6 ++---- mm/swap.h | 6 +++--- mm/swap_state.c | 8 +++----- mm/swapfile.c | 5 +---- 4 files changed, 9 insertions(+), 16 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 46a44dc702fa..2ca87ceafede 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4091,7 +4091,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) /* skip swapcache */ folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); - page = &folio->page; if (folio) { __folio_set_locked(folio); __folio_set_swapbacked(folio); @@ -4116,10 +4115,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio->private = NULL; } } else { - page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, + folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf); - if (page) - folio = page_folio(page); swapcache = folio; } @@ -4140,6 +4137,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) ret = VM_FAULT_MAJOR; count_vm_event(PGMAJFAULT); count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); + page = folio_file_page(folio, swp_offset(entry)); } else if (PageHWPoison(page)) { /* * hwpoisoned dirty swapcache pages are kept for killing diff --git a/mm/swap.h b/mm/swap.h index 7c6330561d84..f8711ff82f84 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -73,8 +73,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags, bool skip_if_exists); struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, struct mempolicy *mpol, pgoff_t ilx); -struct page *swapin_readahead(swp_entry_t entry, gfp_t flag, - struct vm_fault *vmf); +struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag, + struct vm_fault *vmf); static inline unsigned int folio_swap_flags(struct folio *folio) { @@ -109,7 +109,7 @@ static inline struct folio *swap_cluster_readahead(swp_entry_t entry, return NULL; } -static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask, +static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask, struct vm_fault *vmf) { return NULL; diff --git a/mm/swap_state.c b/mm/swap_state.c index 293ff1afdca4..a042720554a7 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -863,13 +863,13 @@ skip: * @gfp_mask: memory allocation flags * @vmf: fault information * - * Returns the struct page for entry and addr, after queueing swapin. + * Returns the struct folio for entry and addr, after queueing swapin. * * It's a main entry function for swap readahead. By the configuration, * it will read ahead blocks by cluster-based(ie, physical disk based) * or vma-based(ie, virtual address based on faulty address) readahead. 
*/ -struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask, +struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask, struct vm_fault *vmf) { struct mempolicy *mpol; @@ -882,9 +882,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask, swap_cluster_readahead(entry, gfp_mask, mpol, ilx); mpol_cond_put(mpol); - if (!folio) - return NULL; - return folio_file_page(folio, swp_offset(entry)); + return folio; } #ifdef CONFIG_SYSFS diff --git a/mm/swapfile.c b/mm/swapfile.c index 0259f36de715..02bb4986a2c2 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1980,7 +1980,6 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd, folio = swap_cache_get_folio(entry, vma, addr); if (!folio) { - struct page *page; struct vm_fault vmf = { .vma = vma, .address = addr, @@ -1988,10 +1987,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd, .pmd = pmd, }; - page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, + folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf); - if (page) - folio = page_folio(page); } if (!folio) { swp_count = READ_ONCE(si->swap_map[offset]); From 072cd213b75eb01fcf40eff898f8d5c008ce1457 Mon Sep 17 00:00:00 2001 From: Jeff Xu Date: Wed, 7 Aug 2024 21:23:20 +0000 Subject: [PATCH 088/416] selftest mm/mseal: fix test_seal_mremap_move_dontunmap_anyaddr the syscall remap accepts following: mremap(src, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, dst) when the src is sealed, the call will fail with error code: EPERM Previously, the test uses hard-coded 0xdeaddead as dst, and it will fail on the system with newer glibc installed. This patch removes test's dependency on glibc for mremap(), also fix the test and remove the hardcoded address. Link: https://lkml.kernel.org/r/20240807212320.2831848-1-jeffxu@chromium.org Fixes: 4926c7a52de7 ("selftest mm/mseal memory sealing") Signed-off-by: Jeff Xu Reported-by: Pedro Falcato Cc: Dave Hansen Cc: Liam R. Howlett Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Michael Ellerman Cc: Vlastimil Babka Cc: Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/mseal_test.c | 57 ++++++++++++++++--------- 1 file changed, 36 insertions(+), 21 deletions(-) diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c index 7eec3f0152e3..bd0bfda7aae7 100644 --- a/tools/testing/selftests/mm/mseal_test.c +++ b/tools/testing/selftests/mm/mseal_test.c @@ -110,6 +110,16 @@ static int sys_madvise(void *start, size_t len, int types) return sret; } +static void *sys_mremap(void *addr, size_t old_len, size_t new_len, + unsigned long flags, void *new_addr) +{ + void *sret; + + errno = 0; + sret = (void *) syscall(__NR_mremap, addr, old_len, new_len, flags, new_addr); + return sret; +} + static int sys_pkey_alloc(unsigned long flags, unsigned long init_val) { int ret = syscall(__NR_pkey_alloc, flags, init_val); @@ -1115,12 +1125,12 @@ static void test_seal_mremap_shrink(bool seal) } /* shrink from 4 pages to 2 pages. */ - ret2 = mremap(ptr, size, 2 * page_size, 0, 0); + ret2 = sys_mremap(ptr, size, 2 * page_size, 0, 0); if (seal) { - FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); + FAIL_TEST_IF_FALSE(ret2 == (void *) MAP_FAILED); FAIL_TEST_IF_FALSE(errno == EPERM); } else { - FAIL_TEST_IF_FALSE(ret2 != MAP_FAILED); + FAIL_TEST_IF_FALSE(ret2 != (void *) MAP_FAILED); } @@ -1147,7 +1157,7 @@ static void test_seal_mremap_expand(bool seal) } /* expand from 2 page to 4 pages. 
*/ - ret2 = mremap(ptr, 2 * page_size, 4 * page_size, 0, 0); + ret2 = sys_mremap(ptr, 2 * page_size, 4 * page_size, 0, 0); if (seal) { FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); FAIL_TEST_IF_FALSE(errno == EPERM); @@ -1180,7 +1190,7 @@ static void test_seal_mremap_move(bool seal) } /* move from ptr to fixed address. */ - ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, newPtr); + ret2 = sys_mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, newPtr); if (seal) { FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); FAIL_TEST_IF_FALSE(errno == EPERM); @@ -1299,7 +1309,7 @@ static void test_seal_mremap_shrink_fixed(bool seal) } /* mremap to move and shrink to fixed address */ - ret2 = mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED, + ret2 = sys_mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED, newAddr); if (seal) { FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); @@ -1330,7 +1340,7 @@ static void test_seal_mremap_expand_fixed(bool seal) } /* mremap to move and expand to fixed address */ - ret2 = mremap(ptr, page_size, size, MREMAP_MAYMOVE | MREMAP_FIXED, + ret2 = sys_mremap(ptr, page_size, size, MREMAP_MAYMOVE | MREMAP_FIXED, newAddr); if (seal) { FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); @@ -1361,7 +1371,7 @@ static void test_seal_mremap_move_fixed(bool seal) } /* mremap to move to fixed address */ - ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, newAddr); + ret2 = sys_mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, newAddr); if (seal) { FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); FAIL_TEST_IF_FALSE(errno == EPERM); @@ -1390,14 +1400,13 @@ static void test_seal_mremap_move_fixed_zero(bool seal) /* * MREMAP_FIXED can move the mapping to zero address */ - ret2 = mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED, + ret2 = sys_mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED, 0); if (seal) { FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); FAIL_TEST_IF_FALSE(errno == EPERM); } else { FAIL_TEST_IF_FALSE(ret2 == 0); - } REPORT_TEST_PASS(); @@ -1420,13 +1429,13 @@ static void test_seal_mremap_move_dontunmap(bool seal) } /* mremap to move, and don't unmap src addr. */ - ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, 0); + ret2 = sys_mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, 0); if (seal) { FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); FAIL_TEST_IF_FALSE(errno == EPERM); } else { + /* kernel will allocate a new address */ FAIL_TEST_IF_FALSE(ret2 != MAP_FAILED); - } REPORT_TEST_PASS(); @@ -1434,7 +1443,7 @@ static void test_seal_mremap_move_dontunmap(bool seal) static void test_seal_mremap_move_dontunmap_anyaddr(bool seal) { - void *ptr; + void *ptr, *ptr2; unsigned long page_size = getpagesize(); unsigned long size = 4 * page_size; int ret; @@ -1449,24 +1458,30 @@ static void test_seal_mremap_move_dontunmap_anyaddr(bool seal) } /* - * The 0xdeaddead should not have effect on dest addr - * when MREMAP_DONTUNMAP is set. + * The new address is any address that not allocated. + * use allocate/free to similate that. */ - ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, - 0xdeaddead); + setup_single_address(size, &ptr2); + FAIL_TEST_IF_FALSE(ptr2 != (void *)-1); + ret = sys_munmap(ptr2, size); + FAIL_TEST_IF_FALSE(!ret); + + /* + * remap to any address. 
+ */ + ret2 = sys_mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, + (void *) ptr2); if (seal) { FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); FAIL_TEST_IF_FALSE(errno == EPERM); } else { - FAIL_TEST_IF_FALSE(ret2 != MAP_FAILED); - FAIL_TEST_IF_FALSE((long)ret2 != 0xdeaddead); - + /* remap success and return ptr2 */ + FAIL_TEST_IF_FALSE(ret2 == ptr2); } REPORT_TEST_PASS(); } - static void test_seal_merge_and_split(void) { void *ptr; From 07222371912c835d12b3541791d0516f50c7b45b Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Fri, 9 Aug 2024 10:26:18 -0700 Subject: [PATCH 089/416] memcg: replace memcg ID idr with xarray At the moment memcg IDs are managed through IDR which requires external synchronization mechanisms and makes the allocation code a bit awkward. Let's switch to xarray and make the code simpler. [shakeel.butt@linux.dev: fix error path in mem_cgroup_alloc(), per Dan] Link: https://lkml.kernel.org/r/20240815155402.3630804-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20240809172618.2946790-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Suggested-by: Matthew Wilcox Reviewed-by: Roman Gushchin Reviewed-by: Matthew Wilcox (Oracle) Acked-by: Johannes Weiner Reviewed-by: Muchun Song Acked-by: Michal Hocko Cc: Dan Carpenter Signed-off-by: Andrew Morton --- mm/memcontrol.c | 39 ++++++++++----------------------------- 1 file changed, 10 insertions(+), 29 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8971d3473a7b..cb2996d5d3f4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3363,29 +3363,12 @@ static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg) */ #define MEM_CGROUP_ID_MAX ((1UL << MEM_CGROUP_ID_SHIFT) - 1) -static DEFINE_IDR(mem_cgroup_idr); -static DEFINE_SPINLOCK(memcg_idr_lock); - -static int mem_cgroup_alloc_id(void) -{ - int ret; - - idr_preload(GFP_KERNEL); - spin_lock(&memcg_idr_lock); - ret = idr_alloc(&mem_cgroup_idr, NULL, 1, MEM_CGROUP_ID_MAX + 1, - GFP_NOWAIT); - spin_unlock(&memcg_idr_lock); - idr_preload_end(); - return ret; -} +static DEFINE_XARRAY_ALLOC1(mem_cgroup_ids); static void mem_cgroup_id_remove(struct mem_cgroup *memcg) { if (memcg->id.id > 0) { - spin_lock(&memcg_idr_lock); - idr_remove(&mem_cgroup_idr, memcg->id.id); - spin_unlock(&memcg_idr_lock); - + xa_erase(&mem_cgroup_ids, memcg->id.id); memcg->id.id = 0; } } @@ -3420,7 +3403,7 @@ static inline void mem_cgroup_id_put(struct mem_cgroup *memcg) struct mem_cgroup *mem_cgroup_from_id(unsigned short id) { WARN_ON_ONCE(!rcu_read_lock_held()); - return idr_find(&mem_cgroup_idr, id); + return xa_load(&mem_cgroup_ids, id); } #ifdef CONFIG_SHRINKER_DEBUG @@ -3513,17 +3496,17 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) struct mem_cgroup *memcg; int node, cpu; int __maybe_unused i; - long error = -ENOMEM; + long error; memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids), GFP_KERNEL); if (!memcg) - return ERR_PTR(error); + return ERR_PTR(-ENOMEM); - memcg->id.id = mem_cgroup_alloc_id(); - if (memcg->id.id < 0) { - error = memcg->id.id; + error = xa_alloc(&mem_cgroup_ids, &memcg->id.id, NULL, + XA_LIMIT(1, MEM_CGROUP_ID_MAX), GFP_KERNEL); + if (error) goto fail; - } + error = -ENOMEM; memcg->vmstats = kzalloc(sizeof(struct memcg_vmstats), GFP_KERNEL_ACCOUNT); @@ -3664,9 +3647,7 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css) * publish it here at the end of onlining. This matches the * regular ID destruction during offlining. 
*/ - spin_lock(&memcg_idr_lock); - idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); - spin_unlock(&memcg_idr_lock); + xa_store(&mem_cgroup_ids, memcg->id.id, memcg, GFP_KERNEL); return 0; offline_kmem: From 727d50a7e07259291981b5d84607dc9966def4b1 Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Fri, 9 Aug 2024 10:59:06 -0400 Subject: [PATCH 090/416] mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep) do_numa_page() and do_huge_pmd_numa_page() share a lot of common code. To reduce redundancy, move common code to numa_migrate_prep() and rename the function to numa_migrate_check() to reflect its functionality. Now do_huge_pmd_numa_page() also checks shared folios to set TNF_SHARED flag. Link: https://lkml.kernel.org/r/20240809145906.1513458-4-ziy@nvidia.com Signed-off-by: Zi Yan Suggested-by: David Hildenbrand Reviewed-by: "Huang, Ying" Reviewed-by: Baolin Wang Acked-by: David Hildenbrand Cc: Baolin Wang Cc: Kefeng Wang Cc: Mel Gorman Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/huge_memory.c | 29 +++++++++------------- mm/internal.h | 5 ++-- mm/memory.c | 63 +++++++++++++++++++++++++----------------------- 3 files changed, 47 insertions(+), 50 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 666fa675e5b6..f2fd3aabb67b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1669,22 +1669,23 @@ static inline bool can_change_pmd_writable(struct vm_area_struct *vma, vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; - pmd_t oldpmd = vmf->orig_pmd; - pmd_t pmd; struct folio *folio; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; int nid = NUMA_NO_NODE; - int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK); + int target_nid, last_cpupid; + pmd_t pmd, old_pmd; bool writable = false; int flags = 0; vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); - if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) { + old_pmd = pmdp_get(vmf->pmd); + + if (unlikely(!pmd_same(old_pmd, vmf->orig_pmd))) { spin_unlock(vmf->ptl); return 0; } - pmd = pmd_modify(oldpmd, vma->vm_page_prot); + pmd = pmd_modify(old_pmd, vma->vm_page_prot); /* * Detect now whether the PMD could be writable; this information @@ -1699,18 +1700,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) if (!folio) goto out_map; - /* See similar comment in do_numa_page for explanation */ - if (!writable) - flags |= TNF_NO_GROUP; - nid = folio_nid(folio); - /* - * For memory tiering mode, cpupid of slow memory page is used - * to record page access time. So use default value. 
- */ - if (!folio_use_access_time(folio)) - last_cpupid = folio_last_cpupid(folio); - target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags); + + target_nid = numa_migrate_check(folio, vmf, haddr, &flags, writable, + &last_cpupid); if (target_nid == NUMA_NO_NODE) goto out_map; if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) { @@ -1730,13 +1723,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) flags |= TNF_MIGRATE_FAIL; vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); - if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) { + if (unlikely(!pmd_same(pmdp_get(vmf->pmd), vmf->orig_pmd))) { spin_unlock(vmf->ptl); return 0; } out_map: /* Restore the PMD */ - pmd = pmd_modify(oldpmd, vma->vm_page_prot); + pmd = pmd_modify(pmdp_get(vmf->pmd), vma->vm_page_prot); pmd = pmd_mkyoung(pmd); if (writable) pmd = pmd_mkwrite(pmd, vma); diff --git a/mm/internal.h b/mm/internal.h index 1159b04e76a3..a479fe6e1895 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1191,8 +1191,9 @@ void vunmap_range_noflush(unsigned long start, unsigned long end); void __vunmap_range_noflush(unsigned long start, unsigned long end); -int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf, - unsigned long addr, int page_nid, int *flags); +int numa_migrate_check(struct folio *folio, struct vm_fault *vmf, + unsigned long addr, int *flags, bool writable, + int *last_cpupid); void free_zone_device_folio(struct folio *folio); int migrate_device_coherent_page(struct page *page); diff --git a/mm/memory.c b/mm/memory.c index 2ca87ceafede..d1c741a39630 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5200,16 +5200,43 @@ static vm_fault_t do_fault(struct vm_fault *vmf) return ret; } -int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf, - unsigned long addr, int page_nid, int *flags) +int numa_migrate_check(struct folio *folio, struct vm_fault *vmf, + unsigned long addr, int *flags, + bool writable, int *last_cpupid) { struct vm_area_struct *vma = vmf->vma; + /* + * Avoid grouping on RO pages in general. RO pages shouldn't hurt as + * much anyway since they can be in shared cache state. This misses + * the case where a mapping is writable but the process never writes + * to it but pte_write gets cleared during protection updates and + * pte_dirty has unpredictable behaviour between PTE scan updates, + * background writeback, dirty balancing and application behaviour. + */ + if (!writable) + *flags |= TNF_NO_GROUP; + + /* + * Flag if the folio is shared between multiple address spaces. This + * is later used when determining whether to group tasks together + */ + if (folio_likely_mapped_shared(folio) && (vma->vm_flags & VM_SHARED)) + *flags |= TNF_SHARED; + /* + * For memory tiering mode, cpupid of slow memory page is used + * to record page access time. So use default value. + */ + if (folio_use_access_time(folio)) + *last_cpupid = (-1 & LAST_CPUPID_MASK); + else + *last_cpupid = folio_last_cpupid(folio); + /* Record the current PID acceesing VMA */ vma_set_access_pid_bit(vma); count_vm_numa_event(NUMA_HINT_FAULTS); - if (page_nid == numa_node_id()) { + if (folio_nid(folio) == numa_node_id()) { count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL); *flags |= TNF_FAULT_LOCAL; } @@ -5311,35 +5338,11 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) if (!folio || folio_is_zone_device(folio)) goto out_map; - /* - * Avoid grouping on RO pages in general. RO pages shouldn't hurt as - * much anyway since they can be in shared cache state. 
This misses - * the case where a mapping is writable but the process never writes - * to it but pte_write gets cleared during protection updates and - * pte_dirty has unpredictable behaviour between PTE scan updates, - * background writeback, dirty balancing and application behaviour. - */ - if (!writable) - flags |= TNF_NO_GROUP; - - /* - * Flag if the folio is shared between multiple address spaces. This - * is later used when determining whether to group tasks together - */ - if (folio_likely_mapped_shared(folio) && (vma->vm_flags & VM_SHARED)) - flags |= TNF_SHARED; - nid = folio_nid(folio); nr_pages = folio_nr_pages(folio); - /* - * For memory tiering mode, cpupid of slow memory page is used - * to record page access time. So use default value. - */ - if (folio_use_access_time(folio)) - last_cpupid = (-1 & LAST_CPUPID_MASK); - else - last_cpupid = folio_last_cpupid(folio); - target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags); + + target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags, + writable, &last_cpupid); if (target_nid == NUMA_NO_NODE) goto out_map; if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) { From 3a80b8228f6ff0b3e56205231bef8ca69cb739d8 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 9 Aug 2024 14:48:48 +0300 Subject: [PATCH 091/416] mm: reduce deferred struct page init ifdeffery Patch series "mm: Fix several issues with unaccepted memory", v2. The patchset addresses several issues related to unaccepted memory. Pacth 1/7 preparatory cleanup. Patch 2/7 ensures that __alloc_pages_bulk() will not exhaust all accepted memory without accepting more. Patches 3/7-5/7 are preparations for patch 6/7, which fixes alloc_config_page() on machines with unaccepted memory. This allows, for example, the allocation of gigantic pages at runtime. Patch 7/7 enables the kernel to accept memory up to the promo watermark. This patch (of 7): Add dummy _deferred_grow_zone() for !DEFERRED_STRUCT_PAGE_INIT and remove #ifdefs in two places. No functional changes. Link: https://lkml.kernel.org/r/20240809114854.3745464-1-kirill.shutemov@linux.intel.com Link: https://lkml.kernel.org/r/20240809114854.3745464-3-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov Suggested-by: David Hildenbrand Cc: Borislav Petkov Cc: Johannes Weiner Cc: Matthew Wilcox Cc: Mel Gorman Cc: Mike Rapoport (Microsoft) Cc: Tom Lendacky Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/page_alloc.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 60742d057b05..006cab0d2f2f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -322,6 +322,11 @@ static inline bool deferred_pages_enabled(void) { return false; } + +static inline bool _deferred_grow_zone(struct zone *zone, unsigned int order) +{ + return false; +} #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ /* Return a pointer to the bitmap storing bits affecting a block of pages */ @@ -3395,7 +3400,6 @@ check_alloc_wmark: if (cond_accept_memory(zone, order)) goto try_this_zone; -#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT /* * Watermark failed for this zone, but see if we can * grow this zone if it contains deferred pages. 
@@ -3404,7 +3408,6 @@ check_alloc_wmark: if (_deferred_grow_zone(zone, order)) goto try_this_zone; } -#endif /* Checked here to keep the fast path fast */ BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); if (alloc_flags & ALLOC_NO_WATERMARKS) @@ -3450,13 +3453,11 @@ try_this_zone: if (cond_accept_memory(zone, order)) goto try_this_zone; -#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT /* Try again if zone has deferred pages */ if (deferred_pages_enabled()) { if (_deferred_grow_zone(zone, order)) goto try_this_zone; } -#endif } } From 4be9064baac0cdda43997e10cf9f706614057446 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 9 Aug 2024 14:48:49 +0300 Subject: [PATCH 092/416] mm: accept memory in __alloc_pages_bulk() Currently, the kernel only accepts memory in get_page_from_freelist(), but there is another path that directly takes pages from free lists - __alloc_page_bulk(). This function can consume all accepted memory and will resort to __alloc_pages_noprof() if necessary. Conditionally accepted in __alloc_pages_bulk(). The same issue may arise due to deferred page initialization. Kick the deferred initialization machinery before abandoning the zone, as the kernel does in get_page_from_freelist(). Link: https://lkml.kernel.org/r/20240809114854.3745464-4-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov Acked-by: David Hildenbrand Cc: Borislav Petkov Cc: Johannes Weiner Cc: Matthew Wilcox Cc: Mel Gorman Cc: Mike Rapoport (Microsoft) Cc: Tom Lendacky Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/page_alloc.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 006cab0d2f2f..9dde1b04a04e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4577,12 +4577,23 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid, goto failed; } + cond_accept_memory(zone, 0); +retry_this_zone: mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages; if (zone_watermark_fast(zone, 0, mark, zonelist_zone_idx(ac.preferred_zoneref), alloc_flags, gfp)) { break; } + + if (cond_accept_memory(zone, 0)) + goto retry_this_zone; + + /* Try again if zone has deferred pages */ + if (deferred_pages_enabled()) { + if (_deferred_grow_zone(zone, 0)) + goto retry_this_zone; + } } /* From 310183de7bb2ed114a80d057690ddb23d0cfe814 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 9 Aug 2024 14:48:50 +0300 Subject: [PATCH 093/416] mm: introduce PageUnaccepted() page type The new page type allows physical memory scanners to detect unaccepted memory and handle it accordingly. The page type is serialized with zone lock. Link: https://lkml.kernel.org/r/20240809114854.3745464-5-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. 
Shutemov Acked-by: David Hildenbrand Cc: Borislav Petkov Cc: Johannes Weiner Cc: Matthew Wilcox Cc: Mel Gorman Cc: Mike Rapoport (Microsoft) Cc: Tom Lendacky Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 8 ++++++++ mm/page_alloc.c | 2 ++ 2 files changed, 10 insertions(+) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index a0a29bd092f8..17175a328e6b 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -939,6 +939,7 @@ enum pagetype { PG_hugetlb = 0x04000000, PG_slab = 0x02000000, PG_zsmalloc = 0x01000000, + PG_unaccepted = 0x00800000, PAGE_TYPE_BASE = 0x80000000, @@ -1072,6 +1073,13 @@ FOLIO_TEST_FLAG_FALSE(hugetlb) PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc) +/* + * Mark pages that has to be accepted before touched for the first time. + * + * Serialized with zone lock. + */ +PAGE_TYPE_OPS(Unaccepted, unaccepted, unaccepted) + /** * PageHuge - Determine if the page belongs to hugetlbfs * @page: The page to test. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9dde1b04a04e..fec7ed5a04f2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6963,6 +6963,7 @@ static bool try_to_accept_memory_one(struct zone *zone) account_freepages(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); __mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES); + __ClearPageUnaccepted(page); spin_unlock_irqrestore(&zone->lock, flags); accept_page(page, MAX_PAGE_ORDER); @@ -7021,6 +7022,7 @@ static bool __free_unaccepted(struct page *page) list_add_tail(&page->lru, &zone->unaccepted_pages); account_freepages(zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); __mod_zone_page_state(zone, NR_UNACCEPTED, MAX_ORDER_NR_PAGES); + __SetPageUnaccepted(page); spin_unlock_irqrestore(&zone->lock, flags); if (first) From 5adfeaecc487e7023f1c7bbdc081707d7a93110f Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 9 Aug 2024 14:48:51 +0300 Subject: [PATCH 094/416] mm: rework accept memory helpers Make accept_memory() and range_contains_unaccepted_memory() take 'start' and 'size' arguments instead of 'start' and 'end'. Remove accept_page(), replacing it with direct calls to accept_memory(). The accept_page() name is going to be used for a different function. Link: https://lkml.kernel.org/r/20240809114854.3745464-6-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov Suggested-by: David Hildenbrand Cc: Borislav Petkov Cc: Johannes Weiner Cc: Matthew Wilcox Cc: Mel Gorman Cc: Mike Rapoport (Microsoft) Cc: Tom Lendacky Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/boot/compressed/misc.c | 2 +- arch/x86/boot/compressed/misc.h | 2 +- drivers/firmware/efi/libstub/efistub.h | 2 +- .../firmware/efi/libstub/unaccepted_memory.c | 3 ++- drivers/firmware/efi/unaccepted_memory.c | 18 ++++++++++-------- include/linux/mm.h | 12 +++++------- mm/memblock.c | 2 +- mm/mm_init.c | 2 +- mm/page_alloc.c | 19 +++---------------- tools/testing/memblock/internal.h | 2 +- 10 files changed, 26 insertions(+), 38 deletions(-) diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c index 944454306ef4..04a35b2c26e9 100644 --- a/arch/x86/boot/compressed/misc.c +++ b/arch/x86/boot/compressed/misc.c @@ -511,7 +511,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, unsigned char *output) if (init_unaccepted_memory()) { debug_putstr("Accepting memory... 
"); - accept_memory(__pa(output), __pa(output) + needed_size); + accept_memory(__pa(output), needed_size); } entry_offset = decompress_kernel(output, virt_addr, error); diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h index b353a7be380c..dd8d1a85f671 100644 --- a/arch/x86/boot/compressed/misc.h +++ b/arch/x86/boot/compressed/misc.h @@ -256,6 +256,6 @@ static inline bool init_unaccepted_memory(void) { return false; } /* Defined in EFI stub */ extern struct efi_unaccepted_memory *unaccepted_table; -void accept_memory(phys_addr_t start, phys_addr_t end); +void accept_memory(phys_addr_t start, unsigned long size); #endif /* BOOT_COMPRESSED_MISC_H */ diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h index d33ccbc4a2c6..685098f9626f 100644 --- a/drivers/firmware/efi/libstub/efistub.h +++ b/drivers/firmware/efi/libstub/efistub.h @@ -1229,7 +1229,7 @@ efi_zboot_entry(efi_handle_t handle, efi_system_table_t *systab); efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc, struct efi_boot_memmap *map); void process_unaccepted_memory(u64 start, u64 end); -void accept_memory(phys_addr_t start, phys_addr_t end); +void accept_memory(phys_addr_t start, unsigned long size); void arch_accept_memory(phys_addr_t start, phys_addr_t end); #endif diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c index c295ea3a6efc..757dbe734a47 100644 --- a/drivers/firmware/efi/libstub/unaccepted_memory.c +++ b/drivers/firmware/efi/libstub/unaccepted_memory.c @@ -177,9 +177,10 @@ void process_unaccepted_memory(u64 start, u64 end) start / unit_size, (end - start) / unit_size); } -void accept_memory(phys_addr_t start, phys_addr_t end) +void accept_memory(phys_addr_t start, unsigned long size) { unsigned long range_start, range_end; + phys_addr_t end = start + size; unsigned long bitmap_size; u64 unit_size; diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c index 50f6503fe49f..c2c067eff634 100644 --- a/drivers/firmware/efi/unaccepted_memory.c +++ b/drivers/firmware/efi/unaccepted_memory.c @@ -30,11 +30,12 @@ static LIST_HEAD(accepting_list); * - memory that is below phys_base; * - memory that is above the memory that addressable by the bitmap; */ -void accept_memory(phys_addr_t start, phys_addr_t end) +void accept_memory(phys_addr_t start, unsigned long size) { struct efi_unaccepted_memory *unaccepted; unsigned long range_start, range_end; struct accept_range range, *entry; + phys_addr_t end = start + size; unsigned long flags; u64 unit_size; @@ -74,13 +75,13 @@ void accept_memory(phys_addr_t start, phys_addr_t end) * "guard" page is accepted in addition to the memory that needs to be * used: * - * 1. Implicitly extend the range_contains_unaccepted_memory(start, end) - * checks up to end+unit_size if 'end' is aligned on a unit_size - * boundary. + * 1. Implicitly extend the range_contains_unaccepted_memory(start, size) + * checks up to the next unit_size if 'start+size' is aligned on a + * unit_size boundary. * - * 2. Implicitly extend accept_memory(start, end) to end+unit_size if - * 'end' is aligned on a unit_size boundary. (immediately following - * this comment) + * 2. Implicitly extend accept_memory(start, size) to the next unit_size + * if 'size+end' is aligned on a unit_size boundary. 
(immediately + * following this comment) */ if (!(end % unit_size)) end += unit_size; @@ -156,9 +157,10 @@ retry: spin_unlock_irqrestore(&unaccepted_memory_lock, flags); } -bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end) +bool range_contains_unaccepted_memory(phys_addr_t start, unsigned long size) { struct efi_unaccepted_memory *unaccepted; + phys_addr_t end = start + size; unsigned long flags; bool ret = false; u64 unit_size; diff --git a/include/linux/mm.h b/include/linux/mm.h index ee8cea73d415..b1eed30fdc06 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4062,18 +4062,18 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start, #ifdef CONFIG_UNACCEPTED_MEMORY -bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end); -void accept_memory(phys_addr_t start, phys_addr_t end); +bool range_contains_unaccepted_memory(phys_addr_t start, unsigned long size); +void accept_memory(phys_addr_t start, unsigned long size); #else static inline bool range_contains_unaccepted_memory(phys_addr_t start, - phys_addr_t end) + unsigned long size) { return false; } -static inline void accept_memory(phys_addr_t start, phys_addr_t end) +static inline void accept_memory(phys_addr_t start, unsigned long size) { } @@ -4081,9 +4081,7 @@ static inline void accept_memory(phys_addr_t start, phys_addr_t end) static inline bool pfn_is_unaccepted_memory(unsigned long pfn) { - phys_addr_t paddr = pfn << PAGE_SHIFT; - - return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE); + return range_contains_unaccepted_memory(pfn << PAGE_SHIFT, PAGE_SIZE); } void vma_pgtable_walk_begin(struct vm_area_struct *vma); diff --git a/mm/memblock.c b/mm/memblock.c index 3b9dc2d89b8a..0a77a748a8eb 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1500,7 +1500,7 @@ done: * * Accept the memory of the allocated buffer. 
*/ - accept_memory(found, found + size); + accept_memory(found, size); return found; } diff --git a/mm/mm_init.c b/mm/mm_init.c index 51960079875b..f3106b6299d3 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -1939,7 +1939,7 @@ static void __init deferred_free_pages(unsigned long pfn, } /* Accept chunks smaller than MAX_PAGE_ORDER upfront */ - accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages)); + accept_memory(PFN_PHYS(pfn), nr_pages * PAGE_SIZE); for (i = 0; i < nr_pages; i++, page++, pfn++) { if (pageblock_aligned(pfn)) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fec7ed5a04f2..6ba88929c1e3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -286,7 +286,6 @@ EXPORT_SYMBOL(nr_online_nodes); #endif static bool page_contains_unaccepted(struct page *page, unsigned int order); -static void accept_page(struct page *page, unsigned int order); static bool cond_accept_memory(struct zone *zone, unsigned int order); static inline bool has_unaccepted_memory(void); static bool __free_unaccepted(struct page *page); @@ -1268,7 +1267,7 @@ void __meminit __free_pages_core(struct page *page, unsigned int order, if (order == MAX_PAGE_ORDER && __free_unaccepted(page)) return; - accept_page(page, order); + accept_memory(page_to_phys(page), PAGE_SIZE << order); } /* @@ -6932,16 +6931,8 @@ early_param("accept_memory", accept_memory_parse); static bool page_contains_unaccepted(struct page *page, unsigned int order) { phys_addr_t start = page_to_phys(page); - phys_addr_t end = start + (PAGE_SIZE << order); - return range_contains_unaccepted_memory(start, end); -} - -static void accept_page(struct page *page, unsigned int order) -{ - phys_addr_t start = page_to_phys(page); - - accept_memory(start, start + (PAGE_SIZE << order)); + return range_contains_unaccepted_memory(start, PAGE_SIZE << order); } static bool try_to_accept_memory_one(struct zone *zone) @@ -6966,7 +6957,7 @@ static bool try_to_accept_memory_one(struct zone *zone) __ClearPageUnaccepted(page); spin_unlock_irqrestore(&zone->lock, flags); - accept_page(page, MAX_PAGE_ORDER); + accept_memory(page_to_phys(page), PAGE_SIZE << MAX_PAGE_ORDER); __free_pages_ok(page, MAX_PAGE_ORDER, FPI_TO_TAIL); @@ -7038,10 +7029,6 @@ static bool page_contains_unaccepted(struct page *page, unsigned int order) return false; } -static void accept_page(struct page *page, unsigned int order) -{ -} - static bool cond_accept_memory(struct zone *zone, unsigned int order) { return false; diff --git a/tools/testing/memblock/internal.h b/tools/testing/memblock/internal.h index f6c6e5474c3a..1cf82acb2a3e 100644 --- a/tools/testing/memblock/internal.h +++ b/tools/testing/memblock/internal.h @@ -20,7 +20,7 @@ void memblock_free_pages(struct page *page, unsigned long pfn, { } -static inline void accept_memory(phys_addr_t start, phys_addr_t end) +static inline void accept_memory(phys_addr_t start, unsigned long size) { } From 55ad43e8ba0f5ccc9792846479839c4affb04660 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 9 Aug 2024 14:48:52 +0300 Subject: [PATCH 095/416] mm: add a helper to accept page Accept a given struct page and add it free list. The help is useful for physical memory scanners that want to use free unaccepted memory. Link: https://lkml.kernel.org/r/20240809114854.3745464-7-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. 
Shutemov Acked-by: David Hildenbrand Cc: Borislav Petkov Cc: Johannes Weiner Cc: Matthew Wilcox Cc: Mel Gorman Cc: Mike Rapoport (Microsoft) Cc: Tom Lendacky Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/internal.h | 8 ++++++++ mm/page_alloc.c | 53 +++++++++++++++++++++++++++++++++++-------------- 2 files changed, 46 insertions(+), 15 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index a479fe6e1895..acda347620c6 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1432,4 +1432,12 @@ unsigned long move_page_tables(struct vm_area_struct *vma, unsigned long new_addr, unsigned long len, bool need_rmap_locks, bool for_stack); +#ifdef CONFIG_UNACCEPTED_MEMORY +void accept_page(struct page *page); +#else /* CONFIG_UNACCEPTED_MEMORY */ +static inline void accept_page(struct page *page) +{ +} +#endif /* CONFIG_UNACCEPTED_MEMORY */ + #endif /* __MM_INTERNAL_H */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6ba88929c1e3..927f4e111273 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6935,11 +6935,46 @@ static bool page_contains_unaccepted(struct page *page, unsigned int order) return range_contains_unaccepted_memory(start, PAGE_SIZE << order); } +static void __accept_page(struct zone *zone, unsigned long *flags, + struct page *page) +{ + bool last; + + list_del(&page->lru); + last = list_empty(&zone->unaccepted_pages); + + account_freepages(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); + __mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES); + __ClearPageUnaccepted(page); + spin_unlock_irqrestore(&zone->lock, *flags); + + accept_memory(page_to_phys(page), PAGE_SIZE << MAX_PAGE_ORDER); + + __free_pages_ok(page, MAX_PAGE_ORDER, FPI_TO_TAIL); + + if (last) + static_branch_dec(&zones_with_unaccepted_pages); +} + +void accept_page(struct page *page) +{ + struct zone *zone = page_zone(page); + unsigned long flags; + + spin_lock_irqsave(&zone->lock, flags); + if (!PageUnaccepted(page)) { + spin_unlock_irqrestore(&zone->lock, flags); + return; + } + + /* Unlocks zone->lock */ + __accept_page(zone, &flags, page); +} + static bool try_to_accept_memory_one(struct zone *zone) { unsigned long flags; struct page *page; - bool last; spin_lock_irqsave(&zone->lock, flags); page = list_first_entry_or_null(&zone->unaccepted_pages, @@ -6949,20 +6984,8 @@ static bool try_to_accept_memory_one(struct zone *zone) return false; } - list_del(&page->lru); - last = list_empty(&zone->unaccepted_pages); - - account_freepages(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); - __mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES); - __ClearPageUnaccepted(page); - spin_unlock_irqrestore(&zone->lock, flags); - - accept_memory(page_to_phys(page), PAGE_SIZE << MAX_PAGE_ORDER); - - __free_pages_ok(page, MAX_PAGE_ORDER, FPI_TO_TAIL); - - if (last) - static_branch_dec(&zones_with_unaccepted_pages); + /* Unlocks zone->lock */ + __accept_page(zone, &flags, page); return true; } From e44dd9b13392b4c8fe3bcdeb13c46759259fcabc Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 9 Aug 2024 14:48:53 +0300 Subject: [PATCH 096/416] mm: page_isolation: handle unaccepted memory isolation Page isolation machinery doesn't know anything about unaccepted memory and considers it non-free. It leads to alloc_contig_pages() failure. Treat unaccepted memory as free and accept memory on pageblock isolation. Once memory is accepted it becomes PageBuddy() and page isolation knows how to deal with them. 
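In rough terms (a condensed sketch of the two hunks below, not additional code in this patch, using the PageUnaccepted()/accept_page() helpers introduced earlier in the series), a physical memory scanner walking pfns can now handle such a block as:

	struct page *page = pfn_to_page(pfn);

	/* Accept the whole MAX_ORDER block first; it then shows up as PageBuddy(). */
	if (PageUnaccepted(page))
		accept_page(page);

	if (PageBuddy(page)) {
		/* usual free-page handling from here on */
	}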
Link: https://lkml.kernel.org/r/20240809114854.3745464-8-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov Cc: Borislav Petkov Cc: David Hildenbrand Cc: Johannes Weiner Cc: Matthew Wilcox Cc: Mel Gorman Cc: Mike Rapoport (Microsoft) Cc: Tom Lendacky Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/page_isolation.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 042937d5abe4..39fb8c07aeb7 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -152,6 +152,9 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ unsigned long flags; unsigned long check_unmovable_start, check_unmovable_end; + if (PageUnaccepted(page)) + accept_page(page); + spin_lock_irqsave(&zone->lock, flags); /* @@ -367,6 +370,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, VM_BUG_ON(!page); pfn = page_to_pfn(page); + if (PageUnaccepted(page)) { + pfn += MAX_ORDER_NR_PAGES; + continue; + } + if (PageBuddy(page)) { int order = buddy_order(page); From 59149bf8cea9e3b0e6368f68ae31bf91cdec1eb4 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 9 Aug 2024 14:48:54 +0300 Subject: [PATCH 097/416] mm: accept to promo watermark Commit c574bbe91703 ("NUMA balancing: optimize page placement for memory tiering system") introduced a new watermark above "high" -- "promo". Accept memory up to the highest watermark, which is WMARK_PROMO now. Link: https://lkml.kernel.org/r/20240809114854.3745464-9-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov Cc: Borislav Petkov Cc: David Hildenbrand Cc: Johannes Weiner Cc: Matthew Wilcox Cc: Mel Gorman Cc: Mike Rapoport (Microsoft) Cc: Tom Lendacky Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/page_alloc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 927f4e111273..56a93805561a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7001,8 +7001,8 @@ static bool cond_accept_memory(struct zone *zone, unsigned int order) if (list_empty(&zone->unaccepted_pages)) return false; - /* How much to accept to get to high watermark? */ - to_accept = high_wmark_pages(zone) - + /* How much to accept to get to promo watermark? */ + to_accept = promo_wmark_pages(zone) - (zone_page_state(zone, NR_FREE_PAGES) - __zone_watermark_unusable_free(zone, order, 0) - zone_page_state(zone, NR_UNACCEPTED)); From 6963f00813f49375360544fe923e62f2070601af Mon Sep 17 00:00:00 2001 From: Miao Wang Date: Wed, 14 Aug 2024 01:12:13 +0800 Subject: [PATCH 098/416] mm: vmalloc: add optimization hint on page existence check In commit 21e516b913c1 ("mm: vmalloc: dump page owner info if page is already mapped"), a BUG_ON macro was changed into an if statement, where the compiler optimization hint introduced in the BUG_ON macro was removed along with this change. This patch adds back the hint.
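For reference, the hint being restored is the kernel's standard unlikely() annotation, which (in its simplest form in include/linux/compiler.h) expands to a branch-prediction hint:

	# define unlikely(x)	__builtin_expect(!!(x), 0)

telling the compiler to treat the already-mapped case as the cold path.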
Link: https://lkml.kernel.org/r/20240814-fix_vmap_unlikely-v1-1-cd7954775f12@gmail.com Fixes: 21e516b913c1 ("mm: vmalloc: dump page owner info if page is already mapped") Signed-off-by: Miao Wang Cc: Christoph Hellwig Cc: Hariom Panthi Cc: "Uladzislau Rezki (Sony)" Signed-off-by: Andrew Morton --- mm/vmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 884e21fed28f..dfa8f09e7e9d 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -105,7 +105,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, if (!pte) return -ENOMEM; do { - if (!pte_none(ptep_get(pte))) { + if (unlikely(!pte_none(ptep_get(pte)))) { if (pfn_valid(pfn)) { page = pfn_to_page(pfn); dump_page(page, "remapping already mapped page"); From bceeeaed4817ba7ad9013b4116c97220a60fcf7c Mon Sep 17 00:00:00 2001 From: Yuanchu Xie Date: Tue, 13 Aug 2024 09:37:59 -0700 Subject: [PATCH 099/416] mm: multi-gen LRU: ignore non-leaf pmd_young for force_scan=true When non-leaf pmd accessed bits are available, MGLRU page table walks can clear the non-leaf pmd accessed bit and ignore the accessed bit on the pte if it's on a different node, skipping a generation update as well. If another scan occurs on the same node as said skipped pte. The non-leaf pmd accessed bit might remain cleared and the pte accessed bits won't be checked. While this is sufficient for reclaim-driven aging, where the goal is to select a reasonably cold page, the access can be missed when aging proactively for workingset estimation of a node/memcg. In more detail, get_pfn_folio returns NULL if the folio's nid != node under scanning, so the page table walk skips processing of said pte. Now the pmd_young flag on this pmd is cleared, and if none of the pte's are accessed before another scan occurs on the folio's node, the pmd_young check fails and the pte accessed bit is skipped. Since force_scan disables various other optimizations, we check force_scan to ignore the non-leaf pmd accessed bit. Link: https://lkml.kernel.org/r/20240813163759.742675-1-yuanchu@google.com Signed-off-by: Yuanchu Xie Acked-by: Yu Zhao Cc: "Huang, Ying" Cc: Lance Yang Signed-off-by: Andrew Morton --- mm/vmscan.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 96ce889ea3d0..75a55a855fef 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3481,7 +3481,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area goto next; if (!pmd_trans_huge(pmd[i])) { - if (should_clear_pmd_young()) + if (!walk->force_scan && should_clear_pmd_young()) pmdp_test_and_clear_young(vma, addr, pmd + i); goto next; } @@ -3568,7 +3568,7 @@ restart: walk->mm_stats[MM_NONLEAF_TOTAL]++; - if (should_clear_pmd_young()) { + if (!walk->force_scan && should_clear_pmd_young()) { if (!pmd_young(val)) continue; From 5b198b4759ef7e618817d390601740b30ff56041 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 12 Aug 2024 14:12:19 -0400 Subject: [PATCH 100/416] mm/dax: dump start address in fault handler Patch series "mm/mprotect: Fix dax puds", v5. Dax supports pud pages for a while, but mprotect on puds was missing since the start. This series tries to fix that by providing pud handling in mprotect(). The goal is to add more types of pud mappings like hugetlb or pfnmaps. This series paves way for it by fixing known pud entries. 
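For context, the case being fixed is roughly the following user-space pattern (a hypothetical sketch: the device-dax path and the 1 GiB size are assumptions for illustration, not taken from this series):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 1UL << 30;			/* one PUD-sized (1 GiB) extent */
		int fd = open("/dev/dax0.0", O_RDWR);	/* assumed device-dax node */
		char *p;

		if (fd < 0) {
			perror("open");
			return 1;
		}
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		p[0] = 1;	/* fault in the dax PUD mapping */
		/* Changing protection on the PUD-mapped range is what this
		 * series teaches mprotect() to handle. */
		if (mprotect(p, len, PROT_READ))
			perror("mprotect");
		munmap(p, len);
		close(fd);
		return 0;
	}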
Considering nobody reported this until when I looked at those other types of pud mappings, I am thinking maybe it doesn't need to be a fix for stable and this may not need to be backported. I would guess whoever cares about mprotect() won't care 1G dax puds yet, vice versa. I hope fixing that in new kernels would be fine, but I'm open to suggestions. There're a few small things changed to teach mprotect work on PUDs. E.g. it will need to start with dropping NUMA_HUGE_PTE_UPDATES which may stop making sense when there can be more than one type of huge pte. OTOH, we'll also need to push the mmu notifiers from pmd to pud layers, which might need some attention but so far I think it's safe. For such details, please refer to each patch's commit message. The mprotect() pud process should be straightforward, as I kept it as simple as possible. There's no NUMA handled as dax simply doesn't support that. There's also no userfault involvements as file memory (even if work with userfault-wp async mode) will need to split a pud, so pud entry doesn't need to yet know userfault's existance (but hugetlb entries will; that's also for later). This patch (of 7): Currently the dax fault handler dumps the vma range when dynamic debugging enabled. That's mostly not useful. Dump the (aligned) address instead with the order info. Link: https://lkml.kernel.org/r/20240812181225.1360970-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20240812181225.1360970-2-peterx@redhat.com Signed-off-by: Peter Xu Acked-by: David Hildenbrand Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Christophe Leroy Cc: Dan Williams Cc: Dave Hansen Cc: Dave Jiang Cc: David Rientjes Cc: "Edgecombe, Rick P" Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill A. Shutemov Cc: Matthew Wilcox Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Rik van Riel Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- drivers/dax/device.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 2051e4f73c8a..9c1a729cd77e 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -235,9 +235,9 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf, unsigned int order) int id; struct dev_dax *dev_dax = filp->private_data; - dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) order:%d\n", current->comm, - (vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read", - vmf->vma->vm_start, vmf->vma->vm_end, order); + dev_dbg(&dev_dax->dev, "%s: op=%s addr=%#lx order=%d\n", current->comm, + (vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read", + vmf->address & ~((1UL << (order + PAGE_SHIFT)) - 1), order); id = dax_read_lock(); if (order == 0) From 7f06e3aa2e836c11a70d161ae19dc4414eecd80d Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 12 Aug 2024 14:12:20 -0400 Subject: [PATCH 101/416] mm/mprotect: push mmu notifier to PUDs mprotect() does mmu notifiers in PMD levels. It's there since 2014 of commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to change_pmd_range"). At that time, the issue was that NUMA balancing can be applied on a huge range of VM memory, even if nothing was populated. The notification can be avoided in this case if no valid pmd detected, which includes either THP or a PTE pgtable page. Now to pave way for PUD handling, this isn't enough. We need to generate mmu notifications even on PUD entries properly. 
mprotect() is currently broken on PUDs (e.g., one can already trigger a kernel error with dax 1G mappings); this is the start of fixing it. To fix that, this patch proposes to push such notifications to the PUD layer. There is a risk of regressing the problem Rik wanted to resolve before, but I think it shouldn't really happen, and I still chose this solution for a few reasons: 1) Consider a large VM that should definitely contain more than GBs of memory; it's highly likely that PUDs are also none. In this case there will be no regression. 2) KVM has evolved a lot over the years to get rid of rmap walks, which might be the major cause of the previous soft-lockup. At least the TDP MMU already got rid of rmap as long as it's not nested (which should be the major use case, IIUC), so the TDP MMU pgtable walker will simply see an empty VM pgtable (e.g. EPT on x86), and the invalidation of a fully empty region could in most cases be pretty fast now, compared to 2014. 3) KVM now has explicit code paths to give way to mmu notifiers just like this one, e.g. in commit d02c357e5bfa ("KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing"). That will also avoid contention that may contribute to a soft-lockup. 4) Sticking with the PMD layer simply doesn't work when a PUD is there... We need one way or another to fix PUD mappings on mprotect(). Pushing it to the PUD layer should be the safest approach as of now, e.g. there's no sign yet of huge P4Ds coming on any known archs. Link: https://lkml.kernel.org/r/20240812181225.1360970-3-peterx@redhat.com Signed-off-by: Peter Xu Cc: Sean Christopherson Cc: Paolo Bonzini Cc: David Rientjes Cc: Rik van Riel Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Christophe Leroy Cc: Dan Williams Cc: Dave Hansen Cc: Dave Jiang Cc: David Hildenbrand Cc: "Edgecombe, Rick P" Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill A.
Shutemov Cc: Matthew Wilcox Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Thomas Gleixner Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mprotect.c | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 37cf8d249405..d423080e6509 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -363,9 +363,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb, unsigned long next; long pages = 0; unsigned long nr_huge_updates = 0; - struct mmu_notifier_range range; - - range.start = 0; pmd = pmd_offset(pud, addr); do { @@ -383,14 +380,6 @@ again: if (pmd_none(*pmd)) goto next; - /* invoke the mmu notifier if the pmd is populated */ - if (!range.start) { - mmu_notifier_range_init(&range, - MMU_NOTIFY_PROTECTION_VMA, 0, - vma->vm_mm, addr, end); - mmu_notifier_invalidate_range_start(&range); - } - _pmd = pmdp_get_lockless(pmd); if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) { if ((next - addr != HPAGE_PMD_SIZE) || @@ -431,9 +420,6 @@ next: cond_resched(); } while (pmd++, addr = next, addr != end); - if (range.start) - mmu_notifier_invalidate_range_end(&range); - if (nr_huge_updates) count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates); return pages; @@ -443,22 +429,36 @@ static inline long change_pud_range(struct mmu_gather *tlb, struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr, unsigned long end, pgprot_t newprot, unsigned long cp_flags) { + struct mmu_notifier_range range; pud_t *pud; unsigned long next; long pages = 0, ret; + range.start = 0; + pud = pud_offset(p4d, addr); do { next = pud_addr_end(addr, end); ret = change_prepare(vma, pud, pmd, addr, cp_flags); - if (ret) - return ret; + if (ret) { + pages = ret; + break; + } if (pud_none_or_clear_bad(pud)) continue; + if (!range.start) { + mmu_notifier_range_init(&range, + MMU_NOTIFY_PROTECTION_VMA, 0, + vma->vm_mm, addr, end); + mmu_notifier_invalidate_range_start(&range); + } pages += change_pmd_range(tlb, vma, pud, addr, next, newprot, cp_flags); } while (pud++, addr = next, addr != end); + if (range.start) + mmu_notifier_invalidate_range_end(&range); + return pages; } From 4dd7724f02abe48ff3bbbe873568754e011cfe4c Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 12 Aug 2024 14:12:21 -0400 Subject: [PATCH 102/416] mm/powerpc: add missing pud helpers Some new helpers will be needed for pud entry updates soon. Introduce these helpers by referencing the pmd ones. Namely: - pudp_invalidate(): this helper invalidates a huge pud before a split happens, so that the invalidated pud entry will make sure no race will happen (either with software, like a concurrent zap, or hardware, like a/d bit lost). - pud_modify(): this helper applies a new pgprot to an existing huge pud mapping. For more information on why we need these two helpers, please refer to the corresponding pmd helpers in the mprotect() code path. Link: https://lkml.kernel.org/r/20240812181225.1360970-4-peterx@redhat.com Signed-off-by: Peter Xu Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Dan Williams Cc: Dave Hansen Cc: Dave Jiang Cc: David Hildenbrand Cc: David Rientjes Cc: "Edgecombe, Rick P" Cc: Hugh Dickins Cc: Ingo Molnar Cc: Kirill A. 
Shutemov Cc: Matthew Wilcox Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Rik van Riel Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/powerpc/include/asm/book3s/64/pgtable.h | 3 +++ arch/powerpc/mm/book3s64/pgtable.c | 20 ++++++++++++++++++++ 2 files changed, 23 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 519b1743a0f4..5da92ba68a45 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -1124,6 +1124,7 @@ extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot); extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot); extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot); extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot); +extern pud_t pud_modify(pud_t pud, pgprot_t newprot); extern void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd); extern void set_pud_at(struct mm_struct *mm, unsigned long addr, @@ -1384,6 +1385,8 @@ static inline pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, #define __HAVE_ARCH_PMDP_INVALIDATE extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp); +extern pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, + pud_t *pudp); #define pmd_move_must_withdraw pmd_move_must_withdraw struct spinlock; diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index f4d8d3c40e5c..5a4a75369043 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -176,6 +176,17 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, return __pmd(old_pmd); } +pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, + pud_t *pudp) +{ + unsigned long old_pud; + + VM_WARN_ON_ONCE(!pud_present(*pudp)); + old_pud = pud_hugepage_update(vma->vm_mm, address, pudp, _PAGE_PRESENT, _PAGE_INVALID); + flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE); + return __pud(old_pud); +} + pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp, int full) { @@ -259,6 +270,15 @@ pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) pmdv &= _HPAGE_CHG_MASK; return pmd_set_protbits(__pmd(pmdv), newprot); } + +pud_t pud_modify(pud_t pud, pgprot_t newprot) +{ + unsigned long pudv; + + pudv = pud_val(pud); + pudv &= _HPAGE_CHG_MASK; + return pud_set_protbits(__pud(pudv), newprot); +} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ /* For use by kexec, called with MMU off */ From 144bb0aee33aa4185be7004affad23072ba82abd Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 12 Aug 2024 14:12:22 -0400 Subject: [PATCH 103/416] mm/x86: make pud_leaf() only care about PSE bit When working on mprotect() for 1G dax entries, I hit a zap bad pud error when zapping a huge pud that has PROT_NONE permission. The problem is that x86's pud_leaf() requires both the PRESENT and PSE bits to be set to report a pud entry as a leaf, which doesn't look right: it doesn't follow the pXd_leaf() definition that we have stuck with so far, where PROT_NONE entries should be reported as leaves. To fix it, change x86's pud_leaf() implementation to check only the PSE bit when reporting a leaf, irrespective of whether the PRESENT bit is set.
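To make the PROT_NONE case concrete, here is a small editorial sketch (not part of the patch) built from existing x86 helpers; "pfn" stands for any valid page frame number:

  /* A PROT_NONE huge pud: _PAGE_PROTNONE and _PAGE_PSE are set, _PAGE_PRESENT is clear. */
  pud_t pud = pud_mkhuge(pfn_pud(pfn, PAGE_NONE));

  /* Old check: requires both PSE and PRESENT, so this pud is misreported as a non-leaf. */
  bool old_leaf = (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
                  (_PAGE_PSE | _PAGE_PRESENT);   /* false */

  /* New check: only PSE matters, so the PROT_NONE pud is correctly treated as a leaf. */
  bool new_leaf = pud_val(pud) & _PAGE_PSE;      /* true */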
Link: https://lkml.kernel.org/r/20240812181225.1360970-5-peterx@redhat.com Signed-off-by: Peter Xu Acked-by: Dave Hansen Reviewed-by: David Hildenbrand Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: Aneesh Kumar K.V Cc: Christophe Leroy Cc: Dan Williams Cc: Dave Jiang Cc: David Rientjes Cc: "Edgecombe, Rick P" Cc: Hugh Dickins Cc: Kirill A. Shutemov Cc: Matthew Wilcox Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Rik van Riel Cc: Sean Christopherson Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/include/asm/pgtable.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index e39311a89bf4..a2a3bd4c1bda 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1078,8 +1078,7 @@ static inline pmd_t *pud_pgtable(pud_t pud) #define pud_leaf pud_leaf static inline bool pud_leaf(pud_t pud) { - return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) == - (_PAGE_PSE | _PAGE_PRESENT); + return pud_val(pud) & _PAGE_PSE; } static inline int pud_bad(pud_t pud) From 1c399e74a97ca733d0780cc51819aa81fb292c22 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 12 Aug 2024 14:12:23 -0400 Subject: [PATCH 104/416] mm/x86: implement arch_check_zapped_pud() Introduce arch_check_zapped_pud() to sanity check shadow stack on PUD zaps. It has the same logic as the PMD helper. One thing to mention is, it might be a good idea to use page_table_check in the future for trapping wrong setups of shadow stack pgtable entries [1]. That is left for the future as a separate effort. [1] https://lore.kernel.org/all/59d518698f664e07c036a5098833d7b56b953305.camel@intel.com Link: https://lkml.kernel.org/r/20240812181225.1360970-6-peterx@redhat.com Signed-off-by: Peter Xu Acked-by: David Hildenbrand Cc: "Edgecombe, Rick P" Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: Aneesh Kumar K.V Cc: Christophe Leroy Cc: Dan Williams Cc: Dave Jiang Cc: David Rientjes Cc: Hugh Dickins Cc: Kirill A. 
Shutemov Cc: Matthew Wilcox Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Rik van Riel Cc: Sean Christopherson Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/include/asm/pgtable.h | 10 ++++++++++ arch/x86/mm/pgtable.c | 6 ++++++ include/linux/pgtable.h | 6 ++++++ mm/huge_memory.c | 4 +++- 4 files changed, 25 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index a2a3bd4c1bda..fdb8ac9e7030 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -174,6 +174,13 @@ static inline int pud_young(pud_t pud) return pud_flags(pud) & _PAGE_ACCESSED; } +static inline bool pud_shstk(pud_t pud) +{ + return cpu_feature_enabled(X86_FEATURE_SHSTK) && + (pud_flags(pud) & (_PAGE_RW | _PAGE_DIRTY | _PAGE_PSE)) == + (_PAGE_DIRTY | _PAGE_PSE); +} + static inline int pte_write(pte_t pte) { /* @@ -1667,6 +1674,9 @@ void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte); #define arch_check_zapped_pmd arch_check_zapped_pmd void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd); +#define arch_check_zapped_pud arch_check_zapped_pud +void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud); + #ifdef CONFIG_XEN_PV #define arch_has_hw_nonleaf_pmd_young arch_has_hw_nonleaf_pmd_young static inline bool arch_has_hw_nonleaf_pmd_young(void) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index f5931499c2d6..36e7139a61d9 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -926,3 +926,9 @@ void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd) VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && pmd_shstk(pmd)); } + +void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud) +{ + /* See note in arch_check_zapped_pte() */ + VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && pud_shstk(pud)); +} diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 2a6a3cccfc36..780f3b439d98 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -447,6 +447,12 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma, } #endif +#ifndef arch_check_zapped_pud +static inline void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud) +{ +} +#endif + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long address, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f2fd3aabb67b..608ea551da16 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2283,12 +2283,14 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud, unsigned long addr) { spinlock_t *ptl; + pud_t orig_pud; ptl = __pud_trans_huge_lock(pud, vma); if (!ptl) return 0; - pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm); + orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm); + arch_check_zapped_pud(vma, orig_pud); tlb_remove_pud_tlb_entry(tlb, pud, addr); if (vma_is_special_huge(vma)) { spin_unlock(ptl); From 473f24902e6a55de952e87b28450b7445a928da4 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 12 Aug 2024 14:12:24 -0400 Subject: [PATCH 105/416] mm/x86: add missing pud helpers Some new helpers will be needed for pud entry updates soon. Introduce these helpers by referencing the pmd ones. 
Namely: - pudp_invalidate(): this helper invalidates a huge pud before a split happens, so that the invalidated pud entry will make sure no race will happen (either with software, like a concurrent zap, or hardware, like a/d bit lost). - pud_modify(): this helper applies a new pgprot to an existing huge pud mapping. For more information on why we need these two helpers, please refer to the corresponding pmd helpers in the mprotect() code path. When at it, simplify the pud_modify()/pmd_modify() comments on shadow stack pgtable entries to reference pte_modify() to avoid duplicating the whole paragraph three times. Link: https://lkml.kernel.org/r/20240812181225.1360970-7-peterx@redhat.com Signed-off-by: Peter Xu Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: Aneesh Kumar K.V Cc: Christophe Leroy Cc: Dan Williams Cc: Dave Jiang Cc: David Hildenbrand Cc: David Rientjes Cc: "Edgecombe, Rick P" Cc: Hugh Dickins Cc: Kirill A. Shutemov Cc: Matthew Wilcox Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Rik van Riel Cc: Sean Christopherson Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- arch/x86/include/asm/pgtable.h | 57 +++++++++++++++++++++++++++++----- arch/x86/mm/pgtable.c | 12 +++++++ 2 files changed, 61 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index fdb8ac9e7030..8d12bfad6a1d 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -787,6 +787,12 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd) __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE))); } +static inline pud_t pud_mkinvalid(pud_t pud) +{ + return pfn_pud(pud_pfn(pud), + __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE))); +} + static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask); static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) @@ -834,14 +840,8 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) pmd_result = __pmd(val); /* - * To avoid creating Write=0,Dirty=1 PMDs, pte_modify() needs to avoid: - * 1. Marking Write=0 PMDs Dirty=1 - * 2. Marking Dirty=1 PMDs Write=0 - * - * The first case cannot happen because the _PAGE_CHG_MASK will filter - * out any Dirty bit passed in newprot. Handle the second case by - * going through the mksaveddirty exercise. Only do this if the old - * value was Write=1 to avoid doing this on Shadow Stack PTEs. + * Avoid creating shadow stack PMD by accident. See comment in + * pte_modify(). */ if (oldval & _PAGE_RW) pmd_result = pmd_mksaveddirty(pmd_result); @@ -851,6 +851,29 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) return pmd_result; } +static inline pud_t pud_modify(pud_t pud, pgprot_t newprot) +{ + pudval_t val = pud_val(pud), oldval = val; + pud_t pud_result; + + val &= _HPAGE_CHG_MASK; + val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK; + val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK); + + pud_result = __pud(val); + + /* + * Avoid creating shadow stack PUD by accident. See comment in + * pte_modify(). 
+ */ + if (oldval & _PAGE_RW) + pud_result = pud_mksaveddirty(pud_result); + else + pud_result = pud_clear_saveddirty(pud_result); + + return pud_result; +} + /* * mprotect needs to preserve PAT and encryption bits when updating * vm_page_prot @@ -1389,10 +1412,28 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma, } #endif +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +static inline pud_t pudp_establish(struct vm_area_struct *vma, + unsigned long address, pud_t *pudp, pud_t pud) +{ + page_table_check_pud_set(vma->vm_mm, pudp, pud); + if (IS_ENABLED(CONFIG_SMP)) { + return xchg(pudp, pud); + } else { + pud_t old = *pudp; + WRITE_ONCE(*pudp, pud); + return old; + } +} +#endif + #define __HAVE_ARCH_PMDP_INVALIDATE_AD extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp); +pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, + pud_t *pudp); + /* * Page table pages are page-aligned. The lower half of the top * level is used for userspace and the top half for the kernel. diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 36e7139a61d9..5745a354a241 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -641,6 +641,18 @@ pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address, } #endif +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \ + defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) +pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, + pud_t *pudp) +{ + VM_WARN_ON_ONCE(!pud_present(*pudp)); + pud_t old = pudp_establish(vma, address, pudp, pud_mkinvalid(*pudp)); + flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE); + return old; +} +#endif + /** * reserve_top_address - reserves a hole in the top of kernel address space * @reserve - size of hole to reserve From cb0f01beb16669e91510fcdb2cea213931aee017 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 12 Aug 2024 14:12:25 -0400 Subject: [PATCH 106/416] mm/mprotect: fix dax pud handlings This is only relevant to the two archs that support PUD dax, aka, x86_64 and ppc64. PUD THPs do not yet exist elsewhere, and hugetlb PUDs do not count in this case. DAX has had PUD mappings for years, but the change protection path never worked. When the path is triggered in any form (a simple test program would be: call mprotect() on a 1G dev_dax mapping), the kernel will report "bad pud". This patch should fix that. The new change_huge_pud() tries to keep everything simple. For example, it doesn't optimize the write bit, as that would need even more PUD helpers. It's not too bad anyway to have one more write fault in the worst case, once per 1G range; it would be a bigger deal for each PAGE_SIZE, though. Neither does it support userfault-wp bits, as no such PUD mappings are supported; file mappings always need a split there. The same goes for TLB shootdown: the pmd path (which was x86 only) has the trick of using the _ad() version of pmdp_invalidate*(), which can avoid one redundant TLB flush, but let's also leave that for later. Again, the larger the mapping, the smaller such an effect is. There's some difference in handling "retry" for change_huge_pud() (where it can return 0): it isn't like change_huge_pmd(), as the pmd version is safe with all conditions handled in change_pte_range() later, thanks to Hugh's new pte_offset_map_lock(). In short, change_pte_range() is simply smarter. For that, change_pud_range() will need a proper retry if it races with something else and a huge PUD changes from under us.
The last thing to mention is currently the PUD path ignores the huge pte numa counter (NUMA_HUGE_PTE_UPDATES), not only because DAX is not applicable to NUMA, but also that it's ambiguous on its own to decide how to account pud in this case. In one earlier version of this patchset I proposed to remove the counter as it doesn't even look right to do the accounting as of now [1], but then a further discussion suggests we can leave that for later, as that doesn't block this series if we choose to ignore that counter. That's what this patch does, by ignoring it. When at it, touch up the comment in pgtable_split_needed() to make it generic to either pmd or pud file THPs. [1] https://lore.kernel.org/all/20240715192142.3241557-3-peterx@redhat.com/ [2] https://lore.kernel.org/r/added2d0-b8be-4108-82ca-1367a388d0b1@redhat.com Link: https://lkml.kernel.org/r/20240812181225.1360970-8-peterx@redhat.com Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages") Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud hugepage") Signed-off-by: Peter Xu Cc: Dan Williams Cc: Matthew Wilcox Cc: Dave Jiang Cc: Hugh Dickins Cc: Kirill A. Shutemov Cc: Vlastimil Babka Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: Michael Ellerman Cc: Aneesh Kumar K.V Cc: Oscar Salvador Cc: Christophe Leroy Cc: David Hildenbrand Cc: David Rientjes Cc: "Edgecombe, Rick P" Cc: Nicholas Piggin Cc: Paolo Bonzini Cc: Rik van Riel Cc: Sean Christopherson Signed-off-by: Andrew Morton --- include/linux/huge_mm.h | 24 +++++++++++++++++++ mm/huge_memory.c | 52 +++++++++++++++++++++++++++++++++++++++++ mm/mprotect.c | 39 ++++++++++++++++++++++++------- 3 files changed, 107 insertions(+), 8 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index ce44caa40eed..6370026689e0 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -342,6 +342,17 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address, void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, unsigned long address); +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, + pud_t *pudp, unsigned long addr, pgprot_t newprot, + unsigned long cp_flags); +#else +static inline int +change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, + pud_t *pudp, unsigned long addr, pgprot_t newprot, + unsigned long cp_flags) { return 0; } +#endif + #define split_huge_pud(__vma, __pud, __address) \ do { \ pud_t *____pud = (__pud); \ @@ -585,6 +596,19 @@ static inline int next_order(unsigned long *orders, int prev) { return 0; } + +static inline void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, + unsigned long address) +{ +} + +static inline int change_huge_pud(struct mmu_gather *tlb, + struct vm_area_struct *vma, pud_t *pudp, + unsigned long addr, pgprot_t newprot, + unsigned long cp_flags) +{ + return 0; +} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ static inline int split_folio_to_list_to_order(struct folio *folio, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 608ea551da16..ac9e8a77059d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2104,6 +2104,53 @@ unlock: return ret; } +/* + * Returns: + * + * - 0: if pud leaf changed from under us + * - 1: if pud can be skipped + * - HPAGE_PUD_NR: if pud was successfully processed + */ +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, + pud_t *pudp, 
unsigned long addr, pgprot_t newprot, + unsigned long cp_flags) +{ + struct mm_struct *mm = vma->vm_mm; + pud_t oldpud, entry; + spinlock_t *ptl; + + tlb_change_page_size(tlb, HPAGE_PUD_SIZE); + + /* NUMA balancing doesn't apply to dax */ + if (cp_flags & MM_CP_PROT_NUMA) + return 1; + + /* + * Huge entries on userfault-wp only works with anonymous, while we + * don't have anonymous PUDs yet. + */ + if (WARN_ON_ONCE(cp_flags & MM_CP_UFFD_WP_ALL)) + return 1; + + ptl = __pud_trans_huge_lock(pudp, vma); + if (!ptl) + return 0; + + /* + * Can't clear PUD or it can race with concurrent zapping. See + * change_huge_pmd(). + */ + oldpud = pudp_invalidate(vma, addr, pudp); + entry = pud_modify(oldpud, newprot); + set_pud_at(mm, addr, pudp, entry); + tlb_flush_pud_range(tlb, addr, HPAGE_PUD_SIZE); + + spin_unlock(ptl); + return HPAGE_PUD_NR; +} +#endif + #ifdef CONFIG_USERFAULTFD /* * The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by @@ -2334,6 +2381,11 @@ out: spin_unlock(ptl); mmu_notifier_invalidate_range_end(&range); } +#else +void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, + unsigned long address) +{ +} #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, diff --git a/mm/mprotect.c b/mm/mprotect.c index d423080e6509..446f8e5f10d9 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -302,8 +302,9 @@ pgtable_split_needed(struct vm_area_struct *vma, unsigned long cp_flags) { /* * pte markers only resides in pte level, if we need pte markers, - * we need to split. We cannot wr-protect shmem thp because file - * thp is handled differently when split by erasing the pmd so far. + * we need to split. For example, we cannot wr-protect a file thp + * (e.g. 2M shmem) because file thp is handled differently when + * split by erasing the pmd so far. 
*/ return (cp_flags & MM_CP_UFFD_WP) && !vma_is_anonymous(vma); } @@ -430,31 +431,53 @@ static inline long change_pud_range(struct mmu_gather *tlb, unsigned long end, pgprot_t newprot, unsigned long cp_flags) { struct mmu_notifier_range range; - pud_t *pud; + pud_t *pudp, pud; unsigned long next; long pages = 0, ret; range.start = 0; - pud = pud_offset(p4d, addr); + pudp = pud_offset(p4d, addr); do { +again: next = pud_addr_end(addr, end); - ret = change_prepare(vma, pud, pmd, addr, cp_flags); + ret = change_prepare(vma, pudp, pmd, addr, cp_flags); if (ret) { pages = ret; break; } - if (pud_none_or_clear_bad(pud)) + + pud = READ_ONCE(*pudp); + if (pud_none(pud)) continue; + if (!range.start) { mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0, vma->vm_mm, addr, end); mmu_notifier_invalidate_range_start(&range); } - pages += change_pmd_range(tlb, vma, pud, addr, next, newprot, + + if (pud_leaf(pud)) { + if ((next - addr != PUD_SIZE) || + pgtable_split_needed(vma, cp_flags)) { + __split_huge_pud(vma, pudp, addr); + goto again; + } else { + ret = change_huge_pud(tlb, vma, pudp, + addr, newprot, cp_flags); + if (ret == 0) + goto again; + /* huge pud was handled */ + if (ret == HPAGE_PUD_NR) + pages += HPAGE_PUD_NR; + continue; + } + } + + pages += change_pmd_range(tlb, vma, pudp, addr, next, newprot, cp_flags); - } while (pud++, addr = next, addr != end); + } while (pudp++, addr = next, addr != end); if (range.start) mmu_notifier_invalidate_range_end(&range); From b6273b55d88539c6a7127a697c61d3f89c5831fe Mon Sep 17 00:00:00 2001 From: Takaya Saeki Date: Tue, 13 Aug 2024 10:03:12 +0000 Subject: [PATCH 107/416] filemap: add trace events for get_pages, map_pages, and fault To allow precise tracking of page caches accessed, add new tracepoints that trigger when a process actually accesses them. The ureadahead program used by ChromeOS traces the disk access of programs as they start up at boot up. It uses mincore(2) or the 'mm_filemap_add_to_page_cache' trace event to accomplish this. It stores this information in a "pack" file and on subsequent boots, it will read the pack file and call readahead(2) on the information so that disk storage can be loaded into RAM before the applications actually need it. A problem we see is that due to the kernel's readahead algorithm that can aggressively pull in more data than needed (to try and accomplish the same goal) and this data is also recorded. The end result is that the pack file contains a lot of pages on disk that are never actually used. Calling readahead(2) on these unused pages can slow down the system boot up times. To solve this, add 3 new trace events, get_pages, map_pages, and fault. These will be used to trace the pages are not only pulled in from disk, but are actually used by the application. Only those pages will be stored in the pack file, and this helps out the performance of boot up. With the combination of these 3 new trace events and mm_filemap_add_to_page_cache, we observed a reduction in the pack file by 7.3% - 20% on ChromeOS varying by device. 
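As an editorial illustration (not part of the patch), the new events can be consumed through the usual tracefs interface once the patch is applied; the paths below assume tracefs is mounted at /sys/kernel/tracing:

  echo 1 > /sys/kernel/tracing/events/filemap/mm_filemap_fault/enable
  echo 1 > /sys/kernel/tracing/events/filemap/mm_filemap_get_pages/enable
  echo 1 > /sys/kernel/tracing/events/filemap/mm_filemap_map_pages/enable
  cat /sys/kernel/tracing/trace_pipe

Each record carries the device, inode and file offset (or offset range) of the page cache access, which is what a tool like ureadahead needs in order to keep only pages that were actually touched.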
Link: https://lkml.kernel.org/r/20240813100312.3930505-1-takayas@chromium.org Signed-off-by: Takaya Saeki Reviewed-by: Masami Hiramatsu (Google) Reviewed-by: Steven Rostedt (Google) Cc: Junichi Uekawa Cc: Mathieu Desnoyers Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- include/trace/events/filemap.h | 84 ++++++++++++++++++++++++++++++++++ mm/filemap.c | 4 ++ 2 files changed, 88 insertions(+) diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h index 46c89c1e460c..f48fe637bfd2 100644 --- a/include/trace/events/filemap.h +++ b/include/trace/events/filemap.h @@ -56,6 +56,90 @@ DEFINE_EVENT(mm_filemap_op_page_cache, mm_filemap_add_to_page_cache, TP_ARGS(folio) ); +DECLARE_EVENT_CLASS(mm_filemap_op_page_cache_range, + + TP_PROTO( + struct address_space *mapping, + pgoff_t index, + pgoff_t last_index + ), + + TP_ARGS(mapping, index, last_index), + + TP_STRUCT__entry( + __field(unsigned long, i_ino) + __field(dev_t, s_dev) + __field(unsigned long, index) + __field(unsigned long, last_index) + ), + + TP_fast_assign( + __entry->i_ino = mapping->host->i_ino; + if (mapping->host->i_sb) + __entry->s_dev = + mapping->host->i_sb->s_dev; + else + __entry->s_dev = mapping->host->i_rdev; + __entry->index = index; + __entry->last_index = last_index; + ), + + TP_printk( + "dev=%d:%d ino=%lx ofs=%lld-%lld", + MAJOR(__entry->s_dev), + MINOR(__entry->s_dev), __entry->i_ino, + ((loff_t)__entry->index) << PAGE_SHIFT, + ((((loff_t)__entry->last_index + 1) << PAGE_SHIFT) - 1) + ) +); + +DEFINE_EVENT(mm_filemap_op_page_cache_range, mm_filemap_get_pages, + TP_PROTO( + struct address_space *mapping, + pgoff_t index, + pgoff_t last_index + ), + TP_ARGS(mapping, index, last_index) +); + +DEFINE_EVENT(mm_filemap_op_page_cache_range, mm_filemap_map_pages, + TP_PROTO( + struct address_space *mapping, + pgoff_t index, + pgoff_t last_index + ), + TP_ARGS(mapping, index, last_index) +); + +TRACE_EVENT(mm_filemap_fault, + TP_PROTO(struct address_space *mapping, pgoff_t index), + + TP_ARGS(mapping, index), + + TP_STRUCT__entry( + __field(unsigned long, i_ino) + __field(dev_t, s_dev) + __field(unsigned long, index) + ), + + TP_fast_assign( + __entry->i_ino = mapping->host->i_ino; + if (mapping->host->i_sb) + __entry->s_dev = + mapping->host->i_sb->s_dev; + else + __entry->s_dev = mapping->host->i_rdev; + __entry->index = index; + ), + + TP_printk( + "dev=%d:%d ino=%lx ofs=%lld", + MAJOR(__entry->s_dev), + MINOR(__entry->s_dev), __entry->i_ino, + ((loff_t)__entry->index) << PAGE_SHIFT + ) +); + TRACE_EVENT(filemap_set_wb_err, TP_PROTO(struct address_space *mapping, errseq_t eseq), diff --git a/mm/filemap.c b/mm/filemap.c index 7665f2398d45..fcfec2a78a30 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2548,6 +2548,7 @@ retry: goto err; } + trace_mm_filemap_get_pages(mapping, index, last_index); return 0; err: if (err < 0) @@ -3279,6 +3280,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) if (unlikely(index >= max_idx)) return VM_FAULT_SIGBUS; + trace_mm_filemap_fault(mapping, index); + /* * Do we have something in the page cache already? 
*/ @@ -3645,6 +3648,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, } while ((folio = next_uptodate_folio(&xas, mapping, end_pgoff)) != NULL); add_mm_counter(vma->vm_mm, folio_type, rss); pte_unmap_unlock(vmf->pte, vmf->ptl); + trace_mm_filemap_map_pages(mapping, start_pgoff, end_pgoff); out: rcu_read_unlock(); From 67b9a353e171c3969223e53308feb15b722bb64a Mon Sep 17 00:00:00 2001 From: yangge Date: Tue, 13 Aug 2024 17:52:23 +0800 Subject: [PATCH 108/416] mm/swap: take folio refcount after testing the LRU flag Whoever passes a folio to __folio_batch_add_and_move() must hold a reference, otherwise something else would already be messed up. If the folio is referenced, it will not be freed elsewhere, so we can safely clear the folio's lru flag. As discussed with David in [1], we should take the reference after testing the LRU flag, not before. Link: https://lore.kernel.org/lkml/d41865b4-d6fa-49ba-890a-921eefad27dd@redhat.com/ [1] Link: https://lkml.kernel.org/r/1723542743-32179-1-git-send-email-yangge1116@126.com Signed-off-by: yangge Acked-by: David Hildenbrand Cc: Baolin Wang Cc: Yu Zhao Signed-off-by: Andrew Morton --- mm/swap.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 67a246772811..6b838986d294 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -226,12 +226,10 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch, { unsigned long flags; - folio_get(folio); - - if (on_lru && !folio_test_clear_lru(folio)) { - folio_put(folio); + if (on_lru && !folio_test_clear_lru(folio)) return; - } + + folio_get(folio); if (disable_irq) local_lock_irqsave(&cpu_fbatches.lock_irq, flags); From c0f398c3b2cf67976bca216f80668b9c93368385 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Mon, 12 Aug 2024 16:48:23 -0600 Subject: [PATCH 109/416] mm/hugetlb_vmemmap: batch HVO work when demoting Batch the HVO work, including de-HVO of the source and HVO of the destination hugeTLB folios, to speed up demotion. After commit bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers"), each request of HVO or de-HVO, batched or not, invokes synchronize_rcu() once. For example, when not batched, demoting one 1GB hugeTLB folio to 512 2MB hugeTLB folios invokes synchronize_rcu() 513 times (1 de-HVO plus 512 HVO requests), whereas when batched, only twice (1 de-HVO plus 1 HVO request). And the performance difference between the two cases is significant, e.g., echo 2048kB >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote_size time echo 100 >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote Before this patch: real 8m58.158s user 0m0.009s sys 0m5.900s After this patch: real 0m0.900s user 0m0.000s sys 0m0.851s Note that this patch changes the behavior of the `demote` interface when de-HVO fails. Before, the interface aborts immediately upon failure; now, it tries to finish an entire batch, meaning it can make extra progress if the rest of the batch contains folios that do not need to de-HVO. 
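As a quick editorial sketch (not part of the patch) of how to observe a batched demotion via the standard hugetlb sysfs files, where each demoted 1GB folio should show up as 512 additional 2MB folios:

  echo 2048kB > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote_size
  echo 2 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
  cat /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages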
Link: https://lkml.kernel.org/r/20240812224823.3914837-1-yuzhao@google.com Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers") Signed-off-by: Yu Zhao Reviewed-by: Muchun Song Cc: Signed-off-by: Andrew Morton --- mm/hugetlb.c | 186 +++++++++++++++++++++++++++++---------------------- 1 file changed, 107 insertions(+), 79 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1fdd9eab240c..d2b9555e6c45 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3921,100 +3921,124 @@ out: return 0; } -static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio) +static long demote_free_hugetlb_folios(struct hstate *src, struct hstate *dst, + struct list_head *src_list) { - int i, nid = folio_nid(folio); - struct hstate *target_hstate; - struct page *subpage; - struct folio *inner_folio; - int rc = 0; + long rc; + struct folio *folio, *next; + LIST_HEAD(dst_list); + LIST_HEAD(ret_list); - target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order); - - remove_hugetlb_folio(h, folio, false); - spin_unlock_irq(&hugetlb_lock); - - /* - * If vmemmap already existed for folio, the remove routine above would - * have cleared the hugetlb folio flag. Hence the folio is technically - * no longer a hugetlb folio. hugetlb_vmemmap_restore_folio can only be - * passed hugetlb folios and will BUG otherwise. - */ - if (folio_test_hugetlb(folio)) { - rc = hugetlb_vmemmap_restore_folio(h, folio); - if (rc) { - /* Allocation of vmemmmap failed, we can not demote folio */ - spin_lock_irq(&hugetlb_lock); - add_hugetlb_folio(h, folio, false); - return rc; - } - } - - /* - * Use destroy_compound_hugetlb_folio_for_demote for all huge page - * sizes as it will not ref count folios. - */ - destroy_compound_hugetlb_folio_for_demote(folio, huge_page_order(h)); + rc = hugetlb_vmemmap_restore_folios(src, src_list, &ret_list); + list_splice_init(&ret_list, src_list); /* * Taking target hstate mutex synchronizes with set_max_huge_pages. * Without the mutex, pages added to target hstate could be marked * as surplus. * - * Note that we already hold h->resize_lock. To prevent deadlock, + * Note that we already hold src->resize_lock. To prevent deadlock, * use the convention of always taking larger size hstate mutex first. */ - mutex_lock(&target_hstate->resize_lock); - for (i = 0; i < pages_per_huge_page(h); - i += pages_per_huge_page(target_hstate)) { - subpage = folio_page(folio, i); - inner_folio = page_folio(subpage); - if (hstate_is_gigantic(target_hstate)) - prep_compound_gigantic_folio_for_demote(inner_folio, - target_hstate->order); - else - prep_compound_page(subpage, target_hstate->order); - folio_change_private(inner_folio, NULL); - prep_new_hugetlb_folio(target_hstate, inner_folio, nid); - free_huge_folio(inner_folio); - } - mutex_unlock(&target_hstate->resize_lock); + mutex_lock(&dst->resize_lock); - spin_lock_irq(&hugetlb_lock); + list_for_each_entry_safe(folio, next, src_list, lru) { + int i; + + if (folio_test_hugetlb_vmemmap_optimized(folio)) + continue; + + list_del(&folio->lru); + /* + * Use destroy_compound_hugetlb_folio_for_demote for all huge page + * sizes as it will not ref count folios. 
+ */ + destroy_compound_hugetlb_folio_for_demote(folio, huge_page_order(src)); + + for (i = 0; i < pages_per_huge_page(src); i += pages_per_huge_page(dst)) { + struct page *page = folio_page(folio, i); + + if (hstate_is_gigantic(dst)) + prep_compound_gigantic_folio_for_demote(page_folio(page), + dst->order); + else + prep_compound_page(page, dst->order); + set_page_private(page, 0); + + init_new_hugetlb_folio(dst, page_folio(page)); + list_add(&page->lru, &dst_list); + } + } + + prep_and_add_allocated_folios(dst, &dst_list); + + mutex_unlock(&dst->resize_lock); + + return rc; +} + +static long demote_pool_huge_page(struct hstate *src, nodemask_t *nodes_allowed, + unsigned long nr_to_demote) + __must_hold(&hugetlb_lock) +{ + int nr_nodes, node; + struct hstate *dst; + long rc = 0; + long nr_demoted = 0; + + lockdep_assert_held(&hugetlb_lock); + + /* We should never get here if no demote order */ + if (!src->demote_order) { + pr_warn("HugeTLB: NULL demote order passed to demote_pool_huge_page.\n"); + return -EINVAL; /* internal error */ + } + dst = size_to_hstate(PAGE_SIZE << src->demote_order); + + for_each_node_mask_to_free(src, nr_nodes, node, nodes_allowed) { + LIST_HEAD(list); + struct folio *folio, *next; + + list_for_each_entry_safe(folio, next, &src->hugepage_freelists[node], lru) { + if (folio_test_hwpoison(folio)) + continue; + + remove_hugetlb_folio(src, folio, false); + list_add(&folio->lru, &list); + + if (++nr_demoted == nr_to_demote) + break; + } + + spin_unlock_irq(&hugetlb_lock); + + rc = demote_free_hugetlb_folios(src, dst, &list); + + spin_lock_irq(&hugetlb_lock); + + list_for_each_entry_safe(folio, next, &list, lru) { + list_del(&folio->lru); + add_hugetlb_folio(src, folio, false); + + nr_demoted--; + } + + if (rc < 0 || nr_demoted == nr_to_demote) + break; + } /* * Not absolutely necessary, but for consistency update max_huge_pages * based on pool changes for the demoted page. */ - h->max_huge_pages--; - target_hstate->max_huge_pages += - pages_per_huge_page(h) / pages_per_huge_page(target_hstate); + src->max_huge_pages -= nr_demoted; + dst->max_huge_pages += nr_demoted << (huge_page_order(src) - huge_page_order(dst)); - return rc; -} - -static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed) - __must_hold(&hugetlb_lock) -{ - int nr_nodes, node; - struct folio *folio; - - lockdep_assert_held(&hugetlb_lock); - - /* We should never get here if no demote order */ - if (!h->demote_order) { - pr_warn("HugeTLB: NULL demote order passed to demote_pool_huge_page.\n"); - return -EINVAL; /* internal error */ - } - - for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) { - list_for_each_entry(folio, &h->hugepage_freelists[node], lru) { - if (folio_test_hwpoison(folio)) - continue; - return demote_free_hugetlb_folio(h, folio); - } - } + if (rc < 0) + return rc; + if (nr_demoted) + return nr_demoted; /* * Only way to get here is if all pages on free lists are poisoned. * Return -EBUSY so that caller will not retry. @@ -4249,6 +4273,8 @@ static ssize_t demote_store(struct kobject *kobj, spin_lock_irq(&hugetlb_lock); while (nr_demote) { + long rc; + /* * Check for available pages to demote each time thorough the * loop as demote_pool_huge_page will drop hugetlb_lock. 
@@ -4261,11 +4287,13 @@ static ssize_t demote_store(struct kobject *kobj, if (!nr_available) break; - err = demote_pool_huge_page(h, n_mask); - if (err) + rc = demote_pool_huge_page(h, n_mask, nr_demote); + if (rc < 0) { + err = rc; break; + } - nr_demote--; + nr_demote -= rc; } spin_unlock_irq(&hugetlb_lock); From e1b8b883bb838339eec728de5d02e2f252a7d3d9 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Mon, 12 Aug 2024 15:05:42 -0400 Subject: [PATCH 110/416] maple_tree: reset mas->index and mas->last on write retries The following scenario can result in a race condition: Consider a node with the following indices and values a<------->b<----------->c<--------->d 0xA NULL 0xB CPU 1 CPU 2 --------- --------- mas_set_range(a,b) mas_erase() -> range is expanded (a,c) because of null expansion mas_nomem() mas_unlock() mas_store_range(b,c,0xC) The node now looks like: a<------->b<----------->c<--------->d 0xA 0xC 0xB mas_lock() mas_erase() <------ range of erase is still (a,c) The node is now NULL from (a,c) but the write from CPU 2 should have been retained and range (b,c) should still have 0xC as its value. We can fix this by re-intializing to the original index and last. This does not need a cc: Stable as there are no users of the maple tree which use internal locking and this condition is only possible with internal locking. Link: https://lkml.kernel.org/r/20240812190543.71967-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Reviewed-by: Liam R. Howlett Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- lib/maple_tree.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index aa3a5df15b8e..b547ff211ac7 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -5451,14 +5451,19 @@ EXPORT_SYMBOL_GPL(mas_store); */ int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp) { + unsigned long index = mas->index; + unsigned long last = mas->last; MA_WR_STATE(wr_mas, mas, entry); mas_wr_store_setup(&wr_mas); trace_ma_write(__func__, mas, 0, entry); retry: mas_wr_store_entry(&wr_mas); - if (unlikely(mas_nomem(mas, gfp))) + if (unlikely(mas_nomem(mas, gfp))) { + if (!entry) + __mas_set_range(mas, index, last); goto retry; + } if (unlikely(mas_is_err(mas))) return xa_err(mas->node); @@ -6245,23 +6250,26 @@ EXPORT_SYMBOL_GPL(mas_find_range_rev); void *mas_erase(struct ma_state *mas) { void *entry; + unsigned long index = mas->index; MA_WR_STATE(wr_mas, mas, NULL); if (!mas_is_active(mas) || !mas_is_start(mas)) mas->status = ma_start; - /* Retry unnecessary when holding the write lock. */ +write_retry: entry = mas_state_walk(mas); if (!entry) return NULL; -write_retry: /* Must reset to ensure spanning writes of last slot are detected */ mas_reset(mas); mas_wr_store_setup(&wr_mas); mas_wr_store_entry(&wr_mas); - if (mas_nomem(mas, GFP_KERNEL)) + if (mas_nomem(mas, GFP_KERNEL)) { + /* in case the range of entry changed when unlocked */ + mas->index = mas->last = index; goto write_retry; + } return entry; } From 617f8e4d76b81361884db852005f519012ddd244 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Mon, 12 Aug 2024 15:05:43 -0400 Subject: [PATCH 111/416] maple_tree: add test to replicate low memory race conditions Add new callback fields to the userspace implementation of struct kmem_cache. This allows for executing callback functions in order to further test low memory scenarios where node allocation is retried. This callback can help test race conditions by calling a function when a low memory event is tested. 
Link: https://lkml.kernel.org/r/20240812190543.71967-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Reviewed-by: Liam R. Howlett Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- lib/maple_tree.c | 13 +++++++ tools/testing/radix-tree/maple.c | 63 ++++++++++++++++++++++++++++++++ tools/testing/shared/linux.c | 26 ++++++++++++- 3 files changed, 101 insertions(+), 1 deletion(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index b547ff211ac7..14d7864b8d53 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -7005,6 +7005,19 @@ void mt_set_non_kernel(unsigned int val) kmem_cache_set_non_kernel(maple_node_cache, val); } +extern void kmem_cache_set_callback(struct kmem_cache *cachep, + void (*callback)(void *)); +void mt_set_callback(void (*callback)(void *)) +{ + kmem_cache_set_callback(maple_node_cache, callback); +} + +extern void kmem_cache_set_private(struct kmem_cache *cachep, void *private); +void mt_set_private(void *private) +{ + kmem_cache_set_private(maple_node_cache, private); +} + extern unsigned long kmem_cache_get_alloc(struct kmem_cache *); unsigned long mt_get_alloc_size(void) { diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c index cd1cf05503b4..ef5b83cf94ea 100644 --- a/tools/testing/radix-tree/maple.c +++ b/tools/testing/radix-tree/maple.c @@ -36224,6 +36224,65 @@ static noinline void __init check_mtree_dup(struct maple_tree *mt) extern void test_kmem_cache_bulk(void); +/* callback function used for check_nomem_writer_race() */ +static void writer2(void *maple_tree) +{ + struct maple_tree *mt = (struct maple_tree *)maple_tree; + MA_STATE(mas, mt, 6, 10); + + mtree_lock(mas.tree); + mas_store(&mas, xa_mk_value(0xC)); + mas_destroy(&mas); + mtree_unlock(mas.tree); +} + +/* + * check_nomem_writer_race() - test a possible race in the mas_nomem() path + * @mt: The tree to build. + * + * There is a possible race condition in low memory conditions when mas_nomem() + * gives up its lock. A second writer can chagne the entry that the primary + * writer executing the mas_nomem() path is modifying. This test recreates this + * scenario to ensure we are handling it correctly. 
+ */ +static void check_nomem_writer_race(struct maple_tree *mt) +{ + MA_STATE(mas, mt, 0, 5); + + mt_set_non_kernel(0); + /* setup root with 2 values with NULL in between */ + mtree_store_range(mt, 0, 5, xa_mk_value(0xA), GFP_KERNEL); + mtree_store_range(mt, 6, 10, NULL, GFP_KERNEL); + mtree_store_range(mt, 11, 15, xa_mk_value(0xB), GFP_KERNEL); + + /* setup writer 2 that will trigger the race condition */ + mt_set_private(mt); + mt_set_callback(writer2); + + mtree_lock(mt); + /* erase 0-5 */ + mas_erase(&mas); + + /* index 6-10 should retain the value from writer 2 */ + check_load(mt, 6, xa_mk_value(0xC)); + mtree_unlock(mt); + + /* test for the same race but with mas_store_gfp() */ + mtree_store_range(mt, 0, 5, xa_mk_value(0xA), GFP_KERNEL); + mtree_store_range(mt, 6, 10, NULL, GFP_KERNEL); + + mas_set_range(&mas, 0, 5); + mtree_lock(mt); + mas_store_gfp(&mas, NULL, GFP_KERNEL); + + /* ensure write made by writer 2 is retained */ + check_load(mt, 6, xa_mk_value(0xC)); + + mt_set_private(NULL); + mt_set_callback(NULL); + mtree_unlock(mt); +} + void farmer_tests(void) { struct maple_node *node; @@ -36257,6 +36316,10 @@ void farmer_tests(void) check_dfs_preorder(&tree); mtree_destroy(&tree); + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE | MT_FLAGS_USE_RCU); + check_nomem_writer_race(&tree); + mtree_destroy(&tree); + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE); check_prealloc(&tree); mtree_destroy(&tree); diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c index 4eb442206d01..17263696b5d8 100644 --- a/tools/testing/shared/linux.c +++ b/tools/testing/shared/linux.c @@ -26,8 +26,21 @@ struct kmem_cache { unsigned int non_kernel; unsigned long nr_allocated; unsigned long nr_tallocated; + bool exec_callback; + void (*callback)(void *); + void *private; }; +void kmem_cache_set_callback(struct kmem_cache *cachep, void (*callback)(void *)) +{ + cachep->callback = callback; +} + +void kmem_cache_set_private(struct kmem_cache *cachep, void *private) +{ + cachep->private = private; +} + void kmem_cache_set_non_kernel(struct kmem_cache *cachep, unsigned int val) { cachep->non_kernel = val; @@ -58,9 +71,17 @@ void *kmem_cache_alloc_lru(struct kmem_cache *cachep, struct list_lru *lru, { void *p; + if (cachep->exec_callback) { + if (cachep->callback) + cachep->callback(cachep->private); + cachep->exec_callback = false; + } + if (!(gfp & __GFP_DIRECT_RECLAIM)) { - if (!cachep->non_kernel) + if (!cachep->non_kernel) { + cachep->exec_callback = true; return NULL; + } cachep->non_kernel--; } @@ -223,6 +244,9 @@ kmem_cache_create(const char *name, unsigned int size, unsigned int align, ret->objs = NULL; ret->ctor = ctor; ret->non_kernel = 0; + ret->exec_callback = false; + ret->callback = NULL; + ret->private = NULL; return ret; } From 7a0529d0c2aa552498cc9b5c116f6e2ab8c2ac43 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Mon, 12 Aug 2024 15:09:24 +0000 Subject: [PATCH 112/416] maple_tree: fix comment typo of ma_root In comment of mas_start(), we lists the return value for different cases. In case of a single entry, we set mas->status to ma_root, while the comment uses mas_root, which is not a maple_status. Fix the typo according to the code. Link: https://lkml.kernel.org/r/20240812150925.31551-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Reviewed-by: Liam R. 
Howlett Signed-off-by: Andrew Morton --- lib/maple_tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 14d7864b8d53..b3a3d5a00577 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -1346,7 +1346,7 @@ static void mas_node_count(struct ma_state *mas, int count) * Return: * - If mas->node is an error or not mas_start, return NULL. * - If it's an empty tree: NULL & mas->status == ma_none - * - If it's a single entry: The entry & mas->status == mas_root + * - If it's a single entry: The entry & mas->status == ma_root * - If it's a tree: NULL & mas->status == safe root node. */ static inline struct maple_enode *mas_start(struct ma_state *mas) From c64d66153b3413095c57805e66b8b6e2ff8633c0 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Mon, 12 Aug 2024 15:09:25 +0000 Subject: [PATCH 113/416] maple_tree: fix comment typo with corresponding maple_status In comment of function mas_start(), we list the return value of different cases. According to the comment context, tell the maple_status here is more consistent with others. Let's correct it with ma_active in the case it's a tree. Link: https://lkml.kernel.org/r/20240812150925.31551-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang Reviewed-by: Liam R. Howlett Signed-off-by: Andrew Morton --- lib/maple_tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index b3a3d5a00577..884e2d130876 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -1347,7 +1347,7 @@ static void mas_node_count(struct ma_state *mas, int count) * - If mas->node is an error or not mas_start, return NULL. * - If it's an empty tree: NULL & mas->status == ma_none * - If it's a single entry: The entry & mas->status == ma_root - * - If it's a tree: NULL & mas->status == safe root node. + * - If it's a tree: NULL & mas->status == ma_active */ static inline struct maple_enode *mas_start(struct ma_state *mas) { From c36be0cdf63d64dfd65bcf27b8ed400696b1c27a Mon Sep 17 00:00:00 2001 From: Tianchen Ding Date: Mon, 12 Aug 2024 17:55:17 +0800 Subject: [PATCH 114/416] kfence: save freeing stack trace at calling time instead of freeing time For kmem_cache with SLAB_TYPESAFE_BY_RCU, the freeing trace stack at calling kmem_cache_free() is more useful. While the following stack is meaningless and provides no help: freed by task 46 on cpu 0 at 656.840729s: rcu_do_batch+0x1ab/0x540 nocb_cb_wait+0x8f/0x260 rcu_nocb_cb_kthread+0x25/0x80 kthread+0xd2/0x100 ret_from_fork+0x34/0x50 ret_from_fork_asm+0x1a/0x30 Link: https://lkml.kernel.org/r/20240812095517.2357-1-dtcccc@linux.alibaba.com Signed-off-by: Tianchen Ding Reviewed-by: Marco Elver Cc: Alexander Potapenko Cc: Dmitry Vyukov Signed-off-by: Andrew Morton --- mm/kfence/core.c | 39 +++++++++++++++++++++++++++++---------- mm/kfence/kfence.h | 1 + mm/kfence/report.c | 7 ++++--- 3 files changed, 34 insertions(+), 13 deletions(-) diff --git a/mm/kfence/core.c b/mm/kfence/core.c index c3ef7eb8d4dc..67fc321db79b 100644 --- a/mm/kfence/core.c +++ b/mm/kfence/core.c @@ -273,6 +273,13 @@ static inline unsigned long metadata_to_pageaddr(const struct kfence_metadata *m return pageaddr; } +static inline bool kfence_obj_allocated(const struct kfence_metadata *meta) +{ + enum kfence_object_state state = READ_ONCE(meta->state); + + return state == KFENCE_OBJECT_ALLOCATED || state == KFENCE_OBJECT_RCU_FREEING; +} + /* * Update the object's metadata state, including updating the alloc/free stacks * depending on the state transition. 
@@ -282,10 +289,14 @@ metadata_update_state(struct kfence_metadata *meta, enum kfence_object_state nex unsigned long *stack_entries, size_t num_stack_entries) { struct kfence_track *track = - next == KFENCE_OBJECT_FREED ? &meta->free_track : &meta->alloc_track; + next == KFENCE_OBJECT_ALLOCATED ? &meta->alloc_track : &meta->free_track; lockdep_assert_held(&meta->lock); + /* Stack has been saved when calling rcu, skip. */ + if (READ_ONCE(meta->state) == KFENCE_OBJECT_RCU_FREEING) + goto out; + if (stack_entries) { memcpy(track->stack_entries, stack_entries, num_stack_entries * sizeof(stack_entries[0])); @@ -301,6 +312,7 @@ metadata_update_state(struct kfence_metadata *meta, enum kfence_object_state nex track->cpu = raw_smp_processor_id(); track->ts_nsec = local_clock(); /* Same source as printk timestamps. */ +out: /* * Pairs with READ_ONCE() in * kfence_shutdown_cache(), @@ -506,7 +518,7 @@ static void kfence_guarded_free(void *addr, struct kfence_metadata *meta, bool z raw_spin_lock_irqsave(&meta->lock, flags); - if (meta->state != KFENCE_OBJECT_ALLOCATED || meta->addr != (unsigned long)addr) { + if (!kfence_obj_allocated(meta) || meta->addr != (unsigned long)addr) { /* Invalid or double-free, bail out. */ atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]); kfence_report_error((unsigned long)addr, false, NULL, meta, @@ -784,7 +796,7 @@ static void kfence_check_all_canary(void) for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) { struct kfence_metadata *meta = &kfence_metadata[i]; - if (meta->state == KFENCE_OBJECT_ALLOCATED) + if (kfence_obj_allocated(meta)) check_canary(meta); } } @@ -1010,12 +1022,11 @@ void kfence_shutdown_cache(struct kmem_cache *s) * the lock will not help, as different critical section * serialization will have the same outcome. */ - if (READ_ONCE(meta->cache) != s || - READ_ONCE(meta->state) != KFENCE_OBJECT_ALLOCATED) + if (READ_ONCE(meta->cache) != s || !kfence_obj_allocated(meta)) continue; raw_spin_lock_irqsave(&meta->lock, flags); - in_use = meta->cache == s && meta->state == KFENCE_OBJECT_ALLOCATED; + in_use = meta->cache == s && kfence_obj_allocated(meta); raw_spin_unlock_irqrestore(&meta->lock, flags); if (in_use) { @@ -1160,11 +1171,19 @@ void __kfence_free(void *addr) * the object, as the object page may be recycled for other-typed * objects once it has been freed. meta->cache may be NULL if the cache * was destroyed. + * Save the stack trace here so that reports show where the user freed + * the object. */ - if (unlikely(meta->cache && (meta->cache->flags & SLAB_TYPESAFE_BY_RCU))) + if (unlikely(meta->cache && (meta->cache->flags & SLAB_TYPESAFE_BY_RCU))) { + unsigned long flags; + + raw_spin_lock_irqsave(&meta->lock, flags); + metadata_update_state(meta, KFENCE_OBJECT_RCU_FREEING, NULL, 0); + raw_spin_unlock_irqrestore(&meta->lock, flags); call_rcu(&meta->rcu_head, rcu_guarded_free); - else + } else { kfence_guarded_free(addr, meta, false); + } } bool kfence_handle_page_fault(unsigned long addr, bool is_write, struct pt_regs *regs) @@ -1188,14 +1207,14 @@ bool kfence_handle_page_fault(unsigned long addr, bool is_write, struct pt_regs int distance = 0; meta = addr_to_metadata(addr - PAGE_SIZE); - if (meta && READ_ONCE(meta->state) == KFENCE_OBJECT_ALLOCATED) { + if (meta && kfence_obj_allocated(meta)) { to_report = meta; /* Data race ok; distance calculation approximate. 
*/ distance = addr - data_race(meta->addr + meta->size); } meta = addr_to_metadata(addr + PAGE_SIZE); - if (meta && READ_ONCE(meta->state) == KFENCE_OBJECT_ALLOCATED) { + if (meta && kfence_obj_allocated(meta)) { /* Data race ok; distance calculation approximate. */ if (!to_report || distance > data_race(meta->addr) - addr) to_report = meta; diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h index db87a05047bd..dfba5ea06b01 100644 --- a/mm/kfence/kfence.h +++ b/mm/kfence/kfence.h @@ -38,6 +38,7 @@ enum kfence_object_state { KFENCE_OBJECT_UNUSED, /* Object is unused. */ KFENCE_OBJECT_ALLOCATED, /* Object is currently allocated. */ + KFENCE_OBJECT_RCU_FREEING, /* Object was allocated, and then being freed by rcu. */ KFENCE_OBJECT_FREED, /* Object was allocated, and then freed. */ }; diff --git a/mm/kfence/report.c b/mm/kfence/report.c index 73a6fe42845a..451991a3a8f2 100644 --- a/mm/kfence/report.c +++ b/mm/kfence/report.c @@ -114,7 +114,8 @@ static void kfence_print_stack(struct seq_file *seq, const struct kfence_metadat /* Timestamp matches printk timestamp format. */ seq_con_printf(seq, "%s by task %d on cpu %d at %lu.%06lus (%lu.%06lus ago):\n", - show_alloc ? "allocated" : "freed", track->pid, + show_alloc ? "allocated" : meta->state == KFENCE_OBJECT_RCU_FREEING ? + "rcu freeing" : "freed", track->pid, track->cpu, (unsigned long)ts_sec, rem_nsec / 1000, (unsigned long)interval_nsec, rem_interval_nsec / 1000); @@ -149,7 +150,7 @@ void kfence_print_object(struct seq_file *seq, const struct kfence_metadata *met kfence_print_stack(seq, meta, true); - if (meta->state == KFENCE_OBJECT_FREED) { + if (meta->state == KFENCE_OBJECT_FREED || meta->state == KFENCE_OBJECT_RCU_FREEING) { seq_con_printf(seq, "\n"); kfence_print_stack(seq, meta, false); } @@ -318,7 +319,7 @@ bool __kfence_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *sla kpp->kp_slab_cache = meta->cache; kpp->kp_objp = (void *)meta->addr; kfence_to_kp_stack(&meta->alloc_track, kpp->kp_stack); - if (meta->state == KFENCE_OBJECT_FREED) + if (meta->state == KFENCE_OBJECT_FREED || meta->state == KFENCE_OBJECT_RCU_FREEING) kfence_to_kp_stack(&meta->free_track, kpp->kp_free_stack); /* get_stack_skipnr() ensures the first entry is outside allocator. */ kpp->kp_ret = kpp->kp_stack[0]; From 223febc6e5573cda5e94e6b38fcdaded068db777 Mon Sep 17 00:00:00 2001 From: Michael Ellerman Date: Mon, 12 Aug 2024 18:26:02 +1000 Subject: [PATCH 115/416] mm: add optional close() to struct vm_special_mapping Add an optional close() callback to struct vm_special_mapping. It will be used, by powerpc at least, to handle unmapping of the VDSO. Although support for unmapping the VDSO was initially added for CRIU[1], it is not desirable to guard that support behind CONFIG_CHECKPOINT_RESTORE. There are other known users of unmapping the VDSO which are not related to CRIU, eg. Valgrind [2] and void-ship [3]. The powerpc arch_unmap() hook has been in place for ~9 years, with no ifdef, so there may be other unknown users that have come to rely on unmapping the VDSO. Even if the code was behind an ifdef, major distros enable CHECKPOINT_RESTORE so users may not realise unmapping the VDSO depends on that configuration option. It's also undesirable to have such core mm behaviour behind a relatively obscure CONFIG option. Longer term the unmap behaviour should be standardised across architectures, however that is complicated by the fact the VDSO pointer is stored differently across architectures. 
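(For illustration only - this sketch is not part of the patch and the names are hypothetical - an architecture that wants per-mapping teardown would register the new hook roughly as follows, much like the powerpc patch later in this series does for its VDSO:

	static void example_vdso_close(const struct vm_special_mapping *sm,
				       struct vm_area_struct *vma)
	{
		/* forget the mapping so nothing keeps using a stale pointer */
		vma->vm_mm->context.vdso = NULL;
	}

	static struct vm_special_mapping example_vdso_spec = {
		.name  = "[vdso]",
		.close = example_vdso_close,
	};

)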
There was a previous attempt to unify that handling [4], which could be revived. See [5] for further discussion. [1]: commit 83d3f0e90c6c ("powerpc/mm: tracking vDSO remap") [2]: https://sourceware.org/git/?p=valgrind.git;a=commit;h=3a004915a2cbdcdebafc1612427576bf3321eef5 [3]: https://github.com/insanitybit/void-ship [4]: https://lore.kernel.org/lkml/20210611180242.711399-17-dima@arista.com/ [5]: https://lore.kernel.org/linuxppc-dev/shiq5v3jrmyi6ncwke7wgl76ojysgbhrchsk32q4lbx2hadqqc@kzyy2igem256 Link: https://lkml.kernel.org/r/20240812082605.743814-1-mpe@ellerman.id.au Signed-off-by: Michael Ellerman Suggested-by: Linus Torvalds Reviewed-by: David Hildenbrand Reviewed-by: Liam R. Howlett Cc: Christophe Leroy Cc: Jeff Xu Cc: Nicholas Piggin Cc: Pedro Falcato Cc: Thomas Gleixner Signed-off-by: Andrew Morton --- include/linux/mm_types.h | 3 +++ mm/mmap.c | 6 ++++++ 2 files changed, 9 insertions(+) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 003619fab20e..2419e60c9a7f 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1324,6 +1324,9 @@ struct vm_special_mapping { int (*mremap)(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma); + + void (*close)(const struct vm_special_mapping *sm, + struct vm_area_struct *vma); }; enum tlb_flush_reason { diff --git a/mm/mmap.c b/mm/mmap.c index 4a9c2329b09a..933fdc1491a7 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2045,10 +2045,16 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages) static vm_fault_t special_mapping_fault(struct vm_fault *vmf); /* + * Close hook, called for unmap() and on the old vma for mremap(). + * * Having a close hook prevents vma merging regardless of flags. */ static void special_mapping_close(struct vm_area_struct *vma) { + const struct vm_special_mapping *sm = vma->vm_private_data; + + if (sm->close) + sm->close(sm, vma); } static const char *special_mapping_name(struct vm_area_struct *vma) From 5463bafab476e430a385c6be58249a49b7549629 Mon Sep 17 00:00:00 2001 From: Michael Ellerman Date: Mon, 12 Aug 2024 18:26:03 +1000 Subject: [PATCH 116/416] powerpc/mm: handle VDSO unmapping via close() rather than arch_unmap() Add a close() callback to the VDSO special mapping to handle unmapping of the VDSO. That will make it possible to remove the arch_unmap() hook entirely in a subsequent patch. Link: https://lkml.kernel.org/r/20240812082605.743814-2-mpe@ellerman.id.au Signed-off-by: Michael Ellerman Suggested-by: Linus Torvalds Reviewed-by: David Hildenbrand Reviewed-by: Liam R. 
Howlett Cc: Christophe Leroy Cc: Jeff Xu Cc: Nicholas Piggin Cc: Pedro Falcato Cc: Thomas Gleixner Signed-off-by: Andrew Morton --- arch/powerpc/include/asm/mmu_context.h | 4 ---- arch/powerpc/kernel/vdso.c | 17 +++++++++++++++++ 2 files changed, 17 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index 37bffa0f7918..9b8c1555744e 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -263,10 +263,6 @@ extern void arch_exit_mmap(struct mm_struct *mm); static inline void arch_unmap(struct mm_struct *mm, unsigned long start, unsigned long end) { - unsigned long vdso_base = (unsigned long)mm->context.vdso; - - if (start <= vdso_base && vdso_base < end) - mm->context.vdso = NULL; } #ifdef CONFIG_PPC_MEM_KEYS diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c index 7a2ff9010f17..220a76cae7c1 100644 --- a/arch/powerpc/kernel/vdso.c +++ b/arch/powerpc/kernel/vdso.c @@ -81,6 +81,21 @@ static int vdso64_mremap(const struct vm_special_mapping *sm, struct vm_area_str return vdso_mremap(sm, new_vma, &vdso64_end - &vdso64_start); } +static void vdso_close(const struct vm_special_mapping *sm, struct vm_area_struct *vma) +{ + struct mm_struct *mm = vma->vm_mm; + + /* + * close() is called for munmap() but also for mremap(). In the mremap() + * case the vdso pointer has already been updated by the mremap() hook + * above, so it must not be set to NULL here. + */ + if (vma->vm_start != (unsigned long)mm->context.vdso) + return; + + mm->context.vdso = NULL; +} + static vm_fault_t vvar_fault(const struct vm_special_mapping *sm, struct vm_area_struct *vma, struct vm_fault *vmf); @@ -92,11 +107,13 @@ static struct vm_special_mapping vvar_spec __ro_after_init = { static struct vm_special_mapping vdso32_spec __ro_after_init = { .name = "[vdso]", .mremap = vdso32_mremap, + .close = vdso_close, }; static struct vm_special_mapping vdso64_spec __ro_after_init = { .name = "[vdso]", .mremap = vdso64_mremap, + .close = vdso_close, }; #ifdef CONFIG_TIME_NS From 40b88644dd92d99e572063893c2a247598c238d6 Mon Sep 17 00:00:00 2001 From: Michael Ellerman Date: Mon, 12 Aug 2024 18:26:04 +1000 Subject: [PATCH 117/416] mm: remove arch_unmap() Now that powerpc no longer uses arch_unmap() to handle VDSO unmapping, there are no meaningful implementions left. Drop support for it entirely, and update comments which refer to it. Link: https://lkml.kernel.org/r/20240812082605.743814-3-mpe@ellerman.id.au Signed-off-by: Michael Ellerman Suggested-by: Linus Torvalds Acked-by: David Hildenbrand Reviewed-by: Thomas Gleixner Reviewed-by: Liam R. 
Howlett Cc: Christophe Leroy Cc: Jeff Xu Cc: Nicholas Piggin Cc: Pedro Falcato Signed-off-by: Andrew Morton --- arch/powerpc/include/asm/mmu_context.h | 5 ----- arch/x86/include/asm/mmu_context.h | 5 ----- include/asm-generic/mm_hooks.h | 11 +++-------- mm/mmap.c | 4 +--- mm/vma.c | 8 ++------ 5 files changed, 6 insertions(+), 27 deletions(-) diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index 9b8c1555744e..a334a1368848 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -260,11 +260,6 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, extern void arch_exit_mmap(struct mm_struct *mm); -static inline void arch_unmap(struct mm_struct *mm, - unsigned long start, unsigned long end) -{ -} - #ifdef CONFIG_PPC_MEM_KEYS bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write, bool execute, bool foreign); diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 8dac45a2c7fc..80f2a3187aa6 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -232,11 +232,6 @@ static inline bool is_64bit_mm(struct mm_struct *mm) } #endif -static inline void arch_unmap(struct mm_struct *mm, unsigned long start, - unsigned long end) -{ -} - /* * We only want to enforce protection keys on the current process * because we effectively have no access to PKRU for other diff --git a/include/asm-generic/mm_hooks.h b/include/asm-generic/mm_hooks.h index 4dbb177d1150..6eea3b3c1e65 100644 --- a/include/asm-generic/mm_hooks.h +++ b/include/asm-generic/mm_hooks.h @@ -1,8 +1,8 @@ /* SPDX-License-Identifier: GPL-2.0 */ /* - * Define generic no-op hooks for arch_dup_mmap, arch_exit_mmap - * and arch_unmap to be included in asm-FOO/mmu_context.h for any - * arch FOO which doesn't need to hook these. + * Define generic no-op hooks for arch_dup_mmap and arch_exit_mmap + * to be included in asm-FOO/mmu_context.h for any arch FOO which + * doesn't need to hook these. */ #ifndef _ASM_GENERIC_MM_HOOKS_H #define _ASM_GENERIC_MM_HOOKS_H @@ -17,11 +17,6 @@ static inline void arch_exit_mmap(struct mm_struct *mm) { } -static inline void arch_unmap(struct mm_struct *mm, - unsigned long start, unsigned long end) -{ -} - static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write, bool execute, bool foreign) { diff --git a/mm/mmap.c b/mm/mmap.c index 933fdc1491a7..3af256bacef3 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1743,14 +1743,12 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, struct mm_struct *mm = vma->vm_mm; /* - * Check if memory is sealed before arch_unmap. - * Prevent unmapping a sealed VMA. + * Check if memory is sealed, prevent unmapping a sealed VMA. * can_modify_mm assumes we have acquired the lock on MM. */ if (unlikely(!can_modify_mm(mm, start, end))) return -EPERM; - arch_unmap(mm, start, end); return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock); } diff --git a/mm/vma.c b/mm/vma.c index bf0546fe6eab..84965f2cd580 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -841,7 +841,7 @@ map_count_exceeded: * * This function takes a @mas that is either pointing to the previous VMA or set * to MA_START and sets it up to remove the mapping(s). The @len will be - * aligned and any arch_unmap work will be preformed. + * aligned. * * Return: 0 on success and drops the lock if so directed, error and leaves the * lock held otherwise. 
@@ -861,16 +861,12 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, return -EINVAL; /* - * Check if memory is sealed before arch_unmap. - * Prevent unmapping a sealed VMA. + * Check if memory is sealed, prevent unmapping a sealed VMA. * can_modify_mm assumes we have acquired the lock on MM. */ if (unlikely(!can_modify_mm(mm, start, end))) return -EPERM; - /* arch_unmap() might do unmaps itself. */ - arch_unmap(mm, start, end); - /* Find the first overlapping VMA */ vma = vma_find(vmi, end); if (!vma) { From edb4a8bffde725db0c9db2b4a508e01f150dba40 Mon Sep 17 00:00:00 2001 From: Michael Ellerman Date: Mon, 12 Aug 2024 18:26:05 +1000 Subject: [PATCH 118/416] powerpc/vdso: refactor error handling Linus noticed that the error handling in __arch_setup_additional_pages() fails to clear the mm VDSO pointer if _install_special_mapping() fails. In practice there should be no actual bug, because if there's an error the VDSO pointer is cleared later in arch_setup_additional_pages(). However it's no longer necessary to set the pointer before installing the mapping. Commit c1bab64360e6 ("powerpc/vdso: Move to _install_special_mapping() and remove arch_vma_name()") reworked the code so that the VMA name comes from the vm_special_mapping.name, rather than relying on arch_vma_name(). So rework the code to only set the VDSO pointer once the mappings have been installed correctly, and remove the stale comment. Link: https://lkml.kernel.org/r/20240812082605.743814-4-mpe@ellerman.id.au Signed-off-by: Michael Ellerman Reviewed-by: Liam R. Howlett Cc: Christophe Leroy Cc: Jeff Xu Cc: Linus Torvalds Cc: Nicholas Piggin Cc: Pedro Falcato Cc: David Hildenbrand Cc: Thomas Gleixner Signed-off-by: Andrew Morton --- arch/powerpc/kernel/vdso.c | 18 +++++++----------- 1 file changed, 7 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c index 220a76cae7c1..ee4b9d676cff 100644 --- a/arch/powerpc/kernel/vdso.c +++ b/arch/powerpc/kernel/vdso.c @@ -214,13 +214,6 @@ static int __arch_setup_additional_pages(struct linux_binprm *bprm, int uses_int /* Add required alignment. */ vdso_base = ALIGN(vdso_base, VDSO_ALIGNMENT); - /* - * Put vDSO base into mm struct. We need to do this before calling - * install_special_mapping or the perf counter mmap tracking code - * will fail to recognise it as a vDSO. 
- */ - mm->context.vdso = (void __user *)vdso_base + vvar_size; - vma = _install_special_mapping(mm, vdso_base, vvar_size, VM_READ | VM_MAYREAD | VM_IO | VM_DONTDUMP | VM_PFNMAP, &vvar_spec); @@ -240,10 +233,15 @@ static int __arch_setup_additional_pages(struct linux_binprm *bprm, int uses_int vma = _install_special_mapping(mm, vdso_base + vvar_size, vdso_size, VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC, vdso_spec); - if (IS_ERR(vma)) + if (IS_ERR(vma)) { do_munmap(mm, vdso_base, vvar_size, NULL); + return PTR_ERR(vma); + } - return PTR_ERR_OR_ZERO(vma); + // Now that the mappings are in place, set the mm VDSO pointer + mm->context.vdso = (void __user *)vdso_base + vvar_size; + + return 0; } int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) @@ -257,8 +255,6 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) return -EINTR; rc = __arch_setup_additional_pages(bprm, uses_interp); - if (rc) - mm->context.vdso = NULL; mmap_write_unlock(mm); return rc; From 497258dfafcc6b88e7c4c46a38a479721d44cf41 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Tue, 20 Aug 2024 15:14:47 -0700 Subject: [PATCH 119/416] mm: remove legacy install_special_mapping() code All relevant architectures had already been converted to the new interface (which just has an underscore in front of the name - not very imaginative naming), this just force-converts the stragglers. The modern interface is almost identical to the old one, except instead of the page pointer it takes a "struct vm_special_mapping" that describes the mapping (and contains the page pointer as one member), and it returns the resulting 'vma' instead of just the error code. Getting rid of the old interface also gets rid of some special casing, which had caused problems with the mremap extensions to "struct vm_special_mapping". [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/CAHk-=whvR+z=0=0gzgdfUiK70JTa-=+9vxD-4T=3BagXR6dciA@mail.gmail.comTested-by: Rob Landley # arch/sh/ Link: https://lore.kernel.org/all/20240819195120.GA1113263@thelio-3990X/ Signed-off-by: Linus Torvalds Cc: Nathan Chancellor Cc: Michael Ellerman Cc: Anton Ivanov Cc: Brian Cain Cc: Christophe Leroy Cc: Dinh Nguyen Cc: Guo Ren Cc: Jeff Xu Cc: Johannes Berg Cc: John Paul Adrian Glaubitz Cc: Liam R. 
Howlett Cc: Nicholas Piggin Cc: Pedro Falcato Cc: Richard Weinberger Cc: Rich Felker Cc: Rob Landley Cc: Yoshinori Sato Signed-off-by: Andrew Morton --- arch/csky/kernel/vdso.c | 28 +++++++++++++++++++------- arch/hexagon/kernel/vdso.c | 14 +++++++++---- arch/nios2/mm/init.c | 12 +++++++---- arch/sh/kernel/vsyscall/vsyscall.c | 14 ++++++++++--- arch/x86/um/vdso/vma.c | 12 +++++++---- include/linux/mm.h | 4 ---- mm/mmap.c | 32 +++++------------------------- 7 files changed, 63 insertions(+), 53 deletions(-) diff --git a/arch/csky/kernel/vdso.c b/arch/csky/kernel/vdso.c index 2ca886e4a458..5c9ef63c29f1 100644 --- a/arch/csky/kernel/vdso.c +++ b/arch/csky/kernel/vdso.c @@ -45,9 +45,16 @@ arch_initcall(vdso_init); int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) { + struct vm_area_struct *vma; struct mm_struct *mm = current->mm; unsigned long vdso_base, vdso_len; int ret; + static struct vm_special_mapping vdso_mapping = { + .name = "[vdso]", + }; + static struct vm_special_mapping vvar_mapping = { + .name = "[vvar]", + }; vdso_len = (vdso_pages + 1) << PAGE_SHIFT; @@ -65,22 +72,29 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, */ mm->context.vdso = (void *)vdso_base; - ret = - install_special_mapping(mm, vdso_base, vdso_pages << PAGE_SHIFT, + vdso_mapping.pages = vdso_pagelist; + vma = + _install_special_mapping(mm, vdso_base, vdso_pages << PAGE_SHIFT, (VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC), - vdso_pagelist); + &vdso_mapping); - if (unlikely(ret)) { + if (IS_ERR(vma)) { + ret = PTR_ERR(vma); mm->context.vdso = NULL; goto end; } vdso_base += (vdso_pages << PAGE_SHIFT); - ret = install_special_mapping(mm, vdso_base, PAGE_SIZE, - (VM_READ | VM_MAYREAD), &vdso_pagelist[vdso_pages]); + vvar_mapping.pages = &vdso_pagelist[vdso_pages]; + vma = _install_special_mapping(mm, vdso_base, PAGE_SIZE, + (VM_READ | VM_MAYREAD), &vvar_mapping); - if (unlikely(ret)) + if (IS_ERR(vma)) { + ret = PTR_ERR(vma); mm->context.vdso = NULL; + goto end; + } + ret = 0; end: mmap_write_unlock(mm); return ret; diff --git a/arch/hexagon/kernel/vdso.c b/arch/hexagon/kernel/vdso.c index 2e4872d62124..6fd27ff1df73 100644 --- a/arch/hexagon/kernel/vdso.c +++ b/arch/hexagon/kernel/vdso.c @@ -51,7 +51,11 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) { int ret; unsigned long vdso_base; + struct vm_area_struct *vma; struct mm_struct *mm = current->mm; + static struct vm_special_mapping vdso_mapping = { + name = "[vdso]", + }; if (mmap_write_lock_killable(mm)) return -EINTR; @@ -66,16 +70,18 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) } /* MAYWRITE to allow gdb to COW and set breakpoints. 
*/ - ret = install_special_mapping(mm, vdso_base, PAGE_SIZE, + vdso_mapping.pages = &vdso_page; + vma = _install_special_mapping(mm, vdso_base, PAGE_SIZE, VM_READ|VM_EXEC| VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, - &vdso_page); + &vdso_mapping); - if (ret) + ret = PTR_ERR(vma); + if (IS_ERR(vma)) goto up_fail; mm->context.vdso = (void *)vdso_base; - + ret = 0; up_fail: mmap_write_unlock(mm); return ret; diff --git a/arch/nios2/mm/init.c b/arch/nios2/mm/init.c index 3459df28afee..a2278485de19 100644 --- a/arch/nios2/mm/init.c +++ b/arch/nios2/mm/init.c @@ -82,6 +82,10 @@ void __init mmu_init(void) pgd_t swapper_pg_dir[PTRS_PER_PGD] __aligned(PAGE_SIZE); pte_t invalid_pte_table[PTRS_PER_PTE] __aligned(PAGE_SIZE); static struct page *kuser_page[1]; +static struct vm_special_mapping vdso_mapping = { + .name = "[vdso]", + .pages = kuser_page, +}; static int alloc_kuser_page(void) { @@ -106,18 +110,18 @@ arch_initcall(alloc_kuser_page); int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) { struct mm_struct *mm = current->mm; - int ret; + struct vm_area_struct *vma; mmap_write_lock(mm); /* Map kuser helpers to user space address */ - ret = install_special_mapping(mm, KUSER_BASE, KUSER_SIZE, + vma = _install_special_mapping(mm, KUSER_BASE, KUSER_SIZE, VM_READ | VM_EXEC | VM_MAYREAD | - VM_MAYEXEC, kuser_page); + VM_MAYEXEC, &vdso_mapping); mmap_write_unlock(mm); - return ret; + return IS_ERR(vma) ? PTR_ERR(vma) : 0; } const char *arch_vma_name(struct vm_area_struct *vma) diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c index 1bd85a6949c4..add35c51e017 100644 --- a/arch/sh/kernel/vsyscall/vsyscall.c +++ b/arch/sh/kernel/vsyscall/vsyscall.c @@ -36,6 +36,10 @@ __setup("vdso=", vdso_setup); */ extern const char vsyscall_trapa_start, vsyscall_trapa_end; static struct page *syscall_pages[1]; +static struct vm_special_mapping vdso_mapping = { + .name = "[vdso]", + .pages = syscall_pages, +}; int __init vsyscall_init(void) { @@ -58,6 +62,7 @@ int __init vsyscall_init(void) int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) { struct mm_struct *mm = current->mm; + struct vm_area_struct *vma; unsigned long addr; int ret; @@ -70,14 +75,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) goto up_fail; } - ret = install_special_mapping(mm, addr, PAGE_SIZE, + vdso_mapping.pages = syscall_pages; + vma = _install_special_mapping(mm, addr, PAGE_SIZE, VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC, - syscall_pages); - if (unlikely(ret)) + &vdso_mapping); + ret = PTR_ERR(vma); + if (IS_ERR(vma)) goto up_fail; current->mm->context.vdso = (void *)addr; + ret = 0; up_fail: mmap_write_unlock(mm); diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c index 76d9f6ce7a3d..f238f7b33cdd 100644 --- a/arch/x86/um/vdso/vma.c +++ b/arch/x86/um/vdso/vma.c @@ -52,8 +52,11 @@ subsys_initcall(init_vdso); int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) { - int err; + struct vm_area_struct *vma; struct mm_struct *mm = current->mm; + static struct vm_special_mapping vdso_mapping = { + .name = "[vdso]", + }; if (!vdso_enabled) return 0; @@ -61,12 +64,13 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) if (mmap_write_lock_killable(mm)) return -EINTR; - err = install_special_mapping(mm, um_vdso_addr, PAGE_SIZE, + vdso_mapping.pages = vdsop; + vma = _install_special_mapping(mm, um_vdso_addr, PAGE_SIZE, VM_READ|VM_EXEC| VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, - 
vdsop); + &vdso_mapping); mmap_write_unlock(mm); - return err; + return IS_ERR(vma) ? PTR_ERR(vma) : 0; } diff --git a/include/linux/mm.h b/include/linux/mm.h index b1eed30fdc06..0fde51c0532f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3263,10 +3263,6 @@ extern struct vm_area_struct *_install_special_mapping(struct mm_struct *mm, unsigned long addr, unsigned long len, unsigned long flags, const struct vm_special_mapping *spec); -/* This is an obsolete alternative to _install_special_mapping. */ -extern int install_special_mapping(struct mm_struct *mm, - unsigned long addr, unsigned long len, - unsigned long flags, struct page **pages); unsigned long randomize_stack_top(unsigned long stack_top); unsigned long randomize_page(unsigned long start, unsigned long range); diff --git a/mm/mmap.c b/mm/mmap.c index 3af256bacef3..7cba84f8e3a5 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2094,27 +2094,17 @@ static const struct vm_operations_struct special_mapping_vmops = { .may_split = special_mapping_split, }; -static const struct vm_operations_struct legacy_special_mapping_vmops = { - .close = special_mapping_close, - .fault = special_mapping_fault, -}; - static vm_fault_t special_mapping_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; pgoff_t pgoff; struct page **pages; + struct vm_special_mapping *sm = vma->vm_private_data; - if (vma->vm_ops == &legacy_special_mapping_vmops) { - pages = vma->vm_private_data; - } else { - struct vm_special_mapping *sm = vma->vm_private_data; + if (sm->fault) + return sm->fault(sm, vmf->vma, vmf); - if (sm->fault) - return sm->fault(sm, vmf->vma, vmf); - - pages = sm->pages; - } + pages = sm->pages; for (pgoff = vmf->pgoff; pgoff && *pages; ++pages) pgoff--; @@ -2169,8 +2159,7 @@ bool vma_is_special_mapping(const struct vm_area_struct *vma, const struct vm_special_mapping *sm) { return vma->vm_private_data == sm && - (vma->vm_ops == &special_mapping_vmops || - vma->vm_ops == &legacy_special_mapping_vmops); + vma->vm_ops == &special_mapping_vmops; } /* @@ -2191,17 +2180,6 @@ struct vm_area_struct *_install_special_mapping( &special_mapping_vmops); } -int install_special_mapping(struct mm_struct *mm, - unsigned long addr, unsigned long len, - unsigned long vm_flags, struct page **pages) -{ - struct vm_area_struct *vma = __install_special_mapping( - mm, addr, len, vm_flags, (void *)pages, - &legacy_special_mapping_vmops); - - return PTR_ERR_OR_ZERO(vma); -} - /* * initialise the percpu counter for VM */ From 90a6f2a8f4423ff4ae17396c7933983926d9e655 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Tue, 13 Aug 2024 14:53:58 -0700 Subject: [PATCH 120/416] memcg: use ratelimited stats flush in the reclaim Meta production is seeing a large amount of stalls in the memcg stats flush from the memcg reclaim code path. At the moment, this specific callsite is doing a synchronous memcg stats flush. The rstat flush is an expensive and time-consuming operation, so concurrent reclaimers will busy-wait on the lock, potentially for a long time. This issue is not unique to Meta and has been observed by Cloudflare [1] as well. In the Cloudflare case, the stalls were due to contention between kswapd threads running on their 8-NUMA-node machines, which does not make sense, as the rstat flush is global and a flush from one kswapd thread should be sufficient for all. Simply replace the synchronous flush with the ratelimited one.
One may raise a concern about potentially using stats that are (at worst) 2 seconds stale for heuristics like the desirable inactive:active ratio and preferring inactive file pages over anon pages, but these specific heuristics do not require very precise stats and are also ignored under severe memory pressure. More specifically, for this code path the stats are needed for two heuristics:

1. Deactivate LRUs
2. Cache trim mode

The deactivate LRUs heuristic maintains a desirable inactive:active ratio of the LRUs. The specific stats needed are WORKINGSET_ACTIVATE* and the hierarchical LRU sizes. WORKINGSET_ACTIVATE* is needed to check whether there has been a refault since the last snapshot, and the LRU sizes are needed for the desirable ratio between the inactive and active LRUs. See the table below on how the desirable ratio is calculated.

/* total     target    max
 * memory    ratio     inactive
 * -------------------------------------
 *   10MB       1         5MB
 *  100MB       1        50MB
 *    1GB       3       250MB
 *   10GB      10       0.9GB
 *  100GB      31         3GB
 *    1TB     101        10GB
 *   10TB     320        32GB
 */

The desirable ratio only changes at the boundaries of 1 GiB, 10 GiB, 100 GiB, 1 TiB and 10 TiB. There is no need for precise and accurate LRU size information to calculate this ratio. In addition, if deactivation is skipped for some LRU, the kernel will force deactivation under severe memory pressure.

For the cache trim mode, the inactive file LRU size is read, the kernel scales it down based on the reclaim iteration (file >> sc->priority), and only checks whether it is zero or not. Again, precise information is not needed.

This patch has been running on the Meta fleet for several months and we have not observed any issues. Please note that MGLRU is not impacted by this issue at all, as it avoids rstat flushing completely.

Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@kernel.org [1] Link: https://lkml.kernel.org/r/20240813215358.2259750-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Cc: Jesper Dangaard Brouer Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Yosry Ahmed Cc: Yu Zhao Cc: Nhat Pham Signed-off-by: Andrew Morton --- mm/vmscan.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 75a55a855fef..243b62b9f89d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2264,10 +2264,11 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc) target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); /* - * Flush the memory cgroup stats, so that we read accurate per-memcg - * lruvec stats for heuristics. + * Flush the memory cgroup stats in rate-limited way as we don't need + * most accurate stats here. We may switch to regular stats flushing + * in the future once it is cheap enough. */ - mem_cgroup_flush_stats(sc->target_mem_cgroup); + mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup); /* * Determine the scan balance between anon and file LRUs. From 02f4bbefcada38cab86b365e5c795e0f858356bc Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Wed, 14 Aug 2024 17:34:15 +0800 Subject: [PATCH 121/416] mm: kmem: add lockdep assertion to obj_cgroup_memcg obj_cgroup_memcg() is supposed to be safe - that is, to prevent the returned memory cgroup from being freed - only when the caller is holding the rcu read lock, objcg_lock, or cgroup_mutex. It is very easy to overlook those conditions when users call some upper-level APIs which call obj_cgroup_memcg() internally, like mem_cgroup_from_slab_obj() (see the link below).
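For reference, the locking pattern that the new assertion enforces looks roughly like this (an illustrative sketch, not code from this patch; use_memcg() stands in for whatever the caller does with the result):

	struct mem_cgroup *memcg;

	rcu_read_lock();
	memcg = obj_cgroup_memcg(objcg);
	/* memcg cannot be released while the RCU read-side section is held */
	use_memcg(memcg);
	rcu_read_unlock();

A caller that reaches obj_cgroup_memcg() without the RCU read lock or cgroup_mutex held will now trip the assertion under lockdep.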
So it is better to add a lockdep assertion to obj_cgroup_memcg() to catch those issues as soon as possible. Because there is no user of obj_cgroup_memcg() that holds objcg_lock to keep the returned memory cgroup safe, do not add an objcg_lock assertion (we would have to export objcg_lock if we really wanted to). Additionally, this is an internal implementation detail of memcg and should not be accessible outside memcg code. Some users, like __mem_cgroup_uncharge(), do not care about the lifetime of the returned memory cgroup and just want to know whether the folio is charged to a memory cgroup; therefore, they do not need to hold the locks mentioned above. For that case, introduce a new helper, folio_memcg_charged(), to do this. Compared to folio_memcg(), it can eliminate a memory access of objcg->memcg for kmem - admittedly a really small gain. [songmuchun@bytedance.com: fix split_page_memcg()] Link: https://lkml.kernel.org/r/20240819080415.44964-1-songmuchun@bytedance.com Link: https://lore.kernel.org/all/20240718083607.42068-1-songmuchun@bytedance.com/ Link: https://lkml.kernel.org/r/20240814093415.17634-1-songmuchun@bytedance.com Signed-off-by: Muchun Song Acked-by: Shakeel Butt Acked-by: Roman Gushchin Acked-by: Vlastimil Babka Cc: Johannes Weiner Cc: Michal Hocko Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 20 +++++++++++++++++--- mm/memcontrol.c | 11 +++++------ 2 files changed, 22 insertions(+), 9 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1b79760af685..07eadf7ecbba 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -366,11 +366,11 @@ static inline bool folio_memcg_kmem(struct folio *folio); * After the initialization objcg->memcg is always pointing at * a valid memcg, but can be atomically swapped to the parent memcg. * - * The caller must ensure that the returned memcg won't be released: - * e.g. acquire the rcu_read_lock or css_set_lock. + * The caller must ensure that the returned memcg won't be released. */ static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg) { + lockdep_assert_once(rcu_read_lock_held() || lockdep_is_held(&cgroup_mutex)); return READ_ONCE(objcg->memcg); } @@ -444,6 +444,19 @@ static inline struct mem_cgroup *folio_memcg(struct folio *folio) return __folio_memcg(folio); } +/* + * folio_memcg_charged - If a folio is charged to a memory cgroup. + * @folio: Pointer to the folio. + * + * Returns true if folio is charged to a memory cgroup, otherwise returns false. + */ +static inline bool folio_memcg_charged(struct folio *folio) +{ + if (folio_memcg_kmem(folio)) + return __folio_objcg(folio) != NULL; + return __folio_memcg(folio) != NULL; +} + /** * folio_memcg_rcu - Locklessly get the memory cgroup associated with a folio. * @folio: Pointer to the folio.
@@ -460,7 +473,6 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio) unsigned long memcg_data = READ_ONCE(folio->memcg_data); VM_BUG_ON_FOLIO(folio_test_slab(folio), folio); - WARN_ON_ONCE(!rcu_read_lock_held()); if (memcg_data & MEMCG_DATA_KMEM) { struct obj_cgroup *objcg; @@ -469,6 +481,8 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio) return obj_cgroup_memcg(objcg); } + WARN_ON_ONCE(!rcu_read_lock_held()); + return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK); } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cb2996d5d3f4..96f85fd0e5b3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2374,7 +2374,7 @@ void mem_cgroup_cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages) static void commit_charge(struct folio *folio, struct mem_cgroup *memcg) { - VM_BUG_ON_FOLIO(folio_memcg(folio), folio); + VM_BUG_ON_FOLIO(folio_memcg_charged(folio), folio); /* * Any of the following ensures page's memcg stability: * @@ -3035,12 +3035,11 @@ void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void split_page_memcg(struct page *head, int old_order, int new_order) { struct folio *folio = page_folio(head); - struct mem_cgroup *memcg = folio_memcg(folio); int i; unsigned int old_nr = 1 << old_order; unsigned int new_nr = 1 << new_order; - if (mem_cgroup_disabled() || !memcg) + if (mem_cgroup_disabled() || !folio_memcg_charged(folio)) return; for (i = new_nr; i < old_nr; i += new_nr) @@ -3049,7 +3048,7 @@ void split_page_memcg(struct page *head, int old_order, int new_order) if (folio_memcg_kmem(folio)) obj_cgroup_get_many(__folio_objcg(folio), old_nr / new_nr - 1); else - css_get_many(&memcg->css, old_nr / new_nr - 1); + css_get_many(&folio_memcg(folio)->css, old_nr / new_nr - 1); } unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) @@ -4707,7 +4706,7 @@ void __mem_cgroup_uncharge(struct folio *folio) struct uncharge_gather ug; /* Don't touch folio->lru of any random page, pre-check: */ - if (!folio_memcg(folio)) + if (!folio_memcg_charged(folio)) return; uncharge_gather_clear(&ug); @@ -4752,7 +4751,7 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new) return; /* Page cache replacement: new folio already charged? */ - if (folio_memcg(new)) + if (folio_memcg_charged(new)) return; memcg = folio_memcg(old); From bd164d81a767f33c30d5c5b50b775631699ee976 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:28 -0400 Subject: [PATCH 122/416] maple_tree: introduce store_type enum Patch series "Introduce a store type enum for the Maple tree", v4. ================================ OVERVIEW ================================ This series implements two work items[3]: "aligning mas_store_gfp() with mas_preallocate()" and "enum for store type". mas_store_gfp() is modified to preallocate nodes. This simplies many of the write helper functions by allowing them to use mas_store_gfp() rather than open coding node allocation and error handling. The enum defines the following store types: enum store_type { wr_invalid, wr_new_root, wr_store_root, wr_exact_fit, wr_spanning_store, wr_split_store, wr_rebalance, wr_append, wr_node_store, wr_slot_store, }; In the current maple tree code, a walk down the tree is done in mas_preallocate() to determine the number of nodes needed for this write. After node allocation, mas_wr_store_entry() will perform another walk to determine which write helper function to use to complete the write. 
Rather than performing the second walk, we can store the type of write in the maple write state during node allocation and read this field to complete the write. Patches 1-16 implement this store type feature. Patch 17 is a cleanup patch to change functions that have unused return types to be void.

================================ RESULTS =================================

Phoronix t-test-1 (seconds, lower is better)

                             Threads: 1    Threads: 2
  v6.10-rc6                       33.15         10.81
  v6.10-rc6 + this series         32.69         10.45

Stress-ng mmap

                      6.10_base    store_type_v4
  Duration User         2744.65          2769.40
  Duration System      10862.69         10817.59
  Duration Elapsed       1477.58          1478.35

================================ TESTING =================================

Testing was done with the maple tree test suite. A new test case is also added to validate the order in which we test for and assign the store type.

[1]: https://lore.kernel.org/linux-mm/80926b22-a8d2-9992-eb5e-27e2c99cf460@google.com/T/#m81044feb66765265f8ca7f21e4b4b3725b18780a
[2]: https://lore.kernel.org/linux-mm/80926b22-a8d2-9992-eb5e-27e2c99cf460@google.com/T/#mb36c6526486638e82518c0f37a428fb279c84d8a
[3]: https://lists.infradead.org/pipermail/maple-tree/2023-December/003098.html

This patch (of 17): Add a store_type enum that is stored in ma_state. This will be used to keep track of partial walks of the tree so that subsequent walks can pick up where a previous walk left off. Link: https://lkml.kernel.org/r/20240814161944.55347-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20240814161944.55347-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Cc: Liam R. Howlett Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/linux/maple_tree.h | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h index a53ad4dabd7e..8e1504a81cd2 100644 --- a/include/linux/maple_tree.h +++ b/include/linux/maple_tree.h @@ -148,6 +148,18 @@ enum maple_type { maple_arange_64, }; +enum store_type { + wr_invalid, + wr_new_root, + wr_store_root, + wr_exact_fit, + wr_spanning_store, + wr_split_store, + wr_rebalance, + wr_append, + wr_node_store, + wr_slot_store, +}; /** * DOC: Maple tree flags @@ -436,6 +448,7 @@ struct ma_state { unsigned char offset; unsigned char mas_flags; unsigned char end; /* The end of the node */ + enum store_type store_type; /* The type of store needed for this operation */ }; struct ma_wr_state { @@ -477,6 +490,7 @@ struct ma_wr_state { .max = ULONG_MAX, \ .alloc = NULL, \ .mas_flags = 0, \ + .store_type = wr_invalid, \ } #define MA_WR_STATE(name, ma_state, wr_entry) \ From 19138a2cc1ad82b74fb720411a2054db0522be2d Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:29 -0400 Subject: [PATCH 123/416] maple_tree: introduce mas_wr_prealloc_setup() Introduce a helper function, mas_wr_prealloc_setup(), that will set up a maple write state in order to start a walk of a maple tree. Link: https://lkml.kernel.org/r/20240814161944.55347-3-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Cc: Liam R.
Howlett Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 884e2d130876..407c0be6e42f 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -5399,6 +5399,13 @@ reset: mas_reset(wr_mas->mas); } +static inline void mas_wr_prealloc_setup(struct ma_wr_state *wr_mas) +{ + struct ma_state *mas = wr_mas->mas; + + mas_wr_store_setup(wr_mas); + wr_mas->content = mas_start(mas); +} /* Interface */ /** @@ -5509,8 +5516,7 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp) if (unlikely(!mas->index && mas->last == ULONG_MAX)) goto ask_now; - mas_wr_store_setup(&wr_mas); - wr_mas.content = mas_start(mas); + mas_wr_prealloc_setup(&wr_mas); /* Root expand */ if (unlikely(mas_is_none(mas) || mas_is_ptr(mas))) goto ask_now; From 3cc6f42a53f79f8da52b252859145b6bf8dfe12e Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:30 -0400 Subject: [PATCH 124/416] maple_tree: move up mas_wr_store_setup() and mas_wr_prealloc_setup() Subsequent patches require these definitions to be higher, no functional changes intended. Link: https://lkml.kernel.org/r/20240814161944.55347-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Reviewed-by: Liam R. Howlett Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 96 ++++++++++++++++++++++++------------------------ 1 file changed, 48 insertions(+), 48 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 407c0be6e42f..de4a91ced8ca 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -4227,6 +4227,54 @@ static inline void mas_wr_store_entry(struct ma_wr_state *wr_mas) mas_wr_modify(wr_mas); } +static void mas_wr_store_setup(struct ma_wr_state *wr_mas) +{ + if (!mas_is_active(wr_mas->mas)) { + if (mas_is_start(wr_mas->mas)) + return; + + if (unlikely(mas_is_paused(wr_mas->mas))) + goto reset; + + if (unlikely(mas_is_none(wr_mas->mas))) + goto reset; + + if (unlikely(mas_is_overflow(wr_mas->mas))) + goto reset; + + if (unlikely(mas_is_underflow(wr_mas->mas))) + goto reset; + } + + /* + * A less strict version of mas_is_span_wr() where we allow spanning + * writes within this node. This is to stop partial walks in + * mas_prealloc() from being reset. + */ + if (wr_mas->mas->last > wr_mas->mas->max) + goto reset; + + if (wr_mas->entry) + return; + + if (mte_is_leaf(wr_mas->mas->node) && + wr_mas->mas->last == wr_mas->mas->max) + goto reset; + + return; + +reset: + mas_reset(wr_mas->mas); +} + +static inline void mas_wr_prealloc_setup(struct ma_wr_state *wr_mas) +{ + struct ma_state *mas = wr_mas->mas; + + mas_wr_store_setup(wr_mas); + wr_mas->content = mas_start(mas); +} + /** * mas_insert() - Internal call to insert a value * @mas: The maple state @@ -5358,54 +5406,6 @@ static inline void mte_destroy_walk(struct maple_enode *enode, mt_destroy_walk(enode, mt, true); } } - -static void mas_wr_store_setup(struct ma_wr_state *wr_mas) -{ - if (!mas_is_active(wr_mas->mas)) { - if (mas_is_start(wr_mas->mas)) - return; - - if (unlikely(mas_is_paused(wr_mas->mas))) - goto reset; - - if (unlikely(mas_is_none(wr_mas->mas))) - goto reset; - - if (unlikely(mas_is_overflow(wr_mas->mas))) - goto reset; - - if (unlikely(mas_is_underflow(wr_mas->mas))) - goto reset; - } - - /* - * A less strict version of mas_is_span_wr() where we allow spanning - * writes within this node. 
This is to stop partial walks in - * mas_prealloc() from being reset. - */ - if (wr_mas->mas->last > wr_mas->mas->max) - goto reset; - - if (wr_mas->entry) - return; - - if (mte_is_leaf(wr_mas->mas->node) && - wr_mas->mas->last == wr_mas->mas->max) - goto reset; - - return; - -reset: - mas_reset(wr_mas->mas); -} - -static inline void mas_wr_prealloc_setup(struct ma_wr_state *wr_mas) -{ - struct ma_state *mas = wr_mas->mas; - - mas_wr_store_setup(wr_mas); - wr_mas->content = mas_start(mas); -} /* Interface */ /** From 5d659bbb52a24f91cc6c0cf546ffcc0faa7e8f1a Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:31 -0400 Subject: [PATCH 125/416] maple_tree: introduce mas_wr_store_type() Introduce mas_wr_store_type() which will set the correct store type based on a walk of the tree. In mas_wr_node_store() the <= min_slots condition is changed to < as if new_end is = to mt_min_slots then there is not enough room. mas_prealloc_calc() is also introduced to abstract the calculation used to determine the number of nodes needed for a store operation. In this change a call to mas_reset() is removed in the error case of mas_prealloc(). This is only needed in the MA_STATE_REBALANCE case of mas_destroy(). We can move the call to mas_reset() directly to mas_destroy(). Also, add a test case to validate the order that we check the store type in is correct. This test models a vma expanding and then shrinking which is part of the boot process. Link: https://lkml.kernel.org/r/20240814161944.55347-5-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Cc: Liam R. Howlett Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 219 ++++++++++++++++++++++--------- tools/testing/radix-tree/maple.c | 36 +++++ 2 files changed, 194 insertions(+), 61 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index de4a91ced8ca..d0b9b3795b96 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -1372,9 +1372,9 @@ retry: return NULL; } + mas->node = NULL; /* empty tree */ if (unlikely(!root)) { - mas->node = NULL; mas->status = ma_none; mas->offset = MAPLE_NODE_SLOTS; return NULL; @@ -3890,7 +3890,7 @@ static inline bool mas_wr_node_store(struct ma_wr_state *wr_mas, bool in_rcu = mt_in_rcu(mas->tree); /* Check if there is enough data. The room is enough. */ - if (!mte_is_root(mas->node) && (new_end <= mt_min_slots[wr_mas->type]) && + if (!mte_is_root(mas->node) && (new_end < mt_min_slots[wr_mas->type]) && !(mas->mas_flags & MA_STATE_BULK)) return false; @@ -4275,6 +4275,146 @@ static inline void mas_wr_prealloc_setup(struct ma_wr_state *wr_mas) wr_mas->content = mas_start(mas); } +/** + * mas_prealloc_calc() - Calculate number of nodes needed for a + * given store oepration + * @mas: The maple state + * @entry: The entry to store into the tree + * + * Return: Number of nodes required for preallocation. 
+ */ +static inline int mas_prealloc_calc(struct ma_state *mas, void *entry) +{ + int ret = mas_mt_height(mas) * 3 + 1; + + switch (mas->store_type) { + case wr_invalid: + WARN_ON_ONCE(1); + break; + case wr_new_root: + ret = 1; + break; + case wr_store_root: + if (likely((mas->last != 0) || (mas->index != 0))) + ret = 1; + else if (((unsigned long) (entry) & 3) == 2) + ret = 1; + else + ret = 0; + break; + case wr_spanning_store: + ret = mas_mt_height(mas) * 3 + 1; + break; + case wr_split_store: + ret = mas_mt_height(mas) * 2 + 1; + break; + case wr_rebalance: + ret = mas_mt_height(mas) * 2 - 1; + break; + case wr_node_store: + ret = mt_in_rcu(mas->tree) ? 1 : 0; + break; + case wr_append: + case wr_exact_fit: + case wr_slot_store: + ret = 0; + } + + return ret; +} + +/* + * mas_wr_store_type() - Set the store type for a given + * store operation. + * @wr_mas: The maple write state + */ +static inline void mas_wr_store_type(struct ma_wr_state *wr_mas) +{ + struct ma_state *mas = wr_mas->mas; + unsigned char new_end; + + if (unlikely(mas_is_none(mas) || mas_is_ptr(mas))) { + mas->store_type = wr_store_root; + return; + } + + if (unlikely(!mas_wr_walk(wr_mas))) { + mas->store_type = wr_spanning_store; + return; + } + + /* At this point, we are at the leaf node that needs to be altered. */ + mas_wr_end_piv(wr_mas); + if (!wr_mas->entry) + mas_wr_extend_null(wr_mas); + + new_end = mas_wr_new_end(wr_mas); + if ((wr_mas->r_min == mas->index) && (wr_mas->r_max == mas->last)) { + mas->store_type = wr_exact_fit; + return; + } + + if (unlikely(!mas->index && mas->last == ULONG_MAX)) { + mas->store_type = wr_new_root; + return; + } + + /* Potential spanning rebalance collapsing a node */ + if (new_end < mt_min_slots[wr_mas->type]) { + if (!mte_is_root(mas->node)) { + mas->store_type = wr_rebalance; + return; + } + mas->store_type = wr_node_store; + return; + } + + if (new_end >= mt_slots[wr_mas->type]) { + mas->store_type = wr_split_store; + return; + } + + if (!mt_in_rcu(mas->tree) && (mas->offset == mas->end)) { + mas->store_type = wr_append; + return; + } + + if ((new_end == mas->end) && (!mt_in_rcu(mas->tree) || + (wr_mas->offset_end - mas->offset == 1))) { + mas->store_type = wr_slot_store; + return; + } + + if (mte_is_root(mas->node) || (new_end >= mt_min_slots[wr_mas->type]) || + (mas->mas_flags & MA_STATE_BULK)) { + mas->store_type = wr_node_store; + return; + } + + mas->store_type = wr_invalid; + MAS_WARN_ON(mas, 1); +} + +/** + * mas_wr_preallocate() - Preallocate enough nodes for a store operation + * @wr_mas: The maple write state + * @entry: The entry that will be stored + * + */ +static inline void mas_wr_preallocate(struct ma_wr_state *wr_mas, void *entry) +{ + struct ma_state *mas = wr_mas->mas; + int request; + + mas_wr_prealloc_setup(wr_mas); + mas_wr_store_type(wr_mas); + request = mas_prealloc_calc(mas, entry); + if (!request) + return; + + mas_node_count(mas, request); +} + /** * mas_insert() - Internal call to insert a value * @mas: The maple state @@ -5508,69 +5648,25 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc); int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp) { MA_WR_STATE(wr_mas, mas, entry); - unsigned char node_size; - int request = 1; - int ret; - - - if (unlikely(!mas->index && mas->last == ULONG_MAX)) - goto ask_now; + int ret = 0; + int request; mas_wr_prealloc_setup(&wr_mas); - /* Root expand */ - if (unlikely(mas_is_none(mas) || mas_is_ptr(mas))) - goto ask_now; + mas_wr_store_type(&wr_mas); + request = mas_prealloc_calc(mas, entry); + if (!request) + 
return ret; - if (unlikely(!mas_wr_walk(&wr_mas))) { - /* Spanning store, use worst case for now */ - request = 1 + mas_mt_height(mas) * 3; - goto ask_now; - } - - /* At this point, we are at the leaf node that needs to be altered. */ - /* Exact fit, no nodes needed. */ - if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last) - return 0; - - mas_wr_end_piv(&wr_mas); - node_size = mas_wr_new_end(&wr_mas); - - /* Slot store, does not require additional nodes */ - if (node_size == mas->end) { - /* reuse node */ - if (!mt_in_rcu(mas->tree)) - return 0; - /* shifting boundary */ - if (wr_mas.offset_end - mas->offset == 1) - return 0; - } - - if (node_size >= mt_slots[wr_mas.type]) { - /* Split, worst case for now. */ - request = 1 + mas_mt_height(mas) * 2; - goto ask_now; - } - - /* New root needs a single node */ - if (unlikely(mte_is_root(mas->node))) - goto ask_now; - - /* Potential spanning rebalance collapsing a node, use worst-case */ - if (node_size - 1 <= mt_min_slots[wr_mas.type]) - request = mas_mt_height(mas) * 2 - 1; - - /* node store, slot store needs one node */ -ask_now: mas_node_count_gfp(mas, request, gfp); - mas->mas_flags |= MA_STATE_PREALLOC; - if (likely(!mas_is_err(mas))) - return 0; + if (mas_is_err(mas)) { + mas_set_alloc_req(mas, 0); + ret = xa_err(mas->node); + mas_destroy(mas); + mas_reset(mas); + return ret; + } - mas_set_alloc_req(mas, 0); - ret = xa_err(mas->node); - mas_reset(mas); - mas_destroy(mas); - mas_reset(mas); + mas->mas_flags |= MA_STATE_PREALLOC; return ret; } EXPORT_SYMBOL_GPL(mas_preallocate); @@ -5596,7 +5692,8 @@ void mas_destroy(struct ma_state *mas) */ if (mas->mas_flags & MA_STATE_REBALANCE) { unsigned char end; - + if (mas_is_err(mas)) + mas_reset(mas); mas_start(mas); mtree_range_walk(mas); end = mas->end + 1; diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c index ef5b83cf94ea..ad42a36231fb 100644 --- a/tools/testing/radix-tree/maple.c +++ b/tools/testing/radix-tree/maple.c @@ -36283,6 +36283,38 @@ static void check_nomem_writer_race(struct maple_tree *mt) mtree_unlock(mt); } + /* test to simulate expanding a vma from [0x7fffffffe000, 0x7ffffffff000) + * to [0x7ffde4ca1000, 0x7ffffffff000) and then shrinking the vma to + * [0x7ffde4ca1000, 0x7ffde4ca2000) + */ +static inline int check_vma_modification(struct maple_tree *mt) +{ + MA_STATE(mas, mt, 0, 0); + + mtree_lock(mt); + /* vma with old start and old end */ + __mas_set_range(&mas, 0x7fffffffe000, 0x7ffffffff000 - 1); + mas_preallocate(&mas, xa_mk_value(1), GFP_KERNEL); + mas_store_prealloc(&mas, xa_mk_value(1)); + + /* next write occurs partly in previous range [0, 0x7fffffffe000)*/ + mas_prev_range(&mas, 0); + /* expand vma to {0x7ffde4ca1000, 0x7ffffffff000) */ + __mas_set_range(&mas, 0x7ffde4ca1000, 0x7ffffffff000 - 1); + mas_preallocate(&mas, xa_mk_value(1), GFP_KERNEL); + mas_store_prealloc(&mas, xa_mk_value(1)); + + /* shrink vma to [0x7ffde4ca1000, 7ffde4ca2000) */ + __mas_set_range(&mas, 0x7ffde4ca2000, 0x7ffffffff000 - 1); + mas_preallocate(&mas, NULL, GFP_KERNEL); + mas_store_prealloc(&mas, NULL); + mt_dump(mt, mt_dump_hex); + + mas_destroy(&mas); + mtree_unlock(mt); + return 0; +} + void farmer_tests(void) { struct maple_node *node; @@ -36290,6 +36322,10 @@ void farmer_tests(void) mt_dump(&tree, mt_dump_dec); + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN | MT_FLAGS_USE_RCU); + check_vma_modification(&tree); + mtree_destroy(&tree); + tree.ma_root = xa_mk_value(0); mt_dump(&tree, mt_dump_dec); From 
3cd9e92e009de9a036cbf9ec27bad33367fca6f3 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:32 -0400 Subject: [PATCH 126/416] maple_tree: remove mas_destroy() from mas_nomem() Separate call to mas_destroy() from mas_nomem() so we can check for no memory errors without destroying the current maple state in mas_store_gfp(). We then add calls to mas_destroy() to callers of mas_nomem(). Link: https://lkml.kernel.org/r/20240814161944.55347-6-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Reviewed-by: Liam R. Howlett Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 39 ++++++++++++++++++++------------ tools/testing/radix-tree/maple.c | 10 ++++---- 2 files changed, 31 insertions(+), 18 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index d0b9b3795b96..58985107cf00 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -4519,6 +4519,7 @@ int mas_alloc_cyclic(struct ma_state *mas, unsigned long *startp, if (*next == 0) mas->tree->ma_flags |= MT_FLAGS_ALLOC_WRAPPED; + mas_destroy(mas); return ret; } EXPORT_SYMBOL(mas_alloc_cyclic); @@ -5601,21 +5602,25 @@ int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp) unsigned long index = mas->index; unsigned long last = mas->last; MA_WR_STATE(wr_mas, mas, entry); + int ret = 0; - mas_wr_store_setup(&wr_mas); - trace_ma_write(__func__, mas, 0, entry); retry: - mas_wr_store_entry(&wr_mas); + mas_wr_preallocate(&wr_mas, entry); if (unlikely(mas_nomem(mas, gfp))) { if (!entry) __mas_set_range(mas, index, last); goto retry; } - if (unlikely(mas_is_err(mas))) - return xa_err(mas->node); + if (mas_is_err(mas)) { + ret = xa_err(mas->node); + goto out; + } - return 0; + mas_wr_store_entry(&wr_mas); +out: + mas_destroy(mas); + return ret; } EXPORT_SYMBOL_GPL(mas_store_gfp); @@ -6374,6 +6379,7 @@ write_retry: goto write_retry; } + mas_destroy(mas); return entry; } EXPORT_SYMBOL_GPL(mas_erase); @@ -6388,10 +6394,8 @@ EXPORT_SYMBOL_GPL(mas_erase); bool mas_nomem(struct ma_state *mas, gfp_t gfp) __must_hold(mas->tree->ma_lock) { - if (likely(mas->node != MA_ERROR(-ENOMEM))) { - mas_destroy(mas); + if (likely(mas->node != MA_ERROR(-ENOMEM))) return false; - } if (gfpflags_allow_blocking(gfp) && !mt_external_lock(mas->tree)) { mtree_unlock(mas->tree); @@ -6469,6 +6473,7 @@ int mtree_store_range(struct maple_tree *mt, unsigned long index, { MA_STATE(mas, mt, index, last); MA_WR_STATE(wr_mas, &mas, entry); + int ret = 0; trace_ma_write(__func__, &mas, 0, entry); if (WARN_ON_ONCE(xa_is_advanced(entry))) @@ -6484,10 +6489,12 @@ retry: goto retry; mtree_unlock(mt); - if (mas_is_err(&mas)) - return xa_err(mas.node); - return 0; + if (mas_is_err(&mas)) + ret = xa_err(mas.node); + + mas_destroy(&mas); + return ret; } EXPORT_SYMBOL(mtree_store_range); @@ -6523,6 +6530,7 @@ int mtree_insert_range(struct maple_tree *mt, unsigned long first, unsigned long last, void *entry, gfp_t gfp) { MA_STATE(ms, mt, first, last); + int ret = 0; if (WARN_ON_ONCE(xa_is_advanced(entry))) return -EINVAL; @@ -6538,9 +6546,10 @@ retry: mtree_unlock(mt); if (mas_is_err(&ms)) - return xa_err(ms.node); + ret = xa_err(ms.node); - return 0; + mas_destroy(&ms); + return ret; } EXPORT_SYMBOL(mtree_insert_range); @@ -6595,6 +6604,7 @@ retry: unlock: mtree_unlock(mt); + mas_destroy(&mas); return ret; } EXPORT_SYMBOL(mtree_alloc_range); @@ -6676,6 +6686,7 @@ retry: unlock: mtree_unlock(mt); + mas_destroy(&mas); return ret; } EXPORT_SYMBOL(mtree_alloc_rrange); diff --git 
a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c index ad42a36231fb..c5b00aca9def 100644 --- a/tools/testing/radix-tree/maple.c +++ b/tools/testing/radix-tree/maple.c @@ -120,7 +120,7 @@ static noinline void __init check_new_node(struct maple_tree *mt) MT_BUG_ON(mt, mas.alloc->slot[0] == NULL); mas_push_node(&mas, mn); mas_reset(&mas); - mas_nomem(&mas, GFP_KERNEL); /* free */ + mas_destroy(&mas); mtree_unlock(mt); @@ -144,7 +144,7 @@ static noinline void __init check_new_node(struct maple_tree *mt) mn->parent = ma_parent_ptr(mn); ma_free_rcu(mn); mas.status = ma_start; - mas_nomem(&mas, GFP_KERNEL); + mas_destroy(&mas); /* Allocate 3 nodes, will fail. */ mas_node_count(&mas, 3); /* Drop the lock and allocate 3 nodes. */ @@ -161,7 +161,7 @@ static noinline void __init check_new_node(struct maple_tree *mt) MT_BUG_ON(mt, mas_allocated(&mas) != 3); /* Free. */ mas_reset(&mas); - mas_nomem(&mas, GFP_KERNEL); + mas_destroy(&mas); /* Set allocation request to 1. */ mas_set_alloc_req(&mas, 1); @@ -277,6 +277,7 @@ static noinline void __init check_new_node(struct maple_tree *mt) } mas_reset(&mas); MT_BUG_ON(mt, mas_nomem(&mas, GFP_KERNEL)); + mas_destroy(&mas); } @@ -299,7 +300,7 @@ static noinline void __init check_new_node(struct maple_tree *mt) } MT_BUG_ON(mt, mas_allocated(&mas) != total); mas_reset(&mas); - mas_nomem(&mas, GFP_KERNEL); /* Free. */ + mas_destroy(&mas); /* Free. */ MT_BUG_ON(mt, mas_allocated(&mas) != 0); for (i = 1; i < 128; i++) { @@ -35847,6 +35848,7 @@ static noinline void __init check_nomem(struct maple_tree *mt) mas_store(&ms, &ms); /* insert 1 -> &ms */ mas_nomem(&ms, GFP_KERNEL); /* Node allocated in here. */ mtree_unlock(mt); + mas_destroy(&ms); mtree_destroy(mt); } From 7e093834ed8c71361bae556f2c1efc2bf2e33971 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:33 -0400 Subject: [PATCH 127/416] maple_tree: preallocate nodes in mas_erase() Use mas_wr_preallocate() in mas_erase() to preallocate enough nodes to complete the erase. Add error handling by skipping the store if the preallocation lead to some error besides no memory. Link: https://lkml.kernel.org/r/20240814161944.55347-7-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Cc: Liam R. Howlett Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 58985107cf00..8ba52ffa778e 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -6371,14 +6371,18 @@ write_retry: /* Must reset to ensure spanning writes of last slot are detected */ mas_reset(mas); - mas_wr_store_setup(&wr_mas); - mas_wr_store_entry(&wr_mas); + mas_wr_preallocate(&wr_mas, NULL); if (mas_nomem(mas, GFP_KERNEL)) { /* in case the range of entry changed when unlocked */ mas->index = mas->last = index; goto write_retry; } + if (mas_is_err(mas)) + goto out; + + mas_wr_store_entry(&wr_mas); +out: mas_destroy(mas); return entry; } From 85db8f241707778b9755618f350915616d1d861c Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:34 -0400 Subject: [PATCH 128/416] maple_tree: use mas_store_gfp() in mtree_store_range() Refactor mtree_store_range() to use mas_store_gfp() which will abstract the store, memory allocation, and error handling. Link: https://lkml.kernel.org/r/20240814161944.55347-8-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Reviewed-by: Liam R. 
Howlett Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 11 +---------- 1 file changed, 1 insertion(+), 10 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 8ba52ffa778e..e01e05be6301 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -6476,7 +6476,6 @@ int mtree_store_range(struct maple_tree *mt, unsigned long index, unsigned long last, void *entry, gfp_t gfp) { MA_STATE(mas, mt, index, last); - MA_WR_STATE(wr_mas, &mas, entry); int ret = 0; trace_ma_write(__func__, &mas, 0, entry); @@ -6487,17 +6486,9 @@ int mtree_store_range(struct maple_tree *mt, unsigned long index, return -EINVAL; mtree_lock(mt); -retry: - mas_wr_store_entry(&wr_mas); - if (mas_nomem(&mas, gfp)) - goto retry; - + ret = mas_store_gfp(&mas, entry, gfp); mtree_unlock(mt); - if (mas_is_err(&mas)) - ret = xa_err(mas.node); - - mas_destroy(&mas); return ret; } EXPORT_SYMBOL(mtree_store_range); From 23e217a848b3b9e1b0cb8c41d731d7968bcc77a5 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:35 -0400 Subject: [PATCH 129/416] maple_tree: print store type in mas_dump() Knowing the store type of the maple state could be helpful for debugging. Have mas_dump() print mas->store_type. Link: https://lkml.kernel.org/r/20240814161944.55347-9-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Reviewed-by: Liam R. Howlett Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 34 ++++++++++++++++++++++++++++++++++ 1 file changed, 34 insertions(+) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index e01e05be6301..a1689fc6227b 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -7760,6 +7760,40 @@ void mas_dump(const struct ma_state *mas) break; } + pr_err("Store Type: "); + switch (mas->store_type) { + case wr_invalid: + pr_err("invalid store type\n"); + break; + case wr_new_root: + pr_err("new_root\n"); + break; + case wr_store_root: + pr_err("store_root\n"); + break; + case wr_exact_fit: + pr_err("exact_fit\n"); + break; + case wr_split_store: + pr_err("split_store\n"); + break; + case wr_slot_store: + pr_err("slot_store\n"); + break; + case wr_append: + pr_err("append\n"); + break; + case wr_node_store: + pr_err("node_store\n"); + break; + case wr_spanning_store: + pr_err("spanning_store\n"); + break; + case wr_rebalance: + pr_err("rebalance\n"); + break; + } + pr_err("[%u/%u] index=%lx last=%lx\n", mas->offset, mas->end, mas->index, mas->last); pr_err(" min=%lx max=%lx alloc=%p, depth=%u, flags=%x\n", From 580fcbd67ce2cdf9d70c7a8eda0789b9eba1c3af Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:36 -0400 Subject: [PATCH 130/416] maple_tree: use store type in mas_wr_store_entry() When storing an entry, we can read the store type that was set from a previous partial walk of the tree. Now that the type of store is known, select the correct write helper function to use to complete the store. Also noinline mas_wr_spanning_store() to limit stack frame usage in mas_wr_store_entry() as it allocates a maple_big_node on the stack. Link: https://lkml.kernel.org/r/20240814161944.55347-10-sidhartha.kumar@oracle.com Reviewed-by: Liam R. 
Howlett Signed-off-by: Sidhartha Kumar Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 94 +++++++++++++++++++++++++++--------------------- 1 file changed, 54 insertions(+), 40 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index a1689fc6227b..2242e07a46dc 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -3780,7 +3780,7 @@ done: * * Return: 0 on error, positive on success. */ -static inline int mas_wr_spanning_store(struct ma_wr_state *wr_mas) +static noinline int mas_wr_spanning_store(struct ma_wr_state *wr_mas) { struct maple_subtree_state mast; struct maple_big_node b_node; @@ -4206,43 +4206,62 @@ slow_path: static inline void mas_wr_store_entry(struct ma_wr_state *wr_mas) { struct ma_state *mas = wr_mas->mas; + unsigned char new_end = mas_wr_new_end(wr_mas); - wr_mas->content = mas_start(mas); - if (mas_is_none(mas) || mas_is_ptr(mas)) { - mas_store_root(mas, wr_mas->entry); + switch (mas->store_type) { + case wr_invalid: + MT_BUG_ON(mas->tree, 1); return; - } - - if (unlikely(!mas_wr_walk(wr_mas))) { - mas_wr_spanning_store(wr_mas); - return; - } - - /* At this point, we are at the leaf node that needs to be altered. */ - mas_wr_end_piv(wr_mas); - /* New root for a single pointer */ - if (unlikely(!mas->index && mas->last == ULONG_MAX)) + case wr_new_root: mas_new_root(mas, wr_mas->entry); - else - mas_wr_modify(wr_mas); + break; + case wr_store_root: + mas_store_root(mas, wr_mas->entry); + break; + case wr_exact_fit: + rcu_assign_pointer(wr_mas->slots[mas->offset], wr_mas->entry); + if (!!wr_mas->entry ^ !!wr_mas->content) + mas_update_gap(mas); + break; + case wr_append: + mas_wr_append(wr_mas, new_end); + break; + case wr_slot_store: + mas_wr_slot_store(wr_mas); + break; + case wr_node_store: + mas_wr_node_store(wr_mas, new_end); + break; + case wr_spanning_store: + mas_wr_spanning_store(wr_mas); + break; + case wr_split_store: + case wr_rebalance: + mas_wr_bnode(wr_mas); + break; + } + + return; } -static void mas_wr_store_setup(struct ma_wr_state *wr_mas) +static inline void mas_wr_prealloc_setup(struct ma_wr_state *wr_mas) { - if (!mas_is_active(wr_mas->mas)) { - if (mas_is_start(wr_mas->mas)) - return; + struct ma_state *mas = wr_mas->mas; - if (unlikely(mas_is_paused(wr_mas->mas))) + if (!mas_is_active(mas)) { + if (mas_is_start(mas)) + goto set_content; + + if (unlikely(mas_is_paused(mas))) goto reset; - if (unlikely(mas_is_none(wr_mas->mas))) + if (unlikely(mas_is_none(mas))) goto reset; - if (unlikely(mas_is_overflow(wr_mas->mas))) + if (unlikely(mas_is_overflow(mas))) goto reset; - if (unlikely(mas_is_underflow(wr_mas->mas))) + if (unlikely(mas_is_underflow(mas))) goto reset; } @@ -4251,27 +4270,20 @@ static void mas_wr_store_setup(struct ma_wr_state *wr_mas) * writes within this node. This is to stop partial walks in * mas_prealloc() from being reset. 
*/ - if (wr_mas->mas->last > wr_mas->mas->max) + if (mas->last > mas->max) goto reset; if (wr_mas->entry) - return; + goto set_content; - if (mte_is_leaf(wr_mas->mas->node) && - wr_mas->mas->last == wr_mas->mas->max) + if (mte_is_leaf(mas->node) && mas->last == mas->max) goto reset; - return; + goto set_content; reset: - mas_reset(wr_mas->mas); -} - -static inline void mas_wr_prealloc_setup(struct ma_wr_state *wr_mas) -{ - struct ma_state *mas = wr_mas->mas; - - mas_wr_store_setup(wr_mas); + mas_reset(mas); +set_content: wr_mas->content = mas_start(mas); } @@ -5582,7 +5594,8 @@ void *mas_store(struct ma_state *mas, void *entry) * want to examine what happens if a single store operation was to * overwrite multiple entries within a self-balancing B-Tree. */ - mas_wr_store_setup(&wr_mas); + mas_wr_prealloc_setup(&wr_mas); + mas_wr_store_type(&wr_mas); mas_wr_store_entry(&wr_mas); return wr_mas.content; } @@ -5634,7 +5647,8 @@ void mas_store_prealloc(struct ma_state *mas, void *entry) { MA_WR_STATE(wr_mas, mas, entry); - mas_wr_store_setup(&wr_mas); + mas_wr_prealloc_setup(&wr_mas); + mas_wr_store_type(&wr_mas); trace_ma_write(__func__, mas, 0, entry); mas_wr_store_entry(&wr_mas); MAS_WR_BUG_ON(&wr_mas, mas_is_err(mas)); From 1fd7c4f3228e9775c97b25a6e5e2420df8cf0e76 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:37 -0400 Subject: [PATCH 131/416] maple_tree: convert mas_insert() to preallocate nodes By setting the store type in mas_insert(), we no longer need to use mas_wr_modify() to determine the correct store function to use. Instead, set the store type and call mas_wr_store_entry(). Also, pass in the requested gfp flags to mas_insert() so they can be passed to the call to mas_wr_preallocate(). Link: https://lkml.kernel.org/r/20240814161944.55347-11-sidhartha.kumar@oracle.com Reviewed-by: Liam R. Howlett Signed-off-by: Sidhartha Kumar Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 22 ++++++++++------------ 1 file changed, 10 insertions(+), 12 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 2242e07a46dc..bb940f61d713 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -4457,26 +4457,24 @@ static inline void *mas_insert(struct ma_state *mas, void *entry) if (wr_mas.content) goto exists; - if (mas_is_none(mas) || mas_is_ptr(mas)) { - mas_store_root(mas, entry); + mas_wr_preallocate(&wr_mas, entry); + if (mas_is_err(mas)) return NULL; - } /* spanning writes always overwrite something */ - if (!mas_wr_walk(&wr_mas)) + if (mas->store_type == wr_spanning_store) goto exists; /* At this point, we are at the leaf node that needs to be altered. */ - wr_mas.offset_end = mas->offset; - wr_mas.end_piv = wr_mas.r_max; + if (mas->store_type != wr_new_root && mas->store_type != wr_store_root) { + wr_mas.offset_end = mas->offset; + wr_mas.end_piv = wr_mas.r_max; - if (wr_mas.content || (mas->last > wr_mas.r_max)) - goto exists; + if (wr_mas.content || (mas->last > wr_mas.r_max)) + goto exists; + } - if (!entry) - return NULL; - - mas_wr_modify(&wr_mas); + mas_wr_store_entry(&wr_mas); return wr_mas.content; exists: From 62c7b2b9842c541a09e1cf824ed738ee30069e6f Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:38 -0400 Subject: [PATCH 132/416] maple_tree: simplify mas_commit_b_node() The only callers of mas_commit_b_node() are those with store type of wr_rebalance and wr_split_store. Use mas->store_type to dispatch to the correct helper function. 
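As a rough sketch (the exact form is in the hunk below), the commit step reduces to a two-way dispatch on the precomputed store type:

	static noinline_for_kasan int mas_commit_b_node(struct ma_wr_state *wr_mas,
			struct maple_big_node *b_node)
	{
		enum store_type type = wr_mas->mas->store_type;

		/* Only split and rebalance stores reach the big-node commit path. */
		WARN_ON_ONCE(type != wr_rebalance && type != wr_split_store);

		if (type == wr_rebalance)
			return mas_rebalance(wr_mas->mas, b_node);

		return mas_split(wr_mas->mas, b_node);
	}
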
This allows the removal of mas_reuse_node() as it is no longer used. Link: https://lkml.kernel.org/r/20240814161944.55347-12-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Cc: Liam R. Howlett Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 64 ++++++------------------------------------------ 1 file changed, 7 insertions(+), 57 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index bb940f61d713..0314e9b52621 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -3395,72 +3395,22 @@ static int mas_split(struct ma_state *mas, struct maple_big_node *b_node) return 1; } -/* - * mas_reuse_node() - Reuse the node to store the data. - * @wr_mas: The maple write state - * @bn: The maple big node - * @end: The end of the data. - * - * Will always return false in RCU mode. - * - * Return: True if node was reused, false otherwise. - */ -static inline bool mas_reuse_node(struct ma_wr_state *wr_mas, - struct maple_big_node *bn, unsigned char end) -{ - /* Need to be rcu safe. */ - if (mt_in_rcu(wr_mas->mas->tree)) - return false; - - if (end > bn->b_end) { - int clear = mt_slots[wr_mas->type] - bn->b_end; - - memset(wr_mas->slots + bn->b_end, 0, sizeof(void *) * clear--); - memset(wr_mas->pivots + bn->b_end, 0, sizeof(void *) * clear); - } - mab_mas_cp(bn, 0, bn->b_end, wr_mas->mas, false); - return true; -} - /* * mas_commit_b_node() - Commit the big node into the tree. * @wr_mas: The maple write state * @b_node: The maple big node - * @end: The end of the data. */ static noinline_for_kasan int mas_commit_b_node(struct ma_wr_state *wr_mas, - struct maple_big_node *b_node, unsigned char end) + struct maple_big_node *b_node) { - struct maple_node *node; - struct maple_enode *old_enode; - unsigned char b_end = b_node->b_end; - enum maple_type b_type = b_node->type; + enum store_type type = wr_mas->mas->store_type; - old_enode = wr_mas->mas->node; - if ((b_end < mt_min_slots[b_type]) && - (!mte_is_root(old_enode)) && - (mas_mt_height(wr_mas->mas) > 1)) + WARN_ON_ONCE(type != wr_rebalance && type != wr_split_store); + + if (type == wr_rebalance) return mas_rebalance(wr_mas->mas, b_node); - if (b_end >= mt_slots[b_type]) - return mas_split(wr_mas->mas, b_node); - - if (mas_reuse_node(wr_mas, b_node, end)) - goto reuse_node; - - mas_node_count(wr_mas->mas, 1); - if (mas_is_err(wr_mas->mas)) - return 0; - - node = mas_pop_node(wr_mas->mas); - node->parent = mas_mn(wr_mas->mas)->parent; - wr_mas->mas->node = mt_mk_node(node, b_type); - mab_mas_cp(b_node, 0, b_end, wr_mas->mas, false); - mas_replace_node(wr_mas->mas, old_enode); -reuse_node: - mas_update_gap(wr_mas->mas); - wr_mas->mas->end = b_end; - return 1; + return mas_split(wr_mas->mas, b_node); } /* @@ -4155,7 +4105,7 @@ static void mas_wr_bnode(struct ma_wr_state *wr_mas) trace_ma_write(__func__, wr_mas->mas, 0, wr_mas->entry); memset(&b_node, 0, sizeof(struct maple_big_node)); mas_store_b_node(wr_mas, &b_node, wr_mas->offset_end); - mas_commit_b_node(wr_mas, &b_node, wr_mas->mas->end); + mas_commit_b_node(wr_mas, &b_node); } static inline void mas_wr_modify(struct ma_wr_state *wr_mas) From 7987d027799cf3a300ace1a22c5e61da878b759a Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:39 -0400 Subject: [PATCH 133/416] maple_tree: remove mas_wr_modify() There are no more users of the function, safely remove it. Link: https://lkml.kernel.org/r/20240814161944.55347-13-sidhartha.kumar@oracle.com Reviewed-by: Liam R. 
Howlett Signed-off-by: Sidhartha Kumar Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 38 -------------------------------------- 1 file changed, 38 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 0314e9b52621..6640ca775808 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -4108,44 +4108,6 @@ static void mas_wr_bnode(struct ma_wr_state *wr_mas) mas_commit_b_node(wr_mas, &b_node); } -static inline void mas_wr_modify(struct ma_wr_state *wr_mas) -{ - struct ma_state *mas = wr_mas->mas; - unsigned char new_end; - - /* Direct replacement */ - if (wr_mas->r_min == mas->index && wr_mas->r_max == mas->last) { - rcu_assign_pointer(wr_mas->slots[mas->offset], wr_mas->entry); - if (!!wr_mas->entry ^ !!wr_mas->content) - mas_update_gap(mas); - return; - } - - /* - * new_end exceeds the size of the maple node and cannot enter the fast - * path. - */ - new_end = mas_wr_new_end(wr_mas); - if (new_end >= mt_slots[wr_mas->type]) - goto slow_path; - - /* Attempt to append */ - if (mas_wr_append(wr_mas, new_end)) - return; - - if (new_end == mas->end && mas_wr_slot_store(wr_mas)) - return; - - if (mas_wr_node_store(wr_mas, new_end)) - return; - - if (mas_is_err(mas)) - return; - -slow_path: - mas_wr_bnode(wr_mas); -} - /* * mas_wr_store_entry() - Internal call to store a value * @mas: The maple state From 4037d44f548fe1f9ca4ad002c39a0eb84d79de60 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:40 -0400 Subject: [PATCH 134/416] maple_tree: have mas_store() allocate nodes if needed Not all users of mas_store() enter with nodes already preallocated. Check for the MA_STATE_PREALLOC flag to decide whether to preallocate nodes within mas_store() rather than relying on future write helper functions to perform the allocations. This allows the write helper functions to be simplified as they do not have to do checks to make sure there are enough allocated nodes to perform the write. Link: https://lkml.kernel.org/r/20240814161944.55347-14-sidhartha.kumar@oracle.com Reviewed-by: Liam R. Howlett Signed-off-by: Sidhartha Kumar Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 6640ca775808..d5e020dd93fa 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -5477,13 +5477,12 @@ static inline void mte_destroy_walk(struct maple_enode *enode, * @entry: The entry to store. * * The @mas->index and @mas->last is used to set the range for the @entry. - * Note: The @mas should have pre-allocated entries to ensure there is memory to - * store the entry. Please see mas_expected_entries()/mas_destroy() for more details. * * Return: the first entry between mas->index and mas->last or %NULL. 
*/ void *mas_store(struct ma_state *mas, void *entry) { + int request; MA_WR_STATE(wr_mas, mas, entry); trace_ma_write(__func__, mas, 0, entry); @@ -5506,7 +5505,23 @@ void *mas_store(struct ma_state *mas, void *entry) */ mas_wr_prealloc_setup(&wr_mas); mas_wr_store_type(&wr_mas); + if (mas->mas_flags & MA_STATE_PREALLOC) { + mas_wr_store_entry(&wr_mas); + MAS_WR_BUG_ON(&wr_mas, mas_is_err(mas)); + return wr_mas.content; + } + + request = mas_prealloc_calc(mas, entry); + if (!request) + goto store; + + mas_node_count(mas, request); + if (mas_is_err(mas)) + return NULL; + +store: mas_wr_store_entry(&wr_mas); + mas_destroy(mas); return wr_mas.content; } EXPORT_SYMBOL_GPL(mas_store); From 9155e8433498cc8be2fdfaf81a31393d3e40781c Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:41 -0400 Subject: [PATCH 135/416] maple_tree: remove node allocations from various write helper functions These write helper functions are all called from store paths which preallocate enough nodes that will be needed for the write. There is no more need to allocate within the functions themselves. Link: https://lkml.kernel.org/r/20240814161944.55347-15-sidhartha.kumar@oracle.com Reviewed-by: Liam R. Howlett Signed-off-by: Sidhartha Kumar Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 27 --------------------------- 1 file changed, 27 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index d5e020dd93fa..5f79be184377 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -2976,9 +2976,6 @@ static inline int mas_rebalance(struct ma_state *mas, * tries to combine the data in the same way. If one node contains the * entire range of the tree, then that node is used as a new root node. */ - mas_node_count(mas, empty_count * 2 - 1); - if (mas_is_err(mas)) - return 0; mast.orig_l = &l_mas; mast.orig_r = &r_mas; @@ -3029,11 +3026,6 @@ static inline void mas_destroy_rebalance(struct ma_state *mas, unsigned char end /* set up node. */ if (in_rcu) { - /* Allocate for both left and right as well as parent. */ - mas_node_count(mas, 3); - if (mas_is_err(mas)) - return; - newnode = mas_pop_node(mas); } else { newnode = &reuse; @@ -3341,10 +3333,6 @@ static int mas_split(struct ma_state *mas, struct maple_big_node *b_node) trace_ma_op(__func__, mas); mas->depth = mas_mt_height(mas); - /* Allocation failures will happen early. */ - mas_node_count(mas, 1 + mas->depth * 2); - if (mas_is_err(mas)) - return 0; mast.l = &l_mas; mast.r = &r_mas; @@ -3427,10 +3415,6 @@ static inline int mas_root_expand(struct ma_state *mas, void *entry) unsigned long *pivots; int slot = 0; - mas_node_count(mas, 1); - if (unlikely(mas_is_err(mas))) - return 0; - node = mas_pop_node(mas); pivots = ma_pivots(node, type); slots = ma_slots(node, type); @@ -3699,10 +3683,6 @@ static inline int mas_new_root(struct ma_state *mas, void *entry) goto done; } - mas_node_count(mas, 1); - if (mas_is_err(mas)) - return 0; - node = mas_pop_node(mas); pivots = ma_pivots(node, type); slots = ma_slots(node, type); @@ -3765,9 +3745,6 @@ static noinline int mas_wr_spanning_store(struct ma_wr_state *wr_mas) * entries per level plus a new root. */ height = mas_mt_height(mas); - mas_node_count(mas, 1 + height * 3); - if (mas_is_err(mas)) - return 0; /* * Set up right side. Need to get to the next offset after the spanning @@ -3851,10 +3828,6 @@ static inline bool mas_wr_node_store(struct ma_wr_state *wr_mas, /* set up node. 
*/ if (in_rcu) { - mas_node_count(mas, 1); - if (mas_is_err(mas)) - return false; - newnode = mas_pop_node(mas); } else { memset(&reuse, 0, sizeof(struct maple_node)); From add60ea5f6d84aa73171d4a52b02f565b0b80a90 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:42 -0400 Subject: [PATCH 136/416] maple_tree: remove repeated sanity checks from write helper functions These sanity checks are now redundant as they are already checked in mas_wr_store_type(). We can remove them from mas_wr_append() and mas_wr_node_store(). Link: https://lkml.kernel.org/r/20240814161944.55347-16-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Cc: Liam R. Howlett Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 17 ++--------------- 1 file changed, 2 insertions(+), 15 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 5f79be184377..8c1a1a483395 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -3816,11 +3816,6 @@ static inline bool mas_wr_node_store(struct ma_wr_state *wr_mas, unsigned char copy_size, node_pivots = mt_pivots[wr_mas->type]; bool in_rcu = mt_in_rcu(mas->tree); - /* Check if there is enough data. The room is enough. */ - if (!mte_is_root(mas->node) && (new_end < mt_min_slots[wr_mas->type]) && - !(mas->mas_flags & MA_STATE_BULK)) - return false; - if (mas->last == wr_mas->end_piv) offset_end++; /* don't copy this offset */ else if (unlikely(wr_mas->r_max == ULONG_MAX)) @@ -4018,17 +4013,9 @@ static inline unsigned char mas_wr_new_end(struct ma_wr_state *wr_mas) static inline bool mas_wr_append(struct ma_wr_state *wr_mas, unsigned char new_end) { - struct ma_state *mas; + struct ma_state *mas = wr_mas->mas; void __rcu **slots; - unsigned char end; - - mas = wr_mas->mas; - if (mt_in_rcu(mas->tree)) - return false; - - end = mas->end; - if (mas->offset != end) - return false; + unsigned char end = mas->end; if (new_end < mt_pivots[wr_mas->type]) { wr_mas->pivots[new_end] = wr_mas->pivots[end]; From c27e6183c654c729bfb7248a9fa938ce74def4be Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:43 -0400 Subject: [PATCH 137/416] maple_tree: remove unneeded mas_wr_walk() in mas_store_prealloc() Users of mas_store_prealloc() enter this function with nodes already preallocated. This means the store type must be already set. We can then remove the call to mas_wr_store_type() and initialize the write state to continue the partial walk that was done when determining the store type. Link: https://lkml.kernel.org/r/20240814161944.55347-17-sidhartha.kumar@oracle.com Reviewed-by: Liam R. 
Howlett Signed-off-by: Sidhartha Kumar Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 8c1a1a483395..73ce63d9c3a0 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -3979,9 +3979,6 @@ static inline void mas_wr_end_piv(struct ma_wr_state *wr_mas) wr_mas->end_piv = wr_mas->pivots[wr_mas->offset_end]; else wr_mas->end_piv = wr_mas->mas->max; - - if (!wr_mas->entry) - mas_wr_extend_null(wr_mas); } static inline unsigned char mas_wr_new_end(struct ma_wr_state *wr_mas) @@ -5532,8 +5529,19 @@ void mas_store_prealloc(struct ma_state *mas, void *entry) { MA_WR_STATE(wr_mas, mas, entry); - mas_wr_prealloc_setup(&wr_mas); - mas_wr_store_type(&wr_mas); + if (mas->store_type == wr_store_root) { + mas_wr_prealloc_setup(&wr_mas); + goto store; + } + + mas_wr_walk_descend(&wr_mas); + if (mas->store_type != wr_spanning_store) { + /* set wr_mas->content to current slot */ + wr_mas.content = mas_slot_locked(mas, wr_mas.slots, mas->offset); + mas_wr_end_piv(&wr_mas); + } + +store: trace_ma_write(__func__, mas, 0, entry); mas_wr_store_entry(&wr_mas); MAS_WR_BUG_ON(&wr_mas, mas_is_err(mas)); From ed4dfd9aa1b1eb18f5bd3a07e7002da41ce20817 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 14 Aug 2024 12:19:44 -0400 Subject: [PATCH 138/416] maple_tree: make write helper functions void The return value of various write helper functions are not checked. We can safely change the return type of these functions to be void. Link: https://lkml.kernel.org/r/20240814161944.55347-18-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Cc: Liam R. Howlett Cc: Matthew Wilcox (Oracle) Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- lib/maple_tree.c | 47 ++++++++++++++++------------------------------- 1 file changed, 16 insertions(+), 31 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 73ce63d9c3a0..755ba8b18e14 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -2823,10 +2823,8 @@ dead_node: * orig_l_mas->last is used in mas_consume to find the slots that will need to * be either freed or destroyed. orig_l_mas->depth keeps track of the height of * the new sub-tree in case the sub-tree becomes the full tree. - * - * Return: the number of elements in b_node during the last loop. */ -static int mas_spanning_rebalance(struct ma_state *mas, +static void mas_spanning_rebalance(struct ma_state *mas, struct maple_subtree_state *mast, unsigned char count) { unsigned char split, mid_split; @@ -2942,7 +2940,7 @@ new_root: mas->offset = l_mas.offset; mas_wmb_replace(mas, old_enode); mtree_range_walk(mas); - return mast->bn->b_end; + return; } /* @@ -2952,10 +2950,8 @@ new_root: * * Rebalance two nodes into a single node or two new nodes that are sufficient. * Continue upwards until tree is sufficient. - * - * Return: the number of elements in b_node during the last loop. */ -static inline int mas_rebalance(struct ma_state *mas, +static inline void mas_rebalance(struct ma_state *mas, struct maple_big_node *b_node) { char empty_count = mas_mt_height(mas); @@ -3300,9 +3296,8 @@ static inline bool mas_push_data(struct ma_state *mas, int height, * mas_split() - Split data that is too big for one node into two. * @mas: The maple state * @b_node: The maple big node - * Return: 1 on success, 0 on failure. 
*/ -static int mas_split(struct ma_state *mas, struct maple_big_node *b_node) +static void mas_split(struct ma_state *mas, struct maple_big_node *b_node) { struct maple_subtree_state mast; int height = 0; @@ -3380,7 +3375,7 @@ static int mas_split(struct ma_state *mas, struct maple_big_node *b_node) mas->node = l_mas.node; mas_wmb_replace(mas, old); mtree_range_walk(mas); - return 1; + return; } /* @@ -3388,7 +3383,7 @@ static int mas_split(struct ma_state *mas, struct maple_big_node *b_node) * @wr_mas: The maple write state * @b_node: The maple big node */ -static noinline_for_kasan int mas_commit_b_node(struct ma_wr_state *wr_mas, +static noinline_for_kasan void mas_commit_b_node(struct ma_wr_state *wr_mas, struct maple_big_node *b_node) { enum store_type type = wr_mas->mas->store_type; @@ -3664,10 +3659,8 @@ static void mte_destroy_walk(struct maple_enode *, struct maple_tree *); * @entry: The entry to store. * * Only valid when the index == 0 and the last == ULONG_MAX - * - * Return 0 on error, 1 on success. */ -static inline int mas_new_root(struct ma_state *mas, void *entry) +static inline void mas_new_root(struct ma_state *mas, void *entry) { struct maple_enode *root = mas_root_locked(mas); enum maple_type type = maple_leaf_64; @@ -3699,7 +3692,7 @@ done: if (xa_is_node(root)) mte_destroy_walk(root, mas->tree); - return 1; + return; } /* * mas_wr_spanning_store() - Create a subtree with the store operation completed @@ -3707,10 +3700,8 @@ done: * Note that mas is expected to point to the node which caused the store to * span. * @wr_mas: The maple write state - * - * Return: 0 on error, positive on success. */ -static noinline int mas_wr_spanning_store(struct ma_wr_state *wr_mas) +static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas) { struct maple_subtree_state mast; struct maple_big_node b_node; @@ -3802,10 +3793,8 @@ static noinline int mas_wr_spanning_store(struct ma_wr_state *wr_mas) * @wr_mas: The maple write state * * Attempts to reuse the node, but may allocate. - * - * Return: True if stored, false otherwise */ -static inline bool mas_wr_node_store(struct ma_wr_state *wr_mas, +static inline void mas_wr_node_store(struct ma_wr_state *wr_mas, unsigned char new_end) { struct ma_state *mas = wr_mas->mas; @@ -3878,16 +3867,14 @@ done: trace_ma_write(__func__, mas, 0, wr_mas->entry); mas_update_gap(mas); mas->end = new_end; - return true; + return; } /* * mas_wr_slot_store: Attempt to store a value in a slot. * @wr_mas: the maple write state - * - * Return: True if stored, false otherwise */ -static inline bool mas_wr_slot_store(struct ma_wr_state *wr_mas) +static inline void mas_wr_slot_store(struct ma_wr_state *wr_mas) { struct ma_state *mas = wr_mas->mas; unsigned char offset = mas->offset; @@ -3919,7 +3906,7 @@ static inline bool mas_wr_slot_store(struct ma_wr_state *wr_mas) wr_mas->pivots[offset + 1] = mas->last; mas->offset++; /* Keep mas accurate. */ } else { - return false; + return; } trace_ma_write(__func__, mas, 0, wr_mas->entry); @@ -3930,7 +3917,7 @@ static inline bool mas_wr_slot_store(struct ma_wr_state *wr_mas) if (!wr_mas->entry || gap) mas_update_gap(mas); - return true; + return; } static inline void mas_wr_extend_null(struct ma_wr_state *wr_mas) @@ -4004,10 +3991,8 @@ static inline unsigned char mas_wr_new_end(struct ma_wr_state *wr_mas) * This is currently unsafe in rcu mode since the end of the node may be cached * by readers while the node contents may be updated which could result in * inaccurate information. 
- * - * Return: True if appended, false otherwise */ -static inline bool mas_wr_append(struct ma_wr_state *wr_mas, +static inline void mas_wr_append(struct ma_wr_state *wr_mas, unsigned char new_end) { struct ma_state *mas = wr_mas->mas; @@ -4046,7 +4031,7 @@ static inline bool mas_wr_append(struct ma_wr_state *wr_mas, mas->end = new_end; trace_ma_write(__func__, mas, new_end, wr_mas->entry); - return true; + return; } /* From dd4d30d1cdbe826e6569b44453c2d9bb9424d234 Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Wed, 14 Aug 2024 14:02:47 +1200 Subject: [PATCH 139/416] mm: override mTHP "enabled" defaults at kernel cmdline Add thp_anon= cmdline parameter to allow specifying the default enablement of each supported anon THP size. The parameter accepts the following format and can be provided multiple times to configure each size: thp_anon=,[KMG]:;-[KMG]: An example: thp_anon=16K-64K:always;128K,512K:inherit;256K:madvise;1M-2M:never See Documentation/admin-guide/mm/transhuge.rst for more details. Configuring the defaults at boot time is useful to allow early user space to take advantage of mTHP before its been configured through sysfs. [v-songbaohua@oppo.com: use get_oder() and check size is is_power_of_2] Link: https://lkml.kernel.org/r/20240814224635.43272-1-21cnbao@gmail.com [ryan.roberts@arm.com: some minor cleanup according to David's comments] Link: https://lkml.kernel.org/r/20240820105244.62703-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240814020247.67297-1-21cnbao@gmail.com Signed-off-by: Ryan Roberts Co-developed-by: Barry Song Signed-off-by: Barry Song Reviewed-by: Baolin Wang Tested-by: Baolin Wang Acked-by: David Hildenbrand Cc: Jonathan Corbet Cc: Lance Yang Signed-off-by: Andrew Morton --- .../admin-guide/kernel-parameters.txt | 9 ++ Documentation/admin-guide/mm/transhuge.rst | 36 +++++-- mm/huge_memory.c | 98 ++++++++++++++++++- 3 files changed, 136 insertions(+), 7 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 09126bb8cc9f..998076fd4882 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -6614,6 +6614,15 @@ : poll all this frequency 0: no polling (default) + thp_anon= [KNL] + Format: ,[KMG]:;-[KMG]: + state is one of "always", "madvise", "never" or "inherit". + Control the default behavior of the system with respect + to anonymous transparent hugepages. + Can be used multiple times for multiple anon THP sizes. + See Documentation/admin-guide/mm/transhuge.rst for more + details. + threadirqs [KNL,EARLY] Force threading of all interrupt handlers except those marked explicitly IRQF_NO_THREAD. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 058485daf186..79435c537e21 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -284,13 +284,37 @@ that THP is shared. Exceeding the number would block the collapse:: A higher value may increase memory footprint for some workloads. -Boot parameter -============== +Boot parameters +=============== -You can change the sysfs boot time defaults of Transparent Hugepage -Support by passing the parameter ``transparent_hugepage=always`` or -``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` -to the kernel command line. 
+You can change the sysfs boot time default for the top-level "enabled" +control by passing the parameter ``transparent_hugepage=always`` or +``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the +kernel command line. + +Alternatively, each supported anonymous THP size can be controlled by +passing ``thp_anon=,[KMG]:;-[KMG]:``, +where ```` is the THP size (must be a power of 2 of PAGE_SIZE and +supported anonymous THP) and ```` is one of ``always``, ``madvise``, +``never`` or ``inherit``. + +For example, the following will set 16K, 32K, 64K THP to ``always``, +set 128K, 512K to ``inherit``, set 256K to ``madvise`` and 1M, 2M +to ``never``:: + + thp_anon=16K-64K:always;128K,512K:inherit;256K:madvise;1M-2M:never + +``thp_anon=`` may be specified multiple times to configure all THP sizes as +required. If ``thp_anon=`` is specified at least once, any anon THP sizes +not explicitly configured on the command line are implicitly set to +``never``. + +``transparent_hugepage`` setting only affects the global toggle. If +``thp_anon`` is not specified, PMD_ORDER THP will default to ``inherit``. +However, if a valid ``thp_anon`` setting is provided by the user, the +PMD_ORDER THP policy will be overridden. If the policy for PMD_ORDER +is not defined within a valid ``thp_anon``, its policy will default to +``never``. Hugepages in tmpfs/shmem ======================== diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ac9e8a77059d..ef6d4f617c4a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -81,6 +81,7 @@ unsigned long huge_zero_pfn __read_mostly = ~0UL; unsigned long huge_anon_orders_always __read_mostly; unsigned long huge_anon_orders_madvise __read_mostly; unsigned long huge_anon_orders_inherit __read_mostly; +static bool anon_orders_configured __initdata; unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, unsigned long vm_flags, @@ -649,7 +650,8 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj) * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time * constant so we have to do this here. 
*/ - huge_anon_orders_inherit = BIT(PMD_ORDER); + if (!anon_orders_configured) + huge_anon_orders_inherit = BIT(PMD_ORDER); *hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj); if (unlikely(!*hugepage_kobj)) { @@ -834,6 +836,100 @@ out: } __setup("transparent_hugepage=", setup_transparent_hugepage); +static inline int get_order_from_str(const char *size_str) +{ + unsigned long size; + char *endptr; + int order; + + size = memparse(size_str, &endptr); + + if (!is_power_of_2(size)) + goto err; + order = get_order(size); + if (BIT(order) & ~THP_ORDERS_ALL_ANON) + goto err; + + return order; +err: + pr_err("invalid size %s in thp_anon boot parameter\n", size_str); + return -EINVAL; +} + +static char str_dup[PAGE_SIZE] __initdata; +static int __init setup_thp_anon(char *str) +{ + char *token, *range, *policy, *subtoken; + unsigned long always, inherit, madvise; + char *start_size, *end_size; + int start, end, nr; + char *p; + + if (!str || strlen(str) + 1 > PAGE_SIZE) + goto err; + strcpy(str_dup, str); + + always = huge_anon_orders_always; + madvise = huge_anon_orders_madvise; + inherit = huge_anon_orders_inherit; + p = str_dup; + while ((token = strsep(&p, ";")) != NULL) { + range = strsep(&token, ":"); + policy = token; + + if (!policy) + goto err; + + while ((subtoken = strsep(&range, ",")) != NULL) { + if (strchr(subtoken, '-')) { + start_size = strsep(&subtoken, "-"); + end_size = subtoken; + + start = get_order_from_str(start_size); + end = get_order_from_str(end_size); + } else { + start = end = get_order_from_str(subtoken); + } + + if (start < 0 || end < 0 || start > end) + goto err; + + nr = end - start + 1; + if (!strcmp(policy, "always")) { + bitmap_set(&always, start, nr); + bitmap_clear(&inherit, start, nr); + bitmap_clear(&madvise, start, nr); + } else if (!strcmp(policy, "madvise")) { + bitmap_set(&madvise, start, nr); + bitmap_clear(&inherit, start, nr); + bitmap_clear(&always, start, nr); + } else if (!strcmp(policy, "inherit")) { + bitmap_set(&inherit, start, nr); + bitmap_clear(&madvise, start, nr); + bitmap_clear(&always, start, nr); + } else if (!strcmp(policy, "never")) { + bitmap_clear(&inherit, start, nr); + bitmap_clear(&madvise, start, nr); + bitmap_clear(&always, start, nr); + } else { + pr_err("invalid policy %s in thp_anon boot parameter\n", policy); + goto err; + } + } + } + + huge_anon_orders_always = always; + huge_anon_orders_madvise = madvise; + huge_anon_orders_inherit = inherit; + anon_orders_configured = true; + return 1; + +err: + pr_warn("thp_anon=%s: error parsing string, ignoring setting\n", str); + return 0; +} +__setup("thp_anon=", setup_thp_anon); + pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) { if (likely(vma->vm_flags & VM_WRITE)) From 5d383b69a04e2eb2e0ae06e91821fd82cb9acf73 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 Aug 2024 22:04:47 -0700 Subject: [PATCH 140/416] memcg: move v1 only percpu stats in separate struct Patch series "memcg: further decouple v1 code from v2". Some of the v1 code is still in v2 code base due to v1 fields in the struct memcg_vmstats_percpu. This field decouples those fileds from v2 struct and move all the related code into v1 only code base. This patch (of 7): At the moment struct memcg_vmstats_percpu contains two v1 only fields which consumes memory even when CONFIG_MEMCG_V1 is not enabled. In addition there are v1 only functions accessing them and are in the main memcontrol source file and can not be moved to v1 only source file due to these fields. 
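Concretely, the two v1-only fields are the nr_page_events counter and the targets[] array used for threshold notifications and softlimit tree updates; a sketch of the carved-out percpu struct (the exact definition is added to mm/memcontrol-v1.h in the hunks below):

	/* Cgroup1: threshold notifications & softlimit tree updates */
	struct memcg1_events_percpu {
		unsigned long nr_page_events;
		unsigned long targets[MEM_CGROUP_NTARGETS];
	};
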
Let's move these fields into their own struct. Later patches will move the functions accessing them to v1 source file and only allocate these fields when CONFIG_MEMCG_V1 is enabled. Link: https://lkml.kernel.org/r/20240815050453.1298138-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20240815050453.1298138-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Acked-by: Roman Gushchin Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: T.J. Mercier Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 2 ++ mm/memcontrol-v1.h | 19 +++++++++++++++++++ mm/memcontrol.c | 18 +++++++++--------- 3 files changed, 30 insertions(+), 9 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 07eadf7ecbba..08b7121cf43a 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -70,6 +70,7 @@ struct mem_cgroup_id { }; struct memcg_vmstats_percpu; +struct memcg1_events_percpu; struct memcg_vmstats; struct lruvec_stats_percpu; struct lruvec_stats; @@ -254,6 +255,7 @@ struct mem_cgroup { struct list_head objcg_list; struct memcg_vmstats_percpu __percpu *vmstats_percpu; + struct memcg1_events_percpu __percpu *events_percpu; #ifdef CONFIG_CGROUP_WRITEBACK struct list_head cgwb_list; diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h index 56d7eaa98274..8feccecf8e2a 100644 --- a/mm/memcontrol-v1.h +++ b/mm/memcontrol-v1.h @@ -56,6 +56,12 @@ enum mem_cgroup_events_target { MEM_CGROUP_NTARGETS, }; +/* Cgroup1: threshold notifications & softlimit tree updates */ +struct memcg1_events_percpu { + unsigned long nr_page_events; + unsigned long targets[MEM_CGROUP_NTARGETS]; +}; + bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, enum mem_cgroup_events_target target); unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap); @@ -69,6 +75,19 @@ unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item); unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item); int memory_stat_show(struct seq_file *m, void *v); +static inline bool memcg1_alloc_events(struct mem_cgroup *memcg) +{ + memcg->events_percpu = alloc_percpu_gfp(struct memcg1_events_percpu, + GFP_KERNEL_ACCOUNT); + return !!memcg->events_percpu; +} + +static inline void memcg1_free_events(struct mem_cgroup *memcg) +{ + if (memcg->events_percpu) + free_percpu(memcg->events_percpu); +} + /* Cgroup v1-specific declarations */ #ifdef CONFIG_MEMCG_V1 void memcg1_memcg_init(struct mem_cgroup *memcg); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 96f85fd0e5b3..36ec28d3d25e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -477,10 +477,6 @@ struct memcg_vmstats_percpu { /* Delta calculation for lockless upward propagation */ long state_prev[MEMCG_VMSTAT_SIZE]; unsigned long events_prev[NR_MEMCG_EVENTS]; - - /* Cgroup1: threshold notifications & softlimit tree updates */ - unsigned long nr_page_events; - unsigned long targets[MEM_CGROUP_NTARGETS]; } ____cacheline_aligned; struct memcg_vmstats { @@ -857,7 +853,7 @@ void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages) nr_pages = -nr_pages; /* for event */ } - __this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages); + __this_cpu_add(memcg->events_percpu->nr_page_events, nr_pages); } bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, @@ -865,8 +861,8 @@ bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, { unsigned long val, next; - val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events); - next = 
__this_cpu_read(memcg->vmstats_percpu->targets[target]); + val = __this_cpu_read(memcg->events_percpu->nr_page_events); + next = __this_cpu_read(memcg->events_percpu->targets[target]); /* from time_after() in jiffies.h */ if ((long)(next - val) < 0) { switch (target) { @@ -879,7 +875,7 @@ bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, default: break; } - __this_cpu_write(memcg->vmstats_percpu->targets[target], next); + __this_cpu_write(memcg->events_percpu->targets[target], next); return true; } return false; @@ -3477,6 +3473,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); + memcg1_free_events(memcg); kfree(memcg->vmstats); free_percpu(memcg->vmstats_percpu); kfree(memcg); @@ -3517,6 +3514,9 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) if (!memcg->vmstats_percpu) goto fail; + if (!memcg1_alloc_events(memcg)) + goto fail; + for_each_possible_cpu(cpu) { if (parent) pstatc = per_cpu_ptr(parent->vmstats_percpu, cpu); @@ -4631,7 +4631,7 @@ static void uncharge_batch(const struct uncharge_gather *ug) local_irq_save(flags); __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout); - __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory); + __this_cpu_add(ug->memcg->events_percpu->nr_page_events, ug->nr_memory); memcg1_check_events(ug->memcg, ug->nid); local_irq_restore(flags); From 41213dd0f81654f0a7863bb5b07a2ebaaf732ebe Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 Aug 2024 22:04:48 -0700 Subject: [PATCH 141/416] memcg: move mem_cgroup_event_ratelimit to v1 code There are no callers of mem_cgroup_event_ratelimit() in the v2 code. Move it to v1 only code and rename it to memcg1_event_ratelimit(). Link: https://lkml.kernel.org/r/20240815050453.1298138-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: T.J. Mercier Signed-off-by: Andrew Morton --- mm/memcontrol-v1.c | 32 ++++++++++++++++++++++++++++++-- mm/memcontrol-v1.h | 2 -- mm/memcontrol.c | 28 ---------------------------- 3 files changed, 30 insertions(+), 32 deletions(-) diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index 417c96f2da28..38809a26089a 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -1439,6 +1439,34 @@ static void mem_cgroup_threshold(struct mem_cgroup *memcg) } } +#define THRESHOLDS_EVENTS_TARGET 128 +#define SOFTLIMIT_EVENTS_TARGET 1024 + +static bool memcg1_event_ratelimit(struct mem_cgroup *memcg, + enum mem_cgroup_events_target target) +{ + unsigned long val, next; + + val = __this_cpu_read(memcg->events_percpu->nr_page_events); + next = __this_cpu_read(memcg->events_percpu->targets[target]); + /* from time_after() in jiffies.h */ + if ((long)(next - val) < 0) { + switch (target) { + case MEM_CGROUP_TARGET_THRESH: + next = val + THRESHOLDS_EVENTS_TARGET; + break; + case MEM_CGROUP_TARGET_SOFTLIMIT: + next = val + SOFTLIMIT_EVENTS_TARGET; + break; + default: + break; + } + __this_cpu_write(memcg->events_percpu->targets[target], next); + return true; + } + return false; +} + /* * Check events in order. 
* @@ -1449,11 +1477,11 @@ void memcg1_check_events(struct mem_cgroup *memcg, int nid) return; /* threshold event is triggered in finer grain than soft limit */ - if (unlikely(mem_cgroup_event_ratelimit(memcg, + if (unlikely(memcg1_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH))) { bool do_softlimit; - do_softlimit = mem_cgroup_event_ratelimit(memcg, + do_softlimit = memcg1_event_ratelimit(memcg, MEM_CGROUP_TARGET_SOFTLIMIT); mem_cgroup_threshold(memcg); if (unlikely(do_softlimit)) diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h index 8feccecf8e2a..fb7d439f19de 100644 --- a/mm/memcontrol-v1.h +++ b/mm/memcontrol-v1.h @@ -62,8 +62,6 @@ struct memcg1_events_percpu { unsigned long targets[MEM_CGROUP_NTARGETS]; }; -bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, - enum mem_cgroup_events_target target); unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap); void drain_all_stock(struct mem_cgroup *root_memcg); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 36ec28d3d25e..35dec7e38870 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -95,9 +95,6 @@ static bool cgroup_memory_nobpf __ro_after_init; static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq); #endif -#define THRESHOLDS_EVENTS_TARGET 128 -#define SOFTLIMIT_EVENTS_TARGET 1024 - static inline bool task_is_dying(void) { return tsk_is_oom_victim(current) || fatal_signal_pending(current) || @@ -856,31 +853,6 @@ void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages) __this_cpu_add(memcg->events_percpu->nr_page_events, nr_pages); } -bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, - enum mem_cgroup_events_target target) -{ - unsigned long val, next; - - val = __this_cpu_read(memcg->events_percpu->nr_page_events); - next = __this_cpu_read(memcg->events_percpu->targets[target]); - /* from time_after() in jiffies.h */ - if ((long)(next - val) < 0) { - switch (target) { - case MEM_CGROUP_TARGET_THRESH: - next = val + THRESHOLDS_EVENTS_TARGET; - break; - case MEM_CGROUP_TARGET_SOFTLIMIT: - next = val + SOFTLIMIT_EVENTS_TARGET; - break; - default: - break; - } - __this_cpu_write(memcg->events_percpu->targets[target], next); - return true; - } - return false; -} - struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p) { /* From 7d7602b4bed9b4e70c3c0650791b6b4531733a42 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 Aug 2024 22:04:49 -0700 Subject: [PATCH 142/416] memcg: move mem_cgroup_charge_statistics to v1 code There are no callers of mem_cgroup_charge_statistics() in the v2 code base, so move it to the v1 only code and rename it to memcg1_charge_statistics(). Link: https://lkml.kernel.org/r/20240815050453.1298138-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: T.J. 
Mercier Signed-off-by: Andrew Morton --- mm/memcontrol-v1.c | 17 +++++++++++++++-- mm/memcontrol-v1.h | 3 ++- mm/memcontrol.c | 19 +++---------------- 3 files changed, 20 insertions(+), 19 deletions(-) diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index 38809a26089a..7317f2ce8876 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -853,9 +853,9 @@ static int mem_cgroup_move_account(struct folio *folio, nid = folio_nid(folio); local_irq_disable(); - mem_cgroup_charge_statistics(to, nr_pages); + memcg1_charge_statistics(to, nr_pages); memcg1_check_events(to, nid); - mem_cgroup_charge_statistics(from, -nr_pages); + memcg1_charge_statistics(from, -nr_pages); memcg1_check_events(from, nid); local_irq_enable(); out: @@ -1439,6 +1439,19 @@ static void mem_cgroup_threshold(struct mem_cgroup *memcg) } } +void memcg1_charge_statistics(struct mem_cgroup *memcg, int nr_pages) +{ + /* pagein of a big page is an event. So, ignore page size */ + if (nr_pages > 0) + __count_memcg_events(memcg, PGPGIN, 1); + else { + __count_memcg_events(memcg, PGPGOUT, 1); + nr_pages = -nr_pages; /* for event */ + } + + __this_cpu_add(memcg->events_percpu->nr_page_events, nr_pages); +} + #define THRESHOLDS_EVENTS_TARGET 128 #define SOFTLIMIT_EVENTS_TARGET 1024 diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h index fb7d439f19de..ef72d0b7c5c6 100644 --- a/mm/memcontrol-v1.h +++ b/mm/memcontrol-v1.h @@ -7,7 +7,6 @@ /* Cgroup v1 and v2 common declarations */ -void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages); int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned int nr_pages); @@ -116,6 +115,7 @@ bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked); void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked); void memcg1_oom_recover(struct mem_cgroup *memcg); +void memcg1_charge_statistics(struct mem_cgroup *memcg, int nr_pages); void memcg1_check_events(struct mem_cgroup *memcg, int nid); void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s); @@ -147,6 +147,7 @@ static inline bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked) { static inline void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked) {} static inline void memcg1_oom_recover(struct mem_cgroup *memcg) {} +static inline void memcg1_charge_statistics(struct mem_cgroup *memcg, int nr_pages) {} static inline void memcg1_check_events(struct mem_cgroup *memcg, int nid) {} static inline void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) {} diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 35dec7e38870..059c9e0ba350 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -840,19 +840,6 @@ unsigned long memcg_events_local(struct mem_cgroup *memcg, int event) return READ_ONCE(memcg->vmstats->events_local[i]); } -void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages) -{ - /* pagein of a big page is an event. 
So, ignore page size */ - if (nr_pages > 0) - __count_memcg_events(memcg, PGPGIN, 1); - else { - __count_memcg_events(memcg, PGPGOUT, 1); - nr_pages = -nr_pages; /* for event */ - } - - __this_cpu_add(memcg->events_percpu->nr_page_events, nr_pages); -} - struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p) { /* @@ -2366,7 +2353,7 @@ void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg) commit_charge(folio, memcg); local_irq_disable(); - mem_cgroup_charge_statistics(memcg, folio_nr_pages(folio)); + memcg1_charge_statistics(memcg, folio_nr_pages(folio)); memcg1_check_events(memcg, folio_nid(folio)); local_irq_enable(); } @@ -4742,7 +4729,7 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new) commit_charge(new, memcg); local_irq_save(flags); - mem_cgroup_charge_statistics(memcg, nr_pages); + memcg1_charge_statistics(memcg, nr_pages); memcg1_check_events(memcg, folio_nid(new)); local_irq_restore(flags); } @@ -4987,7 +4974,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry) * only synchronisation we have for updating the per-CPU variables. */ memcg_stats_lock(); - mem_cgroup_charge_statistics(memcg, -nr_entries); + memcg1_charge_statistics(memcg, -nr_entries); memcg_stats_unlock(); memcg1_check_events(memcg, folio_nid(folio)); From f7d49ba03ae7555e6337b9d39eb080b4a064eab8 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 Aug 2024 22:04:50 -0700 Subject: [PATCH 143/416] memcg: move v1 events and statistics code to v1 file Currently the common code path for charge commit, swapout and batched uncharge are executing v1 only code which is completely useless for the v2 deployments where CONFIG_MEMCG_V1 is disabled. In addition, it is mucking with IRQs which might be slow on some architectures. Let's move all of this code to v1 only code and remove them from v2 only deployments. Link: https://lkml.kernel.org/r/20240815050453.1298138-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: T.J. Mercier Signed-off-by: Andrew Morton --- mm/memcontrol-v1.c | 37 +++++++++++++++++++++++++++++++++++++ mm/memcontrol-v1.h | 14 ++++++++++++++ mm/memcontrol.c | 33 ++++----------------------------- 3 files changed, 55 insertions(+), 29 deletions(-) diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index 7317f2ce8876..540b45ab4b26 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -1502,6 +1502,43 @@ void memcg1_check_events(struct mem_cgroup *memcg, int nid) } } +void memcg1_commit_charge(struct folio *folio, struct mem_cgroup *memcg) +{ + unsigned long flags; + + local_irq_save(flags); + memcg1_charge_statistics(memcg, folio_nr_pages(folio)); + memcg1_check_events(memcg, folio_nid(folio)); + local_irq_restore(flags); +} + +void memcg1_swapout(struct folio *folio, struct mem_cgroup *memcg) +{ + /* + * Interrupts should be disabled here because the caller holds the + * i_pages lock which is taken with interrupts-off. It is + * important here to have the interrupts disabled because it is the + * only synchronisation we have for updating the per-CPU variables. 
+ */ + preempt_disable_nested(); + VM_WARN_ON_IRQS_ENABLED(); + memcg1_charge_statistics(memcg, -folio_nr_pages(folio)); + preempt_enable_nested(); + memcg1_check_events(memcg, folio_nid(folio)); +} + +void memcg1_uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, + unsigned long nr_memory, int nid) +{ + unsigned long flags; + + local_irq_save(flags); + __count_memcg_events(memcg, PGPGOUT, pgpgout); + __this_cpu_add(memcg->events_percpu->nr_page_events, nr_memory); + memcg1_check_events(memcg, nid); + local_irq_restore(flags); +} + static int compare_thresholds(const void *a, const void *b) { const struct mem_cgroup_threshold *_a = a; diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h index ef72d0b7c5c6..376d021a2bf4 100644 --- a/mm/memcontrol-v1.h +++ b/mm/memcontrol-v1.h @@ -118,6 +118,11 @@ void memcg1_oom_recover(struct mem_cgroup *memcg); void memcg1_charge_statistics(struct mem_cgroup *memcg, int nr_pages); void memcg1_check_events(struct mem_cgroup *memcg, int nid); +void memcg1_commit_charge(struct folio *folio, struct mem_cgroup *memcg); +void memcg1_swapout(struct folio *folio, struct mem_cgroup *memcg); +void memcg1_uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, + unsigned long nr_memory, int nid); + void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s); void memcg1_account_kmem(struct mem_cgroup *memcg, int nr_pages); @@ -150,6 +155,15 @@ static inline void memcg1_oom_recover(struct mem_cgroup *memcg) {} static inline void memcg1_charge_statistics(struct mem_cgroup *memcg, int nr_pages) {} static inline void memcg1_check_events(struct mem_cgroup *memcg, int nid) {} +static inline void memcg1_commit_charge(struct folio *folio, + struct mem_cgroup *memcg) {} + +static inline void memcg1_swapout(struct folio *folio, struct mem_cgroup *memcg) {} + +static inline void memcg1_uncharge_batch(struct mem_cgroup *memcg, + unsigned long pgpgout, + unsigned long nr_memory, int nid) {} + static inline void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) {} static inline void memcg1_account_kmem(struct mem_cgroup *memcg, int nr_pages) {} diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 059c9e0ba350..6b4b51caf71b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2351,11 +2351,7 @@ void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg) { css_get(&memcg->css); commit_charge(folio, memcg); - - local_irq_disable(); - memcg1_charge_statistics(memcg, folio_nr_pages(folio)); - memcg1_check_events(memcg, folio_nid(folio)); - local_irq_enable(); + memcg1_commit_charge(folio, memcg); } static inline void __mod_objcg_mlstate(struct obj_cgroup *objcg, @@ -4575,8 +4571,6 @@ static inline void uncharge_gather_clear(struct uncharge_gather *ug) static void uncharge_batch(const struct uncharge_gather *ug) { - unsigned long flags; - if (ug->nr_memory) { page_counter_uncharge(&ug->memcg->memory, ug->nr_memory); if (do_memsw_account()) @@ -4588,11 +4582,7 @@ static void uncharge_batch(const struct uncharge_gather *ug) memcg1_oom_recover(ug->memcg); } - local_irq_save(flags); - __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout); - __this_cpu_add(ug->memcg->events_percpu->nr_page_events, ug->nr_memory); - memcg1_check_events(ug->memcg, ug->nid); - local_irq_restore(flags); + memcg1_uncharge_batch(ug->memcg, ug->pgpgout, ug->nr_memory, ug->nid); /* drop reference from uncharge_folio */ css_put(&ug->memcg->css); @@ -4699,7 +4689,6 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new) { struct mem_cgroup 
*memcg; long nr_pages = folio_nr_pages(new); - unsigned long flags; VM_BUG_ON_FOLIO(!folio_test_locked(old), old); VM_BUG_ON_FOLIO(!folio_test_locked(new), new); @@ -4727,11 +4716,7 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new) css_get(&memcg->css); commit_charge(new, memcg); - - local_irq_save(flags); - memcg1_charge_statistics(memcg, nr_pages); - memcg1_check_events(memcg, folio_nid(new)); - local_irq_restore(flags); + memcg1_commit_charge(new, memcg); } /** @@ -4967,17 +4952,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry) page_counter_uncharge(&memcg->memsw, nr_entries); } - /* - * Interrupts should be disabled here because the caller holds the - * i_pages lock which is taken with interrupts-off. It is - * important here to have the interrupts disabled because it is the - * only synchronisation we have for updating the per-CPU variables. - */ - memcg_stats_lock(); - memcg1_charge_statistics(memcg, -nr_entries); - memcg_stats_unlock(); - memcg1_check_events(memcg, folio_nid(folio)); - + memcg1_swapout(folio, memcg); css_put(&memcg->css); } From a5ebe6bbe52de8d96e628c5696bb9e6dff136ca0 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 Aug 2024 22:04:51 -0700 Subject: [PATCH 144/416] memcg: make v1 only functions static The functions memcg1_charge_statistics() and memcg1_check_events() are never used outside of v1 source file. So, make them static. Link: https://lkml.kernel.org/r/20240815050453.1298138-6-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: T.J. Mercier Signed-off-by: Andrew Morton --- mm/memcontrol-v1.c | 7 +++++-- mm/memcontrol-v1.h | 6 ------ 2 files changed, 5 insertions(+), 8 deletions(-) diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index 540b45ab4b26..8168a0c965b7 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -742,6 +742,9 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma, return folio_file_page(folio, index); } +static void memcg1_check_events(struct mem_cgroup *memcg, int nid); +static void memcg1_charge_statistics(struct mem_cgroup *memcg, int nr_pages); + /** * mem_cgroup_move_account - move account of the folio * @folio: The folio. @@ -1439,7 +1442,7 @@ static void mem_cgroup_threshold(struct mem_cgroup *memcg) } } -void memcg1_charge_statistics(struct mem_cgroup *memcg, int nr_pages) +static void memcg1_charge_statistics(struct mem_cgroup *memcg, int nr_pages) { /* pagein of a big page is an event. So, ignore page size */ if (nr_pages > 0) @@ -1484,7 +1487,7 @@ static bool memcg1_event_ratelimit(struct mem_cgroup *memcg, * Check events in order. 
* */ -void memcg1_check_events(struct mem_cgroup *memcg, int nid) +static void memcg1_check_events(struct mem_cgroup *memcg, int nid) { if (IS_ENABLED(CONFIG_PREEMPT_RT)) return; diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h index 376d021a2bf4..0a9f3f9c2362 100644 --- a/mm/memcontrol-v1.h +++ b/mm/memcontrol-v1.h @@ -115,9 +115,6 @@ bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked); void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked); void memcg1_oom_recover(struct mem_cgroup *memcg); -void memcg1_charge_statistics(struct mem_cgroup *memcg, int nr_pages); -void memcg1_check_events(struct mem_cgroup *memcg, int nid); - void memcg1_commit_charge(struct folio *folio, struct mem_cgroup *memcg); void memcg1_swapout(struct folio *folio, struct mem_cgroup *memcg); void memcg1_uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, @@ -152,9 +149,6 @@ static inline bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked) { static inline void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked) {} static inline void memcg1_oom_recover(struct mem_cgroup *memcg) {} -static inline void memcg1_charge_statistics(struct mem_cgroup *memcg, int nr_pages) {} -static inline void memcg1_check_events(struct mem_cgroup *memcg, int nid) {} - static inline void memcg1_commit_charge(struct folio *folio, struct mem_cgroup *memcg) {} From 0ccaf421d6592bb99bb9b424e4ccca3c6367d799 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 Aug 2024 22:04:52 -0700 Subject: [PATCH 145/416] memcg: allocate v1 event percpu only on v1 deployment Currently memcg->events_percpu gets allocated on v2 deployments. Let's move the allocation to v1 only codebase. This is not needed in v2. Link: https://lkml.kernel.org/r/20240815050453.1298138-7-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: T.J. Mercier Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 3 ++- mm/memcontrol-v1.c | 19 +++++++++++++++++++ mm/memcontrol-v1.h | 26 +++++++------------------- 3 files changed, 28 insertions(+), 20 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 08b7121cf43a..ed170399179a 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -255,7 +255,6 @@ struct mem_cgroup { struct list_head objcg_list; struct memcg_vmstats_percpu __percpu *vmstats_percpu; - struct memcg1_events_percpu __percpu *events_percpu; #ifdef CONFIG_CGROUP_WRITEBACK struct list_head cgwb_list; @@ -277,6 +276,8 @@ struct mem_cgroup { struct page_counter kmem; /* v1 only */ struct page_counter tcpmem; /* v1 only */ + struct memcg1_events_percpu __percpu *events_percpu; + unsigned long soft_limit; /* protected by memcg_oom_lock */ diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index 8168a0c965b7..ff972a35ecc5 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -1442,6 +1442,12 @@ static void mem_cgroup_threshold(struct mem_cgroup *memcg) } } +/* Cgroup1: threshold notifications & softlimit tree updates */ +struct memcg1_events_percpu { + unsigned long nr_page_events; + unsigned long targets[MEM_CGROUP_NTARGETS]; +}; + static void memcg1_charge_statistics(struct mem_cgroup *memcg, int nr_pages) { /* pagein of a big page is an event. 
So, ignore page size */ @@ -3033,6 +3039,19 @@ bool memcg1_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages, return false; } +bool memcg1_alloc_events(struct mem_cgroup *memcg) +{ + memcg->events_percpu = alloc_percpu_gfp(struct memcg1_events_percpu, + GFP_KERNEL_ACCOUNT); + return !!memcg->events_percpu; +} + +void memcg1_free_events(struct mem_cgroup *memcg) +{ + if (memcg->events_percpu) + free_percpu(memcg->events_percpu); +} + static int __init memcg1_init(void) { int node; diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h index 0a9f3f9c2362..3bb8b3030e61 100644 --- a/mm/memcontrol-v1.h +++ b/mm/memcontrol-v1.h @@ -55,12 +55,6 @@ enum mem_cgroup_events_target { MEM_CGROUP_NTARGETS, }; -/* Cgroup1: threshold notifications & softlimit tree updates */ -struct memcg1_events_percpu { - unsigned long nr_page_events; - unsigned long targets[MEM_CGROUP_NTARGETS]; -}; - unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap); void drain_all_stock(struct mem_cgroup *root_memcg); @@ -72,21 +66,12 @@ unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item); unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item); int memory_stat_show(struct seq_file *m, void *v); -static inline bool memcg1_alloc_events(struct mem_cgroup *memcg) -{ - memcg->events_percpu = alloc_percpu_gfp(struct memcg1_events_percpu, - GFP_KERNEL_ACCOUNT); - return !!memcg->events_percpu; -} - -static inline void memcg1_free_events(struct mem_cgroup *memcg) -{ - if (memcg->events_percpu) - free_percpu(memcg->events_percpu); -} - /* Cgroup v1-specific declarations */ #ifdef CONFIG_MEMCG_V1 + +bool memcg1_alloc_events(struct mem_cgroup *memcg); +void memcg1_free_events(struct mem_cgroup *memcg); + void memcg1_memcg_init(struct mem_cgroup *memcg); void memcg1_remove_from_trees(struct mem_cgroup *memcg); @@ -139,6 +124,9 @@ extern struct cftype mem_cgroup_legacy_files[]; #else /* CONFIG_MEMCG_V1 */ +static inline bool memcg1_alloc_events(struct mem_cgroup *memcg) { return true; } +static inline void memcg1_free_events(struct mem_cgroup *memcg) {} + static inline void memcg1_memcg_init(struct mem_cgroup *memcg) {} static inline void memcg1_remove_from_trees(struct mem_cgroup *memcg) {} static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg) {} From 98455eef806448b2144fd47655e09abc9f6530c8 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 Aug 2024 22:04:53 -0700 Subject: [PATCH 146/416] memcg: make PGPGIN and PGPGOUT v1 only Currently PGPGIN and PGPGOUT are used and exposed in the memcg v1 only code. So, let's put them under CONFIG_MEMCG_V1. Link: https://lkml.kernel.org/r/20240815050453.1298138-8-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: T.J. 
Mercier Signed-off-by: Andrew Morton --- mm/memcontrol.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6b4b51caf71b..1decbd38216e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -408,8 +408,10 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec, /* Subset of vm_event_item to report for memcg event stats */ static const unsigned int memcg_vm_event_stat[] = { +#ifdef CONFIG_MEMCG_V1 PGPGIN, PGPGOUT, +#endif PGSCAN_KSWAPD, PGSCAN_DIRECT, PGSCAN_KHUGEPAGED, @@ -1429,10 +1431,11 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) memcg_events(memcg, PGSTEAL_KHUGEPAGED)); for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); i++) { +#ifdef CONFIG_MEMCG_V1 if (memcg_vm_event_stat[i] == PGPGIN || memcg_vm_event_stat[i] == PGPGOUT) continue; - +#endif seq_buf_printf(s, "%s %lu\n", vm_event_name(memcg_vm_event_stat[i]), memcg_events(memcg, memcg_vm_event_stat[i])); From d046ff46ee3b920ac617564226d2a0494d9af772 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 Aug 2024 15:00:18 -0700 Subject: [PATCH 147/416] memcg: initiate deprecation of v1 tcp accounting Patch series "memcg: initiate deprecation of v1 features", v2. Start the deprecation process of the memcg v1 features which we discussed during LSFMMBPF 2024 [1]. For now add the warnings to collect the information on how the current users are using these features. Next we will work on providing better alternatives in v2 (if needed) and fully deprecate these features. Link: https://lwn.net/Articles/974575 [1] This patch (of 4): Memcg v1 provides opt-in TCP memory accounting feature. However it is mostly unused due to its performance impact on the network traffic. In v2, the TCP memory is accounted in the regular memory usage and is transparent to the users but they can observe the TCP memory usage through memcg stats. Let's initiate the deprecation process of memcg v1's tcp accounting functionality and add warnings to gather if there are any users and if there are, collect how they are using it and plan to provide them better alternative in v2. Link: https://lkml.kernel.org/r/20240814220021.3208384-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20240814220021.3208384-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Reviewed-by: T.J. Mercier Acked-by: Michal Hocko Acked-by: Roman Gushchin Cc: Johannes Weiner Cc: Muchun Song Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v1/memory.rst | 8 ++++++++ mm/memcontrol-v1.c | 3 +++ 2 files changed, 11 insertions(+) diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 9cde26d33843..0114d758beab 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -105,10 +105,18 @@ Brief summary of control files. memory.kmem.max_usage_in_bytes show max kernel memory usage recorded memory.kmem.tcp.limit_in_bytes set/show hard limit for tcp buf memory + This knob is deprecated and shouldn't be + used. memory.kmem.tcp.usage_in_bytes show current tcp buf memory allocation + This knob is deprecated and shouldn't be + used. memory.kmem.tcp.failcnt show the number of tcp buf memory usage hits limits + This knob is deprecated and shouldn't be + used. memory.kmem.tcp.max_usage_in_bytes show max tcp buf memory usage recorded + This knob is deprecated and shouldn't be + used. ==================================== ========================================== 1. 
History diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index ff972a35ecc5..89944f83afb3 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -2534,6 +2534,9 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of, ret = 0; break; case _TCP: + pr_warn_once("kmem.tcp.limit_in_bytes is deprecated and will be removed. " + "Please report your usecase to linux-mm@kvack.org if you " + "depend on this functionality.\n"); ret = memcg_update_tcp_max(memcg, nr_pages); break; } From 569c4f62d84af874deaa2a31ad200f695d6d67dd Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 Aug 2024 15:00:19 -0700 Subject: [PATCH 148/416] memcg: initiate deprecation of v1 soft limit Memcg v1 provides soft limit functionality for the best effort memory sharing between multiple workloads on a system. It is usually triggered through kswapd and at the moment does not reclaim kernel memory. Memcg v2 provides more straightforward best effort (memory.low) and hard protection (memory.min) functionalities. Let's initiate the deprecation of soft limit from v1 and gather if v2 needs something more to move the existing v1 users to v2 regarding soft limit. Link: https://lkml.kernel.org/r/20240814220021.3208384-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Reviewed-by: T.J. Mercier Acked-by: Michal Hocko Acked-by: Roman Gushchin Cc: Johannes Weiner Cc: Muchun Song Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v1/memory.rst | 8 ++++++-- mm/memcontrol-v1.c | 3 +++ 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 0114d758beab..6831c6c16e3f 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -78,6 +78,8 @@ Brief summary of control files. memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded memory.soft_limit_in_bytes set/show soft limit of memory usage This knob is not available on CONFIG_PREEMPT_RT systems. + This knob is deprecated and shouldn't be + used. memory.stat show various statistics memory.use_hierarchy set/show hierarchical account enabled This knob is deprecated and shouldn't be @@ -701,8 +703,10 @@ For compatibility reasons writing 1 to memory.use_hierarchy will always pass:: # echo 1 > memory.use_hierarchy -7. Soft limits -============== +7. Soft limits (DEPRECATED) +=========================== + +THIS IS DEPRECATED! Soft limits allow for greater sharing of memory. The idea behind soft limits is to allow control groups to use as much of the memory as needed, provided diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index 89944f83afb3..fe3d8809a8be 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -2545,6 +2545,9 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of, if (IS_ENABLED(CONFIG_PREEMPT_RT)) { ret = -EOPNOTSUPP; } else { + pr_warn_once("soft_limit_in_bytes is deprecated and will be removed. " + "Please report your usecase to linux-mm@kvack.org if you " + "depend on this functionality.\n"); WRITE_ONCE(memcg->soft_limit, nr_pages); ret = 0; } From 6df4ad704707801f9e451f2adbac6bcc060ff728 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 Aug 2024 15:00:20 -0700 Subject: [PATCH 149/416] memcg: initiate deprecation of oom_control The oom_control provides functionality to disable memcg oom-killer, notifications on oom-kill and reading the stats regarding oom-kills. 
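For context, a minimal sketch of how current v1 users typically consume this interface from userspace: an eventfd is registered through cgroup.event_control against memory.oom_control, and the handler wakes up on each OOM event. The cgroup path below is hypothetical and error handling is omitted; this is only an illustration of the usage pattern whose deprecation is being started here, not code from this patch.

/* Illustrative only: register for memcg v1 OOM notifications via eventfd. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
	/* Hypothetical cgroup path; adjust to the actual v1 hierarchy in use. */
	const char *grp = "/sys/fs/cgroup/memory/test";
	char buf[64];
	int efd, oom_fd, ctrl_fd;
	uint64_t cnt;

	efd = eventfd(0, 0);
	snprintf(buf, sizeof(buf), "%s/memory.oom_control", grp);
	oom_fd = open(buf, O_RDONLY);
	snprintf(buf, sizeof(buf), "%s/cgroup.event_control", grp);
	ctrl_fd = open(buf, O_WRONLY);

	/* Writing "<event_fd> <oom_control_fd>" registers the notification. */
	snprintf(buf, sizeof(buf), "%d %d", efd, oom_fd);
	write(ctrl_fd, buf, strlen(buf));

	/* Block until the cgroup hits OOM, then react (log, kill, etc.). */
	read(efd, &cnt, sizeof(cnt));
	printf("memcg OOM event received (%llu)\n", (unsigned long long)cnt);
	return 0;
}
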
This interface was mainly introduced to provide functionality for userspace oom-killers. However it is not robust enough and only supports OOM handling in the page fault path. For v2, the users can use the combination of memory.events notifications, memory.high and PSI to provide userspace OOM-killing functionality. Actually LMKD in Android and OOMd in systemd and Meta infrastructure already use PSI in combination with other stats to implement userspace OOM-killing. Let's start the deprecation process for v1 and gather the info on how the current users are using this interface and work on providing a more robust functionality in v2. Link: https://lkml.kernel.org/r/20240814220021.3208384-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Reviewed-by: T.J. Mercier Acked-by: Michal Hocko Acked-by: Roman Gushchin Cc: Johannes Weiner Cc: Muchun Song Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v1/memory.rst | 8 ++++++-- mm/memcontrol-v1.c | 7 +++++++ 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 6831c6c16e3f..0042206414c8 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -92,6 +92,8 @@ Brief summary of control files. This knob is deprecated and shouldn't be used. memory.oom_control set/show oom controls. + This knob is deprecated and shouldn't be + used. memory.numa_stat show the number of memory usage per numa node memory.kmem.limit_in_bytes Deprecated knob to set and read the kernel @@ -846,8 +848,10 @@ It's applicable for root and non-root cgroup. .. _cgroup-v1-memory-oom-control: -10. OOM Control -=============== +10. OOM Control (DEPRECATED) +============================ + +THIS IS DEPRECATED! memory.oom_control file is for OOM notification and other controls. diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index fe3d8809a8be..3240f930d5f7 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -1994,6 +1994,9 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of, event->register_event = mem_cgroup_usage_register_event; event->unregister_event = mem_cgroup_usage_unregister_event; } else if (!strcmp(name, "memory.oom_control")) { + pr_warn_once("oom_control is deprecated and will be removed. " + "Please report your usecase to linux-mm-@kvack.org" + " if you depend on this functionality. \n"); event->register_event = mem_cgroup_oom_register_event; event->unregister_event = mem_cgroup_oom_unregister_event; } else if (!strcmp(name, "memory.pressure_level")) { @@ -2841,6 +2844,10 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css, { struct mem_cgroup *memcg = mem_cgroup_from_css(css); + pr_warn_once("oom_control is deprecated and will be removed. " + "Please report your usecase to linux-mm-@kvack.org if you " + "depend on this functionality. \n"); + /* cannot set to root cgroup and only 0 and 1 are allowed */ if (mem_cgroup_is_root(memcg) || !((val == 0) || (val == 1))) return -EINVAL; From 340afb8027fab59e88a9993d9418a1ef81d3dc08 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Wed, 14 Aug 2024 15:00:21 -0700 Subject: [PATCH 150/416] memcg: initiate deprecation of pressure_level The pressure_level in memcg v1 provides memory pressure notifications to the user space. At the moment it provides notifications for three levels of memory pressure i.e. 
low, medium and critical, which are defined based on internal memory reclaim implementation details. More specifically the ratio of scanned and reclaimed pages during a memory reclaim. However this is not robust as there are workloads with mostly unreclaimable user memory or kernel memory. For v2, the users can use PSI for memory pressure status of the system or the cgroup. Let's start the deprecation process for pressure_level and add warnings to gather the info on how the current users are using this interface and how they can be used to PSI. Link: https://lkml.kernel.org/r/20240814220021.3208384-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Reviewed-by: T.J. Mercier Acked-by: Michal Hocko Acked-by: Roman Gushchin Cc: Johannes Weiner Cc: Muchun Song Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v1/memory.rst | 8 ++++++-- mm/memcontrol-v1.c | 3 +++ 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 0042206414c8..270501db9f4e 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -86,6 +86,8 @@ Brief summary of control files. used. memory.force_empty trigger forced page reclaim memory.pressure_level set memory pressure notifications + This knob is deprecated and shouldn't be + used. memory.swappiness set/show swappiness parameter of vmscan (See sysctl's vm.swappiness) memory.move_charge_at_immigrate set/show controls of moving charges @@ -898,8 +900,10 @@ At reading, current status of OOM is shown. The number of processes belonging to this cgroup killed by any kind of OOM killer. -11. Memory Pressure -=================== +11. Memory Pressure (DEPRECATED) +================================ + +THIS IS DEPRECATED! The pressure level notifications can be used to monitor the memory allocation cost; based on the pressure, applications can implement diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c index 3240f930d5f7..b37c0d870816 100644 --- a/mm/memcontrol-v1.c +++ b/mm/memcontrol-v1.c @@ -2000,6 +2000,9 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of, event->register_event = mem_cgroup_oom_register_event; event->unregister_event = mem_cgroup_oom_unregister_event; } else if (!strcmp(name, "memory.pressure_level")) { + pr_warn_once("pressure_level is deprecated and will be removed. " + "Please report your usecase to linux-mm-@kvack.org " + "if you depend on this functionality. \n"); event->register_event = vmpressure_register_event; event->unregister_event = vmpressure_unregister_event; } else if (!strcmp(name, "memory.memsw.usage_in_bytes")) { From 73ed0baae66df50359c876f65f41179d6ebd2716 Mon Sep 17 00:00:00 2001 From: Chris Li Date: Tue, 30 Jul 2024 23:49:13 -0700 Subject: [PATCH 151/416] mm: swap: swap cluster switch to double link list MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "mm: swap: mTHP swap allocator base on swap cluster order", v5. This is the short term solutions "swap cluster order" listed in my "Swap Abstraction" discussion slice 8 in the recent LSF/MM conference. When commit 845982eb264bc "mm: swap: allow storage of all mTHP orders" is introduced, it only allocates the mTHP swap entries from the new empty cluster list.  It has a fragmentation issue reported by Barry. 
https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/ The reason is that all the empty clusters have been exhausted while there are plenty of free swap entries in the cluster that are not 100% free. Remember the swap allocation order in the cluster. Keep track of the per order non full cluster list for later allocation. This series gives the swap SSD allocation a new separate code path from the HDD allocation. The new allocator use cluster list only and do not global scan swap_map[] without lock any more. This streamline the swap allocation for SSD. The code matches the execution flow much better. User impact: For users that allocate and free mix order mTHP swapping, It greatly improves the success rate of the mTHP swap allocation after the initial phase. It also performs faster when the swapfile is close to full, because the allocator can get the non full cluster from a list rather than scanning a lot of swap_map entries.  With Barry's mthp test program V2: Without: $ ./thp_swap_allocator_test -a Iteration 1: swpout inc: 32, swpout fallback inc: 192, Fallback percentage: 85.71% Iteration 2: swpout inc: 0, swpout fallback inc: 231, Fallback percentage: 100.00% Iteration 3: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00% ... Iteration 98: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00% Iteration 99: swpout inc: 0, swpout fallback inc: 215, Fallback percentage: 100.00% Iteration 100: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00% $ ./thp_swap_allocator_test -a -s Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00% Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00% Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00% .. Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00% Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00% Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00% $ ./thp_swap_allocator_test -s Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00% Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00% Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00% .. Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00% Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00% Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00% $ ./thp_swap_allocator_test Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00% Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00% Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00% .. 
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00% Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00% Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00% With: # with all 0.00% filter out $ ./thp_swap_allocator_test -a | grep -v "0.00%" $ # all result are 0.00% $ ./thp_swap_allocator_test -a -s | grep -v "0.00%" ./thp_swap_allocator_test -a -s | grep -v "0.00%" Iteration 14: swpout inc: 223, swpout fallback inc: 3, Fallback percentage: 1.33% Iteration 19: swpout inc: 219, swpout fallback inc: 7, Fallback percentage: 3.10% Iteration 28: swpout inc: 225, swpout fallback inc: 1, Fallback percentage: 0.44% Iteration 29: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44% Iteration 34: swpout inc: 220, swpout fallback inc: 8, Fallback percentage: 3.51% Iteration 35: swpout inc: 222, swpout fallback inc: 11, Fallback percentage: 4.72% Iteration 38: swpout inc: 217, swpout fallback inc: 4, Fallback percentage: 1.81% Iteration 40: swpout inc: 222, swpout fallback inc: 6, Fallback percentage: 2.63% Iteration 42: swpout inc: 221, swpout fallback inc: 2, Fallback percentage: 0.90% Iteration 43: swpout inc: 215, swpout fallback inc: 7, Fallback percentage: 3.15% Iteration 47: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88% Iteration 49: swpout inc: 217, swpout fallback inc: 1, Fallback percentage: 0.46% Iteration 52: swpout inc: 221, swpout fallback inc: 8, Fallback percentage: 3.49% Iteration 56: swpout inc: 224, swpout fallback inc: 4, Fallback percentage: 1.75% Iteration 58: swpout inc: 214, swpout fallback inc: 5, Fallback percentage: 2.28% Iteration 62: swpout inc: 220, swpout fallback inc: 3, Fallback percentage: 1.35% Iteration 64: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44% Iteration 67: swpout inc: 221, swpout fallback inc: 1, Fallback percentage: 0.45% Iteration 75: swpout inc: 220, swpout fallback inc: 9, Fallback percentage: 3.93% Iteration 82: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44% Iteration 86: swpout inc: 211, swpout fallback inc: 12, Fallback percentage: 5.38% Iteration 89: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88% Iteration 93: swpout inc: 220, swpout fallback inc: 1, Fallback percentage: 0.45% Iteration 94: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44% Iteration 96: swpout inc: 221, swpout fallback inc: 6, Fallback percentage: 2.64% Iteration 98: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44% Iteration 99: swpout inc: 227, swpout fallback inc: 3, Fallback percentage: 1.30% $ ./thp_swap_allocator_test ./thp_swap_allocator_test Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53% Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58% Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34% Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51% Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84% Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91% Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05% Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25% Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 
94.74% Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01% Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45% Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98% Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64% Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36% Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02% Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07% $ ./thp_swap_allocator_test -s ./thp_swap_allocator_test -s Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 2: swpout inc: 97, swpout fallback inc: 135, Fallback percentage: 58.19% Iteration 3: swpout inc: 42, swpout fallback inc: 192, Fallback percentage: 82.05% Iteration 4: swpout inc: 19, swpout fallback inc: 214, Fallback percentage: 91.85% Iteration 5: swpout inc: 12, swpout fallback inc: 213, Fallback percentage: 94.67% Iteration 6: swpout inc: 11, swpout fallback inc: 217, Fallback percentage: 95.18% Iteration 7: swpout inc: 9, swpout fallback inc: 214, Fallback percentage: 95.96% Iteration 8: swpout inc: 8, swpout fallback inc: 213, Fallback percentage: 96.38% Iteration 9: swpout inc: 2, swpout fallback inc: 223, Fallback percentage: 99.11% Iteration 10: swpout inc: 2, swpout fallback inc: 228, Fallback percentage: 99.13% Iteration 11: swpout inc: 4, swpout fallback inc: 214, Fallback percentage: 98.17% Iteration 12: swpout inc: 5, swpout fallback inc: 226, Fallback percentage: 97.84% Iteration 13: swpout inc: 3, swpout fallback inc: 212, Fallback percentage: 98.60% Iteration 14: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00% Iteration 15: swpout inc: 3, swpout fallback inc: 222, Fallback percentage: 98.67% Iteration 16: swpout inc: 4, swpout fallback inc: 223, Fallback percentage: 98.24% ========= Kernel compile under tmpfs with cgroup memory.max = 470M. 12 core 24 hyperthreading, 32 jobs. 10 Run each group SSD swap 10 runs average, 20G swap partition: With: user 2929.064 system 1479.381 : 1376.89 1398.22 1444.64 1477.39 1479.04 1497.27 1504.47 1531.4 1532.92 1551.57 real 1441.324 Without: user 2910.872 system 1482.732 : 1440.01 1451.4 1462.01 1467.47 1467.51 1469.3 1470.19 1496.32 1544.1 1559.01 real 1580.822 Two zram swap: zram0 3.0G zram1 20G. The idea is forcing the zram0 almost full then overflow to zram1: With: user 4320.301 system 4272.403 : 4236.24 4262.81 4264.75 4269.13 4269.44 4273.06 4279.85 4285.98 4289.64 4293.13 real 431.759 Without user 4301.393 system 4387.672 : 4374.47 4378.3 4380.95 4382.84 4383.06 4388.05 4389.76 4397.16 4398.23 4403.9 real 433.979 ------ more test result from Kaiui ---------- Test with build linux kernel using a 4G ZRAM, 1G memory.max limit on top of shmem: System info: 32 Core AMD Zen2, 64G total memory. 
Test 3 times using only 4K pages: ================================= With: ----- 1838.74user 2411.21system 2:37.86elapsed 2692%CPU (0avgtext+0avgdata 847060maxresident)k 1839.86user 2465.77system 2:39.35elapsed 2701%CPU (0avgtext+0avgdata 847060maxresident)k 1840.26user 2454.68system 2:39.43elapsed 2693%CPU (0avgtext+0avgdata 847060maxresident)k Summary (~4.6% improment of system time): User: 1839.62 System: 2443.89: 2465.77 2454.68 2411.21 Real: 158.88 Without: -------- 1837.99user 2575.95system 2:43.09elapsed 2706%CPU (0avgtext+0avgdata 846520maxresident)k 1838.32user 2555.15system 2:42.52elapsed 2709%CPU (0avgtext+0avgdata 846520maxresident)k 1843.02user 2561.55system 2:43.35elapsed 2702%CPU (0avgtext+0avgdata 846520maxresident)k Summary: User: 1839.78 System: 2564.22: 2575.95 2555.15 2561.55 Real: 162.99 Test 5 times using enabled all mTHP pages: ========================================== With: ----- 1796.44user 2937.33system 2:59.09elapsed 2643%CPU (0avgtext+0avgdata 846936maxresident)k 1802.55user 3002.32system 2:54.68elapsed 2750%CPU (0avgtext+0avgdata 847072maxresident)k 1806.59user 2986.53system 2:55.17elapsed 2736%CPU (0avgtext+0avgdata 847092maxresident)k 1803.27user 2982.40system 2:54.49elapsed 2742%CPU (0avgtext+0avgdata 846796maxresident)k 1807.43user 3036.08system 2:56.06elapsed 2751%CPU (0avgtext+0avgdata 846488maxresident)k Summary (~8.4% improvement of system time): User: 1803.25 System: 2988.93: 2937.33 3002.32 2986.53 2982.40 3036.08 Real: 175.90 mTHP swapout status: /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout:347721 /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout_fallback:3110 /sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout:3365 /sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout_fallback:8269 /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout:24 /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout_fallback:3341 /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout:145 /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout_fallback:5038 /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout:322737 /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback:36808 /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout:380455 /sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout_fallback:1010 /sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout:24973 /sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout_fallback:13223 /sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout:197348 /sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout_fallback:80541 Without: -------- 1794.41user 3151.29system 3:05.97elapsed 2659%CPU (0avgtext+0avgdata 846704maxresident)k 1810.27user 3304.48system 3:05.38elapsed 2759%CPU (0avgtext+0avgdata 846636maxresident)k 1809.84user 3254.85system 3:03.83elapsed 2755%CPU (0avgtext+0avgdata 846952maxresident)k 1813.54user 3259.56system 3:04.28elapsed 2752%CPU (0avgtext+0avgdata 846848maxresident)k 1829.97user 3338.40system 3:07.32elapsed 2759%CPU (0avgtext+0avgdata 847024maxresident)k Summary: User: 1811.61 System: 3261.72 : 3151.29 3304.48 3254.85 3259.56 3338.40 Real: 185.356 mTHP swapout status: hugepages-32kB/stats/swpout:35630 hugepages-32kB/stats/swpout_fallback:1809908 hugepages-512kB/stats/swpout:523 hugepages-512kB/stats/swpout_fallback:55235 hugepages-2048kB/stats/swpout:53 hugepages-2048kB/stats/swpout_fallback:17264 
hugepages-1024kB/stats/swpout:85 hugepages-1024kB/stats/swpout_fallback:24979 hugepages-64kB/stats/swpout:30117 hugepages-64kB/stats/swpout_fallback:1825399 hugepages-16kB/stats/swpout:42775 hugepages-16kB/stats/swpout_fallback:1951123 hugepages-256kB/stats/swpout:2326 hugepages-256kB/stats/swpout_fallback:170165 hugepages-128kB/stats/swpout:17925 hugepages-128kB/stats/swpout_fallback:1309757 This patch (of 9): Previously, the swap cluster used a cluster index as a pointer to construct a custom single link list type "swap_cluster_list". The next cluster pointer is shared with the cluster->count. It prevents puting the non free cluster into a list. Change the cluster to use the standard double link list instead. This allows tracing the nonfull cluster in the follow up patch. That way, it is faster to get to the nonfull cluster of that order. Remove the cluster getter/setter for accessing the cluster struct member. The list operation is protected by the swap_info_struct->lock. Change cluster code to use "struct swap_cluster_info *" to reference the cluster rather than by using index. That is more consistent with the list manipulation. It avoids the repeat adding index to the cluser_info. The code is easier to understand. Remove the cluster next pointer is NULL flag, the double link list can handle the empty list pretty well. The "swap_cluster_info" struct is two pointer bigger, because 512 swap entries share one swap_cluster_info struct, it has very little impact on the average memory usage per swap entry. For 1TB swapfile, the swap cluster data structure increases from 8MB to 24MB. Other than the list conversion, there is no real function change in this patch. Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-1-cb9c148b9297@kernel.org Signed-off-by: Chris Li Reported-by: Barry Song <21cnbao@gmail.com> Reviewed-by: "Huang, Ying" Cc: Hugh Dickins Cc: Kairui Song Cc: Kalesh Singh Cc: Ryan Roberts Signed-off-by: Andrew Morton --- include/linux/swap.h | 25 ++--- mm/swapfile.c | 226 ++++++++++++------------------------------- 2 files changed, 71 insertions(+), 180 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 5b920fa2315b..ead568f04f81 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -243,22 +243,20 @@ enum { * free clusters are organized into a list. We fetch an entry from the list to * get a free cluster. * - * The data field stores next cluster if the cluster is free or cluster usage - * counter otherwise. The flags field determines if a cluster is free. This is - * protected by swap_info_struct.lock. + * The flags field determines if a cluster is free. This is + * protected by cluster lock. */ struct swap_cluster_info { spinlock_t lock; /* * Protect swap_cluster_info fields - * and swap_info_struct->swap_map - * elements correspond to the swap - * cluster + * other than list, and swap_info_struct->swap_map + * elements corresponding to the swap cluster. 
*/ - unsigned int data:24; - unsigned int flags:8; + u16 count; + u8 flags; + struct list_head list; }; #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */ /* * The first page in the swap file is the swap header, which is always marked @@ -283,11 +281,6 @@ struct percpu_cluster { unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ }; -struct swap_cluster_list { - struct swap_cluster_info head; - struct swap_cluster_info tail; -}; - /* * The in-memory structure used to track swap areas. */ @@ -300,7 +293,7 @@ struct swap_info_struct { unsigned int max; /* extent of the swap_map */ unsigned char *swap_map; /* vmalloc'ed array of usage counts */ struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ - struct swap_cluster_list free_clusters; /* free clusters list */ + struct list_head free_clusters; /* free clusters list */ unsigned int lowest_bit; /* index of first free in swap_map */ unsigned int highest_bit; /* index of last free in swap_map */ unsigned int pages; /* total of usable pages of swap */ @@ -331,7 +324,7 @@ struct swap_info_struct { * list. */ struct work_struct discard_work; /* discard worker */ - struct swap_cluster_list discard_clusters; /* discard clusters list */ + struct list_head discard_clusters; /* discard clusters list */ struct plist_node avail_lists[]; /* * entries in swap_avail_heads, one * entry per node. diff --git a/mm/swapfile.c b/mm/swapfile.c index 02bb4986a2c2..46d98d804451 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -290,62 +290,15 @@ static void discard_swap_cluster(struct swap_info_struct *si, #endif #define LATENCY_LIMIT 256 -static inline void cluster_set_flag(struct swap_cluster_info *info, - unsigned int flag) -{ - info->flags = flag; -} - -static inline unsigned int cluster_count(struct swap_cluster_info *info) -{ - return info->data; -} - -static inline void cluster_set_count(struct swap_cluster_info *info, - unsigned int c) -{ - info->data = c; -} - -static inline void cluster_set_count_flag(struct swap_cluster_info *info, - unsigned int c, unsigned int f) -{ - info->flags = f; - info->data = c; -} - -static inline unsigned int cluster_next(struct swap_cluster_info *info) -{ - return info->data; -} - -static inline void cluster_set_next(struct swap_cluster_info *info, - unsigned int n) -{ - info->data = n; -} - -static inline void cluster_set_next_flag(struct swap_cluster_info *info, - unsigned int n, unsigned int f) -{ - info->flags = f; - info->data = n; -} - static inline bool cluster_is_free(struct swap_cluster_info *info) { return info->flags & CLUSTER_FLAG_FREE; } -static inline bool cluster_is_null(struct swap_cluster_info *info) +static inline unsigned int cluster_index(struct swap_info_struct *si, + struct swap_cluster_info *ci) { - return info->flags & CLUSTER_FLAG_NEXT_NULL; -} - -static inline void cluster_set_null(struct swap_cluster_info *info) -{ - info->flags = CLUSTER_FLAG_NEXT_NULL; - info->data = 0; + return ci - si->cluster_info; } static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si, @@ -394,65 +347,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si, spin_unlock(&si->lock); } -static inline bool cluster_list_empty(struct swap_cluster_list *list) -{ - return cluster_is_null(&list->head); -} - -static inline unsigned int cluster_list_first(struct swap_cluster_list *list) -{ - return cluster_next(&list->head); -} - -static void cluster_list_init(struct 
swap_cluster_list *list) -{ - cluster_set_null(&list->head); - cluster_set_null(&list->tail); -} - -static void cluster_list_add_tail(struct swap_cluster_list *list, - struct swap_cluster_info *ci, - unsigned int idx) -{ - if (cluster_list_empty(list)) { - cluster_set_next_flag(&list->head, idx, 0); - cluster_set_next_flag(&list->tail, idx, 0); - } else { - struct swap_cluster_info *ci_tail; - unsigned int tail = cluster_next(&list->tail); - - /* - * Nested cluster lock, but both cluster locks are - * only acquired when we held swap_info_struct->lock - */ - ci_tail = ci + tail; - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING); - cluster_set_next(ci_tail, idx); - spin_unlock(&ci_tail->lock); - cluster_set_next_flag(&list->tail, idx, 0); - } -} - -static unsigned int cluster_list_del_first(struct swap_cluster_list *list, - struct swap_cluster_info *ci) -{ - unsigned int idx; - - idx = cluster_next(&list->head); - if (cluster_next(&list->tail) == idx) { - cluster_set_null(&list->head); - cluster_set_null(&list->tail); - } else - cluster_set_next_flag(&list->head, - cluster_next(&ci[idx]), 0); - - return idx; -} - /* Add a cluster to discard list and schedule it to do discard */ static void swap_cluster_schedule_discard(struct swap_info_struct *si, - unsigned int idx) + struct swap_cluster_info *ci) { + unsigned int idx = cluster_index(si, ci); /* * If scan_swap_map_slots() can't find a free cluster, it will check * si->swap_map directly. To make sure the discarding cluster isn't @@ -462,17 +361,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, memset(si->swap_map + idx * SWAPFILE_CLUSTER, SWAP_MAP_BAD, SWAPFILE_CLUSTER); - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx); - + list_add_tail(&ci->list, &si->discard_clusters); schedule_work(&si->discard_work); } -static void __free_cluster(struct swap_info_struct *si, unsigned long idx) +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { - struct swap_cluster_info *ci = si->cluster_info; - - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE); - cluster_list_add_tail(&si->free_clusters, ci, idx); + ci->flags = CLUSTER_FLAG_FREE; + list_add_tail(&ci->list, &si->free_clusters); } /* @@ -481,24 +377,25 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx) */ static void swap_do_scheduled_discard(struct swap_info_struct *si) { - struct swap_cluster_info *info, *ci; + struct swap_cluster_info *ci; unsigned int idx; - info = si->cluster_info; - - while (!cluster_list_empty(&si->discard_clusters)) { - idx = cluster_list_del_first(&si->discard_clusters, info); + while (!list_empty(&si->discard_clusters)) { + ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list); + list_del(&ci->list); + idx = cluster_index(si, ci); spin_unlock(&si->lock); discard_swap_cluster(si, idx * SWAPFILE_CLUSTER, SWAPFILE_CLUSTER); spin_lock(&si->lock); - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER); - __free_cluster(si, idx); + + spin_lock(&ci->lock); + __free_cluster(si, ci); memset(si->swap_map + idx * SWAPFILE_CLUSTER, 0, SWAPFILE_CLUSTER); - unlock_cluster(ci); + spin_unlock(&ci->lock); } } @@ -521,20 +418,21 @@ static void swap_users_ref_free(struct percpu_ref *ref) complete(&si->comp); } -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx) +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx) { - struct swap_cluster_info *ci = si->cluster_info; + struct swap_cluster_info *ci = 
list_first_entry(&si->free_clusters, + struct swap_cluster_info, list); - VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx); - cluster_list_del_first(&si->free_clusters, ci); - cluster_set_count_flag(ci + idx, 0, 0); + VM_BUG_ON(cluster_index(si, ci) != idx); + list_del(&ci->list); + ci->count = 0; + ci->flags = 0; + return ci; } -static void free_cluster(struct swap_info_struct *si, unsigned long idx) +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { - struct swap_cluster_info *ci = si->cluster_info + idx; - - VM_BUG_ON(cluster_count(ci) != 0); + VM_BUG_ON(ci->count != 0); /* * If the swap is discardable, prepare discard the cluster * instead of free it immediately. The cluster will be freed @@ -542,11 +440,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx) */ if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) == (SWP_WRITEOK | SWP_PAGE_DISCARD)) { - swap_cluster_schedule_discard(si, idx); + swap_cluster_schedule_discard(si, ci); return; } - __free_cluster(si, idx); + __free_cluster(si, ci); } /* @@ -559,15 +457,15 @@ static void add_cluster_info_page(struct swap_info_struct *p, unsigned long count) { unsigned long idx = page_nr / SWAPFILE_CLUSTER; + struct swap_cluster_info *ci = cluster_info + idx; if (!cluster_info) return; - if (cluster_is_free(&cluster_info[idx])) + if (cluster_is_free(ci)) alloc_cluster(p, idx); - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER); - cluster_set_count(&cluster_info[idx], - cluster_count(&cluster_info[idx]) + count); + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER); + ci->count += count; } /* @@ -581,24 +479,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p, } /* - * The cluster corresponding to page_nr decreases one usage. If the usage - * counter becomes 0, which means no page in the cluster is in using, we can - * optionally discard the cluster and add it to free cluster list. + * The cluster ci decreases one usage. If the usage counter becomes 0, + * which means no page in the cluster is in use, we can optionally discard + * the cluster and add it to free cluster list. 
*/ -static void dec_cluster_info_page(struct swap_info_struct *p, - struct swap_cluster_info *cluster_info, unsigned long page_nr) +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci) { - unsigned long idx = page_nr / SWAPFILE_CLUSTER; - - if (!cluster_info) + if (!p->cluster_info) return; - VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0); - cluster_set_count(&cluster_info[idx], - cluster_count(&cluster_info[idx]) - 1); + VM_BUG_ON(ci->count == 0); + ci->count--; - if (cluster_count(&cluster_info[idx]) == 0) - free_cluster(p, idx); + if (!ci->count) + free_cluster(p, ci); } /* @@ -611,10 +505,12 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, { struct percpu_cluster *percpu_cluster; bool conflict; + struct swap_cluster_info *first = list_first_entry(&si->free_clusters, + struct swap_cluster_info, list); offset /= SWAPFILE_CLUSTER; - conflict = !cluster_list_empty(&si->free_clusters) && - offset != cluster_list_first(&si->free_clusters) && + conflict = !list_empty(&si->free_clusters) && + offset != cluster_index(si, first) && cluster_is_free(&si->cluster_info[offset]); if (!conflict) @@ -655,10 +551,10 @@ new_cluster: cluster = this_cpu_ptr(si->percpu_cluster); tmp = cluster->next[order]; if (tmp == SWAP_NEXT_INVALID) { - if (!cluster_list_empty(&si->free_clusters)) { - tmp = cluster_next(&si->free_clusters.head) * - SWAPFILE_CLUSTER; - } else if (!cluster_list_empty(&si->discard_clusters)) { + if (!list_empty(&si->free_clusters)) { + ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list); + tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER; + } else if (!list_empty(&si->discard_clusters)) { /* * we don't have free cluster but have some clusters in * discarding, do discard now and reclaim them, then @@ -1062,8 +958,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) ci = lock_cluster(si, offset); memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); - cluster_set_count_flag(ci, 0, 0); - free_cluster(si, idx); + ci->count = 0; + ci->flags = 0; + free_cluster(si, ci); unlock_cluster(ci); swap_range_free(si, offset, SWAPFILE_CLUSTER); } @@ -1336,7 +1233,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) count = p->swap_map[offset]; VM_BUG_ON(count != SWAP_HAS_CACHE); p->swap_map[offset] = 0; - dec_cluster_info_page(p, p->cluster_info, offset); + dec_cluster_info_page(p, ci); unlock_cluster(ci); mem_cgroup_uncharge_swap(entry, 1); @@ -3011,8 +2908,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, nr_good_pages = maxpages - 1; /* omit header page */ - cluster_list_init(&p->free_clusters); - cluster_list_init(&p->discard_clusters); + INIT_LIST_HEAD(&p->free_clusters); + INIT_LIST_HEAD(&p->discard_clusters); for (i = 0; i < swap_header->info.nr_badpages; i++) { unsigned int page_nr = swap_header->info.badpages[i]; @@ -3063,14 +2960,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, for (k = 0; k < SWAP_CLUSTER_COLS; k++) { j = (k + col) % SWAP_CLUSTER_COLS; for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) { + struct swap_cluster_info *ci; idx = i * SWAP_CLUSTER_COLS + j; + ci = cluster_info + idx; if (idx >= nr_clusters) continue; - if (cluster_count(&cluster_info[idx])) + if (ci->count) continue; - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE); - cluster_list_add_tail(&p->free_clusters, cluster_info, - idx); + ci->flags = CLUSTER_FLAG_FREE; + list_add_tail(&ci->list, &p->free_clusters); } } 
return nr_extents; From d07a46a4ac18786e7f4c98fb08525ed80dd1f642 Mon Sep 17 00:00:00 2001 From: Chris Li Date: Tue, 30 Jul 2024 23:49:14 -0700 Subject: [PATCH 152/416] mm: swap: mTHP allocate swap entries from nonfull list MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Track the nonfull cluster as well as the empty cluster on lists. Each order has one nonfull cluster list. The cluster will remember which order it was used during new cluster allocation. When the cluster has free entry, add to the nonfull[order] list.  When the free cluster list is empty, also allocate from the nonempty list of that order. This improves the mTHP swap allocation success rate. There are limitations if the distribution of numbers of different orders of mTHP changes a lot. e.g. there are a lot of nonfull cluster assign to order A while later time there are a lot of order B allocation while very little allocation in order A. Currently the cluster used by order A will not reused by order B unless the cluster is 100% empty. Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-2-cb9c148b9297@kernel.org Signed-off-by: Chris Li Reported-by: Barry Song <21cnbao@gmail.com> Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kairui Song Cc: Kalesh Singh Cc: Ryan Roberts Signed-off-by: Andrew Morton --- include/linux/swap.h | 4 ++++ mm/swapfile.c | 38 +++++++++++++++++++++++++++++++++++--- 2 files changed, 39 insertions(+), 3 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index ead568f04f81..7007a3c2336c 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -254,9 +254,11 @@ struct swap_cluster_info { */ u16 count; u8 flags; + u8 order; struct list_head list; }; #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */ /* * The first page in the swap file is the swap header, which is always marked @@ -294,6 +296,8 @@ struct swap_info_struct { unsigned char *swap_map; /* vmalloc'ed array of usage counts */ struct swap_cluster_info *cluster_info; /* cluster info. 
Only for SSD */ struct list_head free_clusters; /* free clusters list */ + struct list_head nonfull_clusters[SWAP_NR_ORDERS]; + /* list of cluster that contains at least one free slot */ unsigned int lowest_bit; /* index of first free in swap_map */ unsigned int highest_bit; /* index of last free in swap_map */ unsigned int pages; /* total of usable pages of swap */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 46d98d804451..b98df62c6c43 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -361,14 +361,22 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, memset(si->swap_map + idx * SWAPFILE_CLUSTER, SWAP_MAP_BAD, SWAPFILE_CLUSTER); - list_add_tail(&ci->list, &si->discard_clusters); + VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE); + if (ci->flags & CLUSTER_FLAG_NONFULL) + list_move_tail(&ci->list, &si->discard_clusters); + else + list_add_tail(&ci->list, &si->discard_clusters); + ci->flags = 0; schedule_work(&si->discard_work); } static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { + if (ci->flags & CLUSTER_FLAG_NONFULL) + list_move_tail(&ci->list, &si->free_clusters); + else + list_add_tail(&ci->list, &si->free_clusters); ci->flags = CLUSTER_FLAG_FREE; - list_add_tail(&ci->list, &si->free_clusters); } /* @@ -491,8 +499,15 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste VM_BUG_ON(ci->count == 0); ci->count--; - if (!ci->count) + if (!ci->count) { free_cluster(p, ci); + return; + } + + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]); + ci->flags |= CLUSTER_FLAG_NONFULL; + } } /* @@ -553,6 +568,19 @@ new_cluster: if (tmp == SWAP_NEXT_INVALID) { if (!list_empty(&si->free_clusters)) { ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list); + list_del(&ci->list); + spin_lock(&ci->lock); + ci->order = order; + ci->flags = 0; + spin_unlock(&ci->lock); + tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER; + } else if (!list_empty(&si->nonfull_clusters[order])) { + ci = list_first_entry(&si->nonfull_clusters[order], + struct swap_cluster_info, list); + list_del(&ci->list); + spin_lock(&ci->lock); + ci->flags = 0; + spin_unlock(&ci->lock); tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER; } else if (!list_empty(&si->discard_clusters)) { /* @@ -959,6 +987,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) ci = lock_cluster(si, offset); memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); ci->count = 0; + ci->order = 0; ci->flags = 0; free_cluster(si, ci); unlock_cluster(ci); @@ -2911,6 +2940,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, INIT_LIST_HEAD(&p->free_clusters); INIT_LIST_HEAD(&p->discard_clusters); + for (i = 0; i < SWAP_NR_ORDERS; i++) + INIT_LIST_HEAD(&p->nonfull_clusters[i]); + for (i = 0; i < swap_header->info.nr_badpages; i++) { unsigned int page_nr = swap_header->info.badpages[i]; if (page_nr == 0 || page_nr > swap_header->info.last_page) From 5f843a9a3a1e865fbf349419bde39977c2e7d3d1 Mon Sep 17 00:00:00 2001 From: Chris Li Date: Tue, 30 Jul 2024 23:49:15 -0700 Subject: [PATCH 153/416] mm: swap: separate SSD allocation from scan_swap_map_slots() Previously the SSD and HDD share the same swap_map scan loop in scan_swap_map_slots(). This function is complex and hard to flow the execution flow. scan_swap_map_try_ssd_cluster() can already do most of the heavy lifting to locate the candidate swap range in the cluster. 
However it needs to go back to scan_swap_map_slots() to check conflict and then perform the allocation. When scan_swap_map_try_ssd_cluster() failed, it still depended on the scan_swap_map_slots() to do brute force scanning of the swap_map. When the swapfile is large and almost full, it will take some CPU time to go through the swap_map array. Get rid of the cluster allocation dependency on the swap_map scan loop in scan_swap_map_slots(). Streamline the cluster allocation code path. No more conflict checks. For order 0 swap entry, when run out of free and nonfull list. It will allocate from the higher order nonfull cluster list. Users should see less CPU time spent on searching the free swap slot when swapfile is almost full. [ryncsn@gmail.com: fix array-bounds error with CONFIG_THP_SWAP=n] Link: https://lkml.kernel.org/r/CAMgjq7Bz0DY+rY0XgCoH7-Q=uHLdo3omi8kUr4ePDweNyofsbQ@mail.gmail.com Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-3-cb9c148b9297@kernel.org Signed-off-by: Chris Li Signed-off-by: Kairui Song Reported-by: Barry Song <21cnbao@gmail.com> Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kalesh Singh Cc: Ryan Roberts Signed-off-by: Andrew Morton --- mm/swapfile.c | 312 ++++++++++++++++++++++++++++---------------------- 1 file changed, 174 insertions(+), 138 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index b98df62c6c43..f37029997dbf 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -53,6 +53,8 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); static void free_swap_count_continuations(struct swap_info_struct *); +static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset, + unsigned int nr_entries); static DEFINE_SPINLOCK(swap_lock); static unsigned int nr_swapfiles; @@ -301,6 +303,12 @@ static inline unsigned int cluster_index(struct swap_info_struct *si, return ci - si->cluster_info; } +static inline unsigned int cluster_offset(struct swap_info_struct *si, + struct swap_cluster_info *ci) +{ + return cluster_index(si, ci) * SWAPFILE_CLUSTER; +} + static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si, unsigned long offset) { @@ -372,11 +380,15 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { + lockdep_assert_held(&si->lock); + lockdep_assert_held(&ci->lock); + if (ci->flags & CLUSTER_FLAG_NONFULL) list_move_tail(&ci->list, &si->free_clusters); else list_add_tail(&ci->list, &si->free_clusters); ci->flags = CLUSTER_FLAG_FREE; + ci->order = 0; } /* @@ -431,9 +443,11 @@ static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsi struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list); + lockdep_assert_held(&si->lock); + lockdep_assert_held(&ci->lock); VM_BUG_ON(cluster_index(si, ci) != idx); + VM_BUG_ON(ci->count); list_del(&ci->list); - ci->count = 0; ci->flags = 0; return ci; } @@ -441,6 +455,8 @@ static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsi static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { VM_BUG_ON(ci->count != 0); + lockdep_assert_held(&si->lock); + lockdep_assert_held(&ci->lock); /* * If the swap is discardable, prepare discard the cluster * instead of free it immediately. 
The cluster will be freed @@ -497,6 +513,9 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste return; VM_BUG_ON(ci->count == 0); + VM_BUG_ON(cluster_is_free(ci)); + lockdep_assert_held(&p->lock); + lockdep_assert_held(&ci->lock); ci->count--; if (!ci->count) { @@ -505,121 +524,155 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste } if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { + VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE); list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]); - ci->flags |= CLUSTER_FLAG_NONFULL; + ci->flags = CLUSTER_FLAG_NONFULL; } } -/* - * It's possible scan_swap_map_slots() uses a free cluster in the middle of free - * cluster list. Avoiding such abuse to avoid list corruption. - */ -static bool -scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, - unsigned long offset, int order) +static inline bool cluster_scan_range(struct swap_info_struct *si, unsigned int start, + unsigned int nr_pages) { - struct percpu_cluster *percpu_cluster; - bool conflict; - struct swap_cluster_info *first = list_first_entry(&si->free_clusters, - struct swap_cluster_info, list); + unsigned char *p = si->swap_map + start; + unsigned char *end = p + nr_pages; - offset /= SWAPFILE_CLUSTER; - conflict = !list_empty(&si->free_clusters) && - offset != cluster_index(si, first) && - cluster_is_free(&si->cluster_info[offset]); - - if (!conflict) - return false; - - percpu_cluster = this_cpu_ptr(si->percpu_cluster); - percpu_cluster->next[order] = SWAP_NEXT_INVALID; - return true; -} - -static inline bool swap_range_empty(char *swap_map, unsigned int start, - unsigned int nr_pages) -{ - unsigned int i; - - for (i = 0; i < nr_pages; i++) { - if (swap_map[start + i]) + while (p < end) + if (*p++) return false; - } return true; } + +static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci, + unsigned int start, unsigned char usage, + unsigned int order) +{ + unsigned int nr_pages = 1 << order; + + if (cluster_is_free(ci)) { + if (nr_pages < SWAPFILE_CLUSTER) { + list_move_tail(&ci->list, &si->nonfull_clusters[order]); + ci->flags = CLUSTER_FLAG_NONFULL; + } + ci->order = order; + } + + memset(si->swap_map + start, usage, nr_pages); + swap_range_alloc(si, start, nr_pages); + ci->count += nr_pages; + + if (ci->count == SWAPFILE_CLUSTER) { + VM_BUG_ON(!(ci->flags & (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL))); + list_del(&ci->list); + ci->flags = 0; + } +} + +static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned long offset, + unsigned int *foundp, unsigned int order, + unsigned char usage) +{ + unsigned long start = offset & ~(SWAPFILE_CLUSTER - 1); + unsigned long end = min(start + SWAPFILE_CLUSTER, si->max); + unsigned int nr_pages = 1 << order; + struct swap_cluster_info *ci; + + if (end < nr_pages) + return SWAP_NEXT_INVALID; + end -= nr_pages; + + ci = lock_cluster(si, offset); + if (ci->count + nr_pages > SWAPFILE_CLUSTER) { + offset = SWAP_NEXT_INVALID; + goto done; + } + + while (offset <= end) { + if (cluster_scan_range(si, offset, nr_pages)) { + cluster_alloc_range(si, ci, offset, usage, order); + *foundp = offset; + if (ci->count == SWAPFILE_CLUSTER) { + offset = SWAP_NEXT_INVALID; + goto done; + } + offset += nr_pages; + break; + } + offset += nr_pages; + } + if (offset > end) + offset = SWAP_NEXT_INVALID; +done: + unlock_cluster(ci); + return offset; +} + /* * Try to get swap entries with specified order from current cpu's swap entry * pool (a 
cluster). This might involve allocating a new cluster for current CPU * too. */ -static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, - unsigned long *offset, unsigned long *scan_base, int order) +static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order, + unsigned char usage) { - unsigned int nr_pages = 1 << order; struct percpu_cluster *cluster; - struct swap_cluster_info *ci; - unsigned int tmp, max; + struct swap_cluster_info *ci, *n; + unsigned int offset, found = 0; new_cluster: + lockdep_assert_held(&si->lock); cluster = this_cpu_ptr(si->percpu_cluster); - tmp = cluster->next[order]; - if (tmp == SWAP_NEXT_INVALID) { - if (!list_empty(&si->free_clusters)) { - ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list); - list_del(&ci->list); - spin_lock(&ci->lock); - ci->order = order; - ci->flags = 0; - spin_unlock(&ci->lock); - tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER; - } else if (!list_empty(&si->nonfull_clusters[order])) { - ci = list_first_entry(&si->nonfull_clusters[order], - struct swap_cluster_info, list); - list_del(&ci->list); - spin_lock(&ci->lock); - ci->flags = 0; - spin_unlock(&ci->lock); - tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER; - } else if (!list_empty(&si->discard_clusters)) { - /* - * we don't have free cluster but have some clusters in - * discarding, do discard now and reclaim them, then - * reread cluster_next_cpu since we dropped si->lock - */ - swap_do_scheduled_discard(si); - *scan_base = this_cpu_read(*si->cluster_next_cpu); - *offset = *scan_base; - goto new_cluster; - } else - return false; + offset = cluster->next[order]; + if (offset) { + offset = alloc_swap_scan_cluster(si, offset, &found, order, usage); + if (found) + goto done; } - /* - * Other CPUs can use our cluster if they can't find a free cluster, - * check if there is still free entry in the cluster, maintaining - * natural alignment. - */ - max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER)); - if (tmp < max) { - ci = lock_cluster(si, tmp); - while (tmp < max) { - if (swap_range_empty(si->swap_map, tmp, nr_pages)) - break; - tmp += nr_pages; - } - unlock_cluster(ci); + if (!list_empty(&si->free_clusters)) { + ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list); + offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); + VM_BUG_ON(!found); + goto done; } - if (tmp >= max) { - cluster->next[order] = SWAP_NEXT_INVALID; + + if (order < PMD_ORDER) { + list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) { + offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), + &found, order, usage); + if (found) + goto done; + } + } + + if (!list_empty(&si->discard_clusters)) { + /* + * we don't have free cluster but have some clusters in + * discarding, do discard now and reclaim them, then + * reread cluster_next_cpu since we dropped si->lock + */ + swap_do_scheduled_discard(si); goto new_cluster; } - *offset = tmp; - *scan_base = tmp; - tmp += nr_pages; - cluster->next[order] = tmp < max ? 
tmp : SWAP_NEXT_INVALID; - return true; + + if (order) + goto done; + + for (int o = 1; o < SWAP_NR_ORDERS; o++) { + if (!list_empty(&si->nonfull_clusters[o])) { + ci = list_first_entry(&si->nonfull_clusters[o], struct swap_cluster_info, + list); + offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), + &found, 0, usage); + VM_BUG_ON(!found); + goto done; + } + } + +done: + cluster->next[order] = offset; + return found; } static void __del_from_avail_list(struct swap_info_struct *p) @@ -746,11 +799,29 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si, return false; } +static int cluster_alloc_swap(struct swap_info_struct *si, + unsigned char usage, int nr, + swp_entry_t slots[], int order) +{ + int n_ret = 0; + + VM_BUG_ON(!si->cluster_info); + + while (n_ret < nr) { + unsigned long offset = cluster_alloc_swap_entry(si, order, usage); + + if (!offset) + break; + slots[n_ret++] = swp_entry(si->type, offset); + } + + return n_ret; +} + static int scan_swap_map_slots(struct swap_info_struct *si, unsigned char usage, int nr, swp_entry_t slots[], int order) { - struct swap_cluster_info *ci; unsigned long offset; unsigned long scan_base; unsigned long last_in_cluster = 0; @@ -789,26 +860,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si, return 0; } + if (si->cluster_info) + return cluster_alloc_swap(si, usage, nr, slots, order); + si->flags += SWP_SCANNING; - /* - * Use percpu scan base for SSD to reduce lock contention on - * cluster and swap cache. For HDD, sequential access is more - * important. - */ - if (si->flags & SWP_SOLIDSTATE) - scan_base = this_cpu_read(*si->cluster_next_cpu); - else - scan_base = si->cluster_next; + + /* For HDD, sequential access is more important. */ + scan_base = si->cluster_next; offset = scan_base; - /* SSD algorithm */ - if (si->cluster_info) { - if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) { - if (order > 0) - goto no_page; - goto scan; - } - } else if (unlikely(!si->cluster_nr--)) { + if (unlikely(!si->cluster_nr--)) { if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) { si->cluster_nr = SWAPFILE_CLUSTER - 1; goto checks; @@ -819,8 +880,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si, /* * If seek is expensive, start searching for new cluster from * start of partition, to minimize the span of allocated swap. - * If seek is cheap, that is the SWP_SOLIDSTATE si->cluster_info - * case, just handled by scan_swap_map_try_ssd_cluster() above. */ scan_base = offset = si->lowest_bit; last_in_cluster = offset + SWAPFILE_CLUSTER - 1; @@ -848,19 +907,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si, } checks: - if (si->cluster_info) { - while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) { - /* take a break if we already got some slots */ - if (n_ret) - goto done; - if (!scan_swap_map_try_ssd_cluster(si, &offset, - &scan_base, order)) { - if (order > 0) - goto no_page; - goto scan; - } - } - } if (!(si->flags & SWP_WRITEOK)) goto no_page; if (!si->highest_bit) @@ -868,11 +914,9 @@ checks: if (offset > si->highest_bit) scan_base = offset = si->lowest_bit; - ci = lock_cluster(si, offset); /* reuse swap entry of cache-only swap if not busy. 
*/ if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) { int swap_was_freed; - unlock_cluster(ci); spin_unlock(&si->lock); swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); spin_lock(&si->lock); @@ -883,15 +927,12 @@ checks: } if (si->swap_map[offset]) { - unlock_cluster(ci); if (!n_ret) goto scan; else goto done; } memset(si->swap_map + offset, usage, nr_pages); - add_cluster_info_page(si, si->cluster_info, offset, nr_pages); - unlock_cluster(ci); swap_range_alloc(si, offset, nr_pages); slots[n_ret++] = swp_entry(si->type, offset); @@ -912,13 +953,7 @@ checks: latency_ration = LATENCY_LIMIT; } - /* try to get more slots in cluster */ - if (si->cluster_info) { - if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) - goto checks; - if (order > 0) - goto done; - } else if (si->cluster_nr && !si->swap_map[++offset]) { + if (si->cluster_nr && !si->swap_map[++offset]) { /* non-ssd case, still more slots in cluster? */ --si->cluster_nr; goto checks; @@ -987,8 +1022,6 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) ci = lock_cluster(si, offset); memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); ci->count = 0; - ci->order = 0; - ci->flags = 0; free_cluster(si, ci); unlock_cluster(ci); swap_range_free(si, offset, SWAPFILE_CLUSTER); @@ -2997,8 +3030,11 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, ci = cluster_info + idx; if (idx >= nr_clusters) continue; - if (ci->count) + if (ci->count) { + ci->flags = CLUSTER_FLAG_NONFULL; + list_add_tail(&ci->list, &p->nonfull_clusters[0]); continue; + } ci->flags = CLUSTER_FLAG_FREE; list_add_tail(&ci->list, &p->free_clusters); } From 3b2561b5daeb3531c011491e9a6d2b934cc8f49f Mon Sep 17 00:00:00 2001 From: Kairui Song Date: Tue, 30 Jul 2024 23:49:16 -0700 Subject: [PATCH 154/416] mm: swap: clean up initialization helper At this point, alloc_cluster is never called already, and inc_cluster_info_page is called by initialization only, a lot of dead code can be dropped. Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-4-cb9c148b9297@kernel.org Signed-off-by: Kairui Song Reported-by: Barry Song <21cnbao@gmail.com> Cc: Chris Li Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kalesh Singh Cc: Ryan Roberts Signed-off-by: Andrew Morton --- mm/swapfile.c | 52 ++++++++++++++------------------------------------- 1 file changed, 14 insertions(+), 38 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index f37029997dbf..513ac75d4fc0 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -438,20 +438,6 @@ static void swap_users_ref_free(struct percpu_ref *ref) complete(&si->comp); } -static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx) -{ - struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, - struct swap_cluster_info, list); - - lockdep_assert_held(&si->lock); - lockdep_assert_held(&ci->lock); - VM_BUG_ON(cluster_index(si, ci) != idx); - VM_BUG_ON(ci->count); - list_del(&ci->list); - ci->flags = 0; - return ci; -} - static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { VM_BUG_ON(ci->count != 0); @@ -472,34 +458,24 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info * } /* - * The cluster corresponding to page_nr will be used. The cluster will be - * removed from free cluster list and its usage counter will be increased by - * count. 
- */ -static void add_cluster_info_page(struct swap_info_struct *p, - struct swap_cluster_info *cluster_info, unsigned long page_nr, - unsigned long count) -{ - unsigned long idx = page_nr / SWAPFILE_CLUSTER; - struct swap_cluster_info *ci = cluster_info + idx; - - if (!cluster_info) - return; - if (cluster_is_free(ci)) - alloc_cluster(p, idx); - - VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER); - ci->count += count; -} - -/* - * The cluster corresponding to page_nr will be used. The cluster will be - * removed from free cluster list and its usage counter will be increased by 1. + * The cluster corresponding to page_nr will be used. The cluster will not be + * added to free cluster list and its usage counter will be increased by 1. + * Only used for initialization. */ static void inc_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *cluster_info, unsigned long page_nr) { - add_cluster_info_page(p, cluster_info, page_nr, 1); + unsigned long idx = page_nr / SWAPFILE_CLUSTER; + struct swap_cluster_info *ci; + + if (!cluster_info) + return; + + ci = cluster_info + idx; + ci->count++; + + VM_BUG_ON(ci->count > SWAPFILE_CLUSTER); + VM_BUG_ON(ci->flags); } /* From 650975d2b181e30c9017c42cb3f6535287555b1e Mon Sep 17 00:00:00 2001 From: Kairui Song Date: Tue, 30 Jul 2024 23:49:17 -0700 Subject: [PATCH 155/416] mm: swap: skip slot cache on freeing for mTHP Currently when we are freeing mTHP folios from swap cache, we free then one by one and put each entry into swap slot cache. Slot cache is designed to reduce the overhead by batching the freeing, but mTHP swap entries are already continuous so they can be batch freed without it already, it saves litle overhead, or even increase overhead for larger mTHP. What's more, mTHP entries could stay in swap cache for a while. Contiguous swap entry is an rather rare resource so releasing them directly can help improve mTHP allocation success rate when under pressure. Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-5-cb9c148b9297@kernel.org Signed-off-by: Kairui Song Reported-by: Barry Song <21cnbao@gmail.com> Acked-by: Barry Song Cc: Chris Li Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kalesh Singh Cc: Ryan Roberts Signed-off-by: Andrew Morton --- mm/swapfile.c | 59 +++++++++++++++++++++++---------------------------- 1 file changed, 26 insertions(+), 33 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 513ac75d4fc0..1a3d5efd8a10 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -479,20 +479,21 @@ static void inc_cluster_info_page(struct swap_info_struct *p, } /* - * The cluster ci decreases one usage. If the usage counter becomes 0, + * The cluster ci decreases @nr_pages usage. If the usage counter becomes 0, * which means no page in the cluster is in use, we can optionally discard * the cluster and add it to free cluster list. 
*/ -static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci) +static void dec_cluster_info_page(struct swap_info_struct *p, + struct swap_cluster_info *ci, int nr_pages) { if (!p->cluster_info) return; - VM_BUG_ON(ci->count == 0); + VM_BUG_ON(ci->count < nr_pages); VM_BUG_ON(cluster_is_free(ci)); lockdep_assert_held(&p->lock); lockdep_assert_held(&ci->lock); - ci->count--; + ci->count -= nr_pages; if (!ci->count) { free_cluster(p, ci); @@ -990,19 +991,6 @@ no_page: return n_ret; } -static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) -{ - unsigned long offset = idx * SWAPFILE_CLUSTER; - struct swap_cluster_info *ci; - - ci = lock_cluster(si, offset); - memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); - ci->count = 0; - free_cluster(si, ci); - unlock_cluster(ci); - swap_range_free(si, offset, SWAPFILE_CLUSTER); -} - int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order) { int order = swap_entry_order(entry_order); @@ -1261,21 +1249,28 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p, return usage; } -static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) +/* + * Drop the last HAS_CACHE flag of swap entries, caller have to + * ensure all entries belong to the same cgroup. + */ +static void swap_entry_range_free(struct swap_info_struct *p, swp_entry_t entry, + unsigned int nr_pages) { - struct swap_cluster_info *ci; unsigned long offset = swp_offset(entry); - unsigned char count; + unsigned char *map = p->swap_map + offset; + unsigned char *map_end = map + nr_pages; + struct swap_cluster_info *ci; ci = lock_cluster(p, offset); - count = p->swap_map[offset]; - VM_BUG_ON(count != SWAP_HAS_CACHE); - p->swap_map[offset] = 0; - dec_cluster_info_page(p, ci); + do { + VM_BUG_ON(*map != SWAP_HAS_CACHE); + *map = 0; + } while (++map < map_end); + dec_cluster_info_page(p, ci, nr_pages); unlock_cluster(ci); - mem_cgroup_uncharge_swap(entry, 1); - swap_range_free(p, offset, 1); + mem_cgroup_uncharge_swap(entry, nr_pages); + swap_range_free(p, offset, nr_pages); } static void cluster_swap_free_nr(struct swap_info_struct *sis, @@ -1336,7 +1331,6 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) void put_swap_folio(struct folio *folio, swp_entry_t entry) { unsigned long offset = swp_offset(entry); - unsigned long idx = offset / SWAPFILE_CLUSTER; struct swap_cluster_info *ci; struct swap_info_struct *si; unsigned char *map; @@ -1349,19 +1343,18 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) return; ci = lock_cluster_or_swap_info(si, offset); - if (size == SWAPFILE_CLUSTER) { + if (size > 1) { map = si->swap_map + offset; - for (i = 0; i < SWAPFILE_CLUSTER; i++) { + for (i = 0; i < size; i++) { val = map[i]; VM_BUG_ON(!(val & SWAP_HAS_CACHE)); if (val == SWAP_HAS_CACHE) free_entries++; } - if (free_entries == SWAPFILE_CLUSTER) { + if (free_entries == size) { unlock_cluster_or_swap_info(si, ci); spin_lock(&si->lock); - mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER); - swap_free_cluster(si, idx); + swap_entry_range_free(si, entry, size); spin_unlock(&si->lock); return; } @@ -1406,7 +1399,7 @@ void swapcache_free_entries(swp_entry_t *entries, int n) for (i = 0; i < n; ++i) { p = swap_info_get_cont(entries[i], prev); if (p) - swap_entry_free(p, entries[i]); + swap_entry_range_free(p, entries[i], 1); prev = p; } if (p) From 862590ac3708e1cbbfb02a8ed78587b86ecba4ba Mon Sep 17 00:00:00 2001 From: Kairui Song Date: Tue, 30 Jul 2024 23:49:18 -0700 Subject: [PATCH 156/416] 
mm: swap: allow cache reclaim to skip slot cache Currently we free the reclaimed slots through slot cache even if the slot is required to be empty immediately. As a result the reclaim caller will see the slot still occupied even after a successful reclaim, and need to keep reclaiming until slot cache get flushed. This caused ineffective or over reclaim when SWAP is under stress. So introduce a new flag allowing the slot to be emptied bypassing the slot cache. [21cnbao@gmail.com: small folios should have nr_pages == 1 but not nr_page == 0] Link: https://lkml.kernel.org/r/20240805015324.45134-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-6-cb9c148b9297@kernel.org Signed-off-by: Kairui Song Reported-by: Barry Song <21cnbao@gmail.com> Cc: Chris Li Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kalesh Singh Cc: Ryan Roberts Signed-off-by: Andrew Morton --- mm/swapfile.c | 152 ++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 109 insertions(+), 43 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 1a3d5efd8a10..5e3c17c6bbd4 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -53,8 +53,15 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); static void free_swap_count_continuations(struct swap_info_struct *); +static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry, + unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset, unsigned int nr_entries); +static bool folio_swapcache_freeable(struct folio *folio); +static struct swap_cluster_info *lock_cluster_or_swap_info( + struct swap_info_struct *si, unsigned long offset); +static void unlock_cluster_or_swap_info(struct swap_info_struct *si, + struct swap_cluster_info *ci); static DEFINE_SPINLOCK(swap_lock); static unsigned int nr_swapfiles; @@ -129,8 +136,25 @@ static inline unsigned char swap_count(unsigned char ent) * corresponding page */ #define TTRS_UNMAPPED 0x2 -/* Reclaim the swap entry if swap is getting full*/ +/* Reclaim the swap entry if swap is getting full */ #define TTRS_FULL 0x4 +/* Reclaim directly, bypass the slot cache and don't touch device lock */ +#define TTRS_DIRECT 0x8 + +static bool swap_is_has_cache(struct swap_info_struct *si, + unsigned long offset, int nr_pages) +{ + unsigned char *map = si->swap_map + offset; + unsigned char *map_end = map + nr_pages; + + do { + VM_BUG_ON(!(*map & SWAP_HAS_CACHE)); + if (*map != SWAP_HAS_CACHE) + return false; + } while (++map < map_end); + + return true; +} /* * returns number of pages in the folio that backs the swap entry. If positive, @@ -141,12 +165,22 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si, unsigned long offset, unsigned long flags) { swp_entry_t entry = swp_entry(si->type, offset); + struct address_space *address_space = swap_address_space(entry); + struct swap_cluster_info *ci; struct folio *folio; - int ret = 0; + int ret, nr_pages; + bool need_reclaim; - folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry)); + folio = filemap_get_folio(address_space, swap_cache_index(entry)); if (IS_ERR(folio)) return 0; + + /* offset could point to the middle of a large folio */ + entry = folio->swap; + offset = swp_offset(entry); + nr_pages = folio_nr_pages(folio); + ret = -nr_pages; + /* * When this function is called from scan_swap_map_slots() and it's * called by vmscan.c at reclaiming folios. 
So we hold a folio lock @@ -154,14 +188,50 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si, * case and you should use folio_free_swap() with explicit folio_lock() * in usual operations. */ - if (folio_trylock(folio)) { - if ((flags & TTRS_ANYWAY) || - ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) || - ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio))) - ret = folio_free_swap(folio); - folio_unlock(folio); + if (!folio_trylock(folio)) + goto out; + + need_reclaim = ((flags & TTRS_ANYWAY) || + ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) || + ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio))); + if (!need_reclaim || !folio_swapcache_freeable(folio)) + goto out_unlock; + + /* + * It's safe to delete the folio from swap cache only if the folio's + * swap_map is HAS_CACHE only, which means the slots have no page table + * reference or pending writeback, and can't be allocated to others. + */ + ci = lock_cluster_or_swap_info(si, offset); + need_reclaim = swap_is_has_cache(si, offset, nr_pages); + unlock_cluster_or_swap_info(si, ci); + if (!need_reclaim) + goto out_unlock; + + if (!(flags & TTRS_DIRECT)) { + /* Free through slot cache */ + delete_from_swap_cache(folio); + folio_set_dirty(folio); + ret = nr_pages; + goto out_unlock; } - ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio); + + xa_lock_irq(&address_space->i_pages); + __delete_from_swap_cache(folio, entry, NULL); + xa_unlock_irq(&address_space->i_pages); + folio_ref_sub(folio, nr_pages); + folio_set_dirty(folio); + + spin_lock(&si->lock); + /* Only sinple page folio can be backed by zswap */ + if (nr_pages == 1) + zswap_invalidate(entry); + swap_entry_range_free(si, entry, nr_pages); + spin_unlock(&si->lock); + ret = nr_pages; +out_unlock: + folio_unlock(folio); +out: folio_put(folio); return ret; } @@ -895,7 +965,7 @@ checks: if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) { int swap_was_freed; spin_unlock(&si->lock); - swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); + swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT); spin_lock(&si->lock); /* entry was freed successfully, try to use this again */ if (swap_was_freed > 0) @@ -1333,9 +1403,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) unsigned long offset = swp_offset(entry); struct swap_cluster_info *ci; struct swap_info_struct *si; - unsigned char *map; - unsigned int i, free_entries = 0; - unsigned char val; int size = 1 << swap_entry_order(folio_order(folio)); si = _swap_info_get(entry); @@ -1343,23 +1410,14 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) return; ci = lock_cluster_or_swap_info(si, offset); - if (size > 1) { - map = si->swap_map + offset; - for (i = 0; i < size; i++) { - val = map[i]; - VM_BUG_ON(!(val & SWAP_HAS_CACHE)); - if (val == SWAP_HAS_CACHE) - free_entries++; - } - if (free_entries == size) { - unlock_cluster_or_swap_info(si, ci); - spin_lock(&si->lock); - swap_entry_range_free(si, entry, size); - spin_unlock(&si->lock); - return; - } + if (size > 1 && swap_is_has_cache(si, offset, size)) { + unlock_cluster_or_swap_info(si, ci); + spin_lock(&si->lock); + swap_entry_range_free(si, entry, size); + spin_unlock(&si->lock); + return; } - for (i = 0; i < size; i++, entry.val++) { + for (int i = 0; i < size; i++, entry.val++) { if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) { unlock_cluster_or_swap_info(si, ci); free_swap_slot(entry); @@ -1519,16 +1577,7 @@ static bool folio_swapped(struct folio *folio) return 
swap_page_trans_huge_swapped(si, entry, folio_order(folio)); } -/** - * folio_free_swap() - Free the swap space used for this folio. - * @folio: The folio to remove. - * - * If swap is getting full, or if there are no more mappings of this folio, - * then call folio_free_swap to free its swap space. - * - * Return: true if we were able to release the swap space. - */ -bool folio_free_swap(struct folio *folio) +static bool folio_swapcache_freeable(struct folio *folio) { VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); @@ -1536,8 +1585,6 @@ bool folio_free_swap(struct folio *folio) return false; if (folio_test_writeback(folio)) return false; - if (folio_swapped(folio)) - return false; /* * Once hibernation has begun to create its image of memory, @@ -1557,6 +1604,25 @@ bool folio_free_swap(struct folio *folio) if (pm_suspended_storage()) return false; + return true; +} + +/** + * folio_free_swap() - Free the swap space used for this folio. + * @folio: The folio to remove. + * + * If swap is getting full, or if there are no more mappings of this folio, + * then call folio_free_swap to free its swap space. + * + * Return: true if we were able to release the swap space. + */ +bool folio_free_swap(struct folio *folio) +{ + if (!folio_swapcache_freeable(folio)) + return false; + if (folio_swapped(folio)) + return false; + delete_from_swap_cache(folio); folio_set_dirty(folio); return true; @@ -1633,7 +1699,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) * to the next boundary. */ nr = __try_to_reclaim_swap(si, offset, - TTRS_UNMAPPED | TTRS_FULL); + TTRS_UNMAPPED | TTRS_FULL); if (nr == 0) nr = 1; else if (nr < 0) From 477cb7ba28892eda112c79d8f75d10edabfc3050 Mon Sep 17 00:00:00 2001 From: Kairui Song Date: Tue, 30 Jul 2024 23:49:19 -0700 Subject: [PATCH 157/416] mm: swap: add a fragment cluster list The swap cluster allocator now arranges the clusters in LRU style, so the "cold" clusters at the head of the nonfull lists are the ones that were used for allocation a long time ago and are still partially occupied. If the allocator can't find enough contiguous slots to satisfy a high order allocation, it's unlikely that slots will be freed on them in time to satisfy the allocation, at least not within a short period. As a result, nonfull cluster scanning will waste time repeatedly scanning the unusable head of the list. Also, multiple CPUs could contend on the same head cluster of the nonfull list. Unlike free clusters, which are removed from the list when any CPU starts using them, a nonfull cluster stays at the head. So introduce a new frag list: all scanned nonfull clusters will be moved to this list, both to avoid repeated scanning and to avoid contention. The frag list is still used as a fallback for allocations, so if one CPU fails to allocate one order of slots, it can still steal other CPUs' clusters. And order 0 will favor the fragmented clusters, to better protect nonfull clusters. If any slots on a fragment cluster are freed, move the cluster back to the nonfull list, indicating it is worth another scan. Compared to scanning upon freeing a slot, this keeps the scanning lazy and saves some CPU if there are still other clusters to use. It may seem unnecessary to keep the fragmented clusters on a list at all if they can't be used for a specific order allocation, but this will start to make sense once reclaim during scanning is ready.
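As a rough, stand-alone sketch of the list motion described above (the toy_cluster type and the alloc_from()/free_into() helpers are invented for illustration; the kernel instead operates on struct swap_cluster_info under si->lock with per-order lists and a frag counter), the policy can be modelled as:

#include <assert.h>
#include <stdio.h>

enum which_list { LIST_NONFULL, LIST_FRAG };

struct toy_cluster {
	int free_slots;			/* free swap slots left in this cluster */
	enum which_list list;		/* list the cluster currently sits on */
};

/* Allocation side: scanning a nonfull cluster demotes it to the frag list. */
static int alloc_from(struct toy_cluster *c)
{
	if (c->list == LIST_NONFULL)
		c->list = LIST_FRAG;	/* avoid rescanning and head contention */
	if (c->free_slots == 0)
		return -1;		/* caller moves on to the next cluster */
	c->free_slots--;
	return 0;
}

/* Free side: a freed slot lazily makes the cluster worth scanning again. */
static void free_into(struct toy_cluster *c)
{
	c->free_slots++;
	if (c->list == LIST_FRAG)
		c->list = LIST_NONFULL;
}

int main(void)
{
	struct toy_cluster c = { .free_slots = 1, .list = LIST_NONFULL };

	assert(alloc_from(&c) == 0 && c.list == LIST_FRAG);
	assert(alloc_from(&c) == -1);	/* exhausted, stays on the frag list */
	free_into(&c);
	assert(c.list == LIST_NONFULL);	/* eligible for another scan */
	puts("frag list sketch ok");
	return 0;
}

The only point of the sketch is the movement rule: a scan demotes a nonfull cluster to the frag list, and a later free lazily promotes it back.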
Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-7-cb9c148b9297@kernel.org Signed-off-by: Kairui Song Reported-by: Barry Song <21cnbao@gmail.com> Cc: Chris Li Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kalesh Singh Cc: Ryan Roberts Signed-off-by: Andrew Morton --- include/linux/swap.h | 3 +++ mm/swapfile.c | 41 +++++++++++++++++++++++++++++++++++++---- 2 files changed, 40 insertions(+), 4 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 7007a3c2336c..c526cd3e7b31 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -259,6 +259,7 @@ struct swap_cluster_info { }; #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ #define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */ +#define CLUSTER_FLAG_FRAG 4 /* This cluster is on nonfull list */ /* * The first page in the swap file is the swap header, which is always marked @@ -298,6 +299,8 @@ struct swap_info_struct { struct list_head free_clusters; /* free clusters list */ struct list_head nonfull_clusters[SWAP_NR_ORDERS]; /* list of cluster that contains at least one free slot */ + struct list_head frag_clusters[SWAP_NR_ORDERS]; + /* list of cluster that are fragmented or contented */ unsigned int lowest_bit; /* index of first free in swap_map */ unsigned int highest_bit; /* index of last free in swap_map */ unsigned int pages; /* total of usable pages of swap */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 5e3c17c6bbd4..b0b726a9bf0a 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -572,7 +572,10 @@ static void dec_cluster_info_page(struct swap_info_struct *p, if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE); - list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]); + if (ci->flags & CLUSTER_FLAG_FRAG) + list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]); + else + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]); ci->flags = CLUSTER_FLAG_NONFULL; } } @@ -610,7 +613,8 @@ static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_ ci->count += nr_pages; if (ci->count == SWAPFILE_CLUSTER) { - VM_BUG_ON(!(ci->flags & (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL))); + VM_BUG_ON(!(ci->flags & + (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG))); list_del(&ci->list); ci->flags = 0; } @@ -666,6 +670,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o struct percpu_cluster *cluster; struct swap_cluster_info *ci, *n; unsigned int offset, found = 0; + LIST_HEAD(fraged); new_cluster: lockdep_assert_held(&si->lock); @@ -686,13 +691,29 @@ new_cluster: if (order < PMD_ORDER) { list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) { + list_move_tail(&ci->list, &fraged); + ci->flags = CLUSTER_FLAG_FRAG; offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); if (found) - goto done; + break; } + + if (!found) { + list_for_each_entry_safe(ci, n, &si->frag_clusters[order], list) { + offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), + &found, order, usage); + if (found) + break; + } + } + + list_splice_tail(&fraged, &si->frag_clusters[order]); } + if (found) + goto done; + if (!list_empty(&si->discard_clusters)) { /* * we don't have free cluster but have some clusters in @@ -706,7 +727,17 @@ new_cluster: if (order) goto done; + /* Order 0 stealing from higher order */ for (int o = 1; o < SWAP_NR_ORDERS; o++) { + if (!list_empty(&si->frag_clusters[o])) { + ci = list_first_entry(&si->frag_clusters[o], + struct swap_cluster_info, list); + 
offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, + 0, usage); VM_BUG_ON(!found); goto done; } + if (!list_empty(&si->nonfull_clusters[o])) { ci = list_first_entry(&si->nonfull_clusters[o], struct swap_cluster_info, list); @@ -3008,8 +3039,10 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, INIT_LIST_HEAD(&p->free_clusters); INIT_LIST_HEAD(&p->discard_clusters); - for (i = 0; i < SWAP_NR_ORDERS; i++) + for (i = 0; i < SWAP_NR_ORDERS; i++) { INIT_LIST_HEAD(&p->nonfull_clusters[i]); + INIT_LIST_HEAD(&p->frag_clusters[i]); + } for (i = 0; i < swap_header->info.nr_badpages; i++) { unsigned int page_nr = swap_header->info.badpages[i]; From 661383c6111a38c88df61af6bfbcfacd2ff20a67 Mon Sep 17 00:00:00 2001 From: Kairui Song Date: Tue, 30 Jul 2024 23:49:20 -0700 Subject: [PATCH 158/416] mm: swap: reclaim the cached parts that got scanned This commit implements reclaim during scan for the cluster allocator. Cluster scanning was unable to reuse SWAP_HAS_CACHE slots, which could result in a low allocation success rate or early OOM. So, to ensure the maximum allocation success rate, integrate reclaiming with scanning: if a suitable range of swap slots is found but is fragmented due to HAS_CACHE, try to reclaim those slots. Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-8-cb9c148b9297@kernel.org Signed-off-by: Kairui Song Reported-by: Barry Song <21cnbao@gmail.com> Cc: Chris Li Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kalesh Singh Cc: Ryan Roberts Signed-off-by: Andrew Morton --- include/linux/swap.h | 1 + mm/swapfile.c | 144 +++++++++++++++++++++++++++++++++---------- 2 files changed, 112 insertions(+), 33 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index c526cd3e7b31..e372122e4f45 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -301,6 +301,7 @@ struct swap_info_struct { /* list of cluster that contains at least one free slot */ struct list_head frag_clusters[SWAP_NR_ORDERS]; /* list of cluster that are fragmented or contented */ + unsigned int frag_cluster_nr[SWAP_NR_ORDERS]; unsigned int lowest_bit; /* index of first free in swap_map */ unsigned int highest_bit; /* index of last free in swap_map */ unsigned int pages; /* total of usable pages of swap */ diff --git a/mm/swapfile.c b/mm/swapfile.c index b0b726a9bf0a..ed5f85504f37 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -513,6 +513,10 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info * VM_BUG_ON(ci->count != 0); lockdep_assert_held(&si->lock); lockdep_assert_held(&ci->lock); + + if (ci->flags & CLUSTER_FLAG_FRAG) + si->frag_cluster_nr[ci->order]--; + /* * If the swap is discardable, prepare discard the cluster * instead of free it immediately.
The cluster will be freed @@ -572,31 +576,84 @@ static void dec_cluster_info_page(struct swap_info_struct *p, if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE); - if (ci->flags & CLUSTER_FLAG_FRAG) + if (ci->flags & CLUSTER_FLAG_FRAG) { + p->frag_cluster_nr[ci->order]--; list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]); - else + } else { list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]); + } ci->flags = CLUSTER_FLAG_NONFULL; } } -static inline bool cluster_scan_range(struct swap_info_struct *si, unsigned int start, - unsigned int nr_pages) +static bool cluster_reclaim_range(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long start, unsigned long end) { - unsigned char *p = si->swap_map + start; - unsigned char *end = p + nr_pages; + unsigned char *map = si->swap_map; + unsigned long offset; - while (p < end) - if (*p++) + spin_unlock(&ci->lock); + spin_unlock(&si->lock); + + for (offset = start; offset < end; offset++) { + switch (READ_ONCE(map[offset])) { + case 0: + continue; + case SWAP_HAS_CACHE: + if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT) > 0) + continue; + goto out; + default: + goto out; + } + } +out: + spin_lock(&si->lock); + spin_lock(&ci->lock); + + /* + * Recheck the range no matter reclaim succeeded or not, the slot + * could have been be freed while we are not holding the lock. + */ + for (offset = start; offset < end; offset++) + if (READ_ONCE(map[offset])) return false; return true; } +static bool cluster_scan_range(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long start, unsigned int nr_pages) +{ + unsigned long offset, end = start + nr_pages; + unsigned char *map = si->swap_map; + bool need_reclaim = false; -static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned int start, unsigned char usage, - unsigned int order) + for (offset = start; offset < end; offset++) { + switch (READ_ONCE(map[offset])) { + case 0: + continue; + case SWAP_HAS_CACHE: + if (!vm_swap_full()) + return false; + need_reclaim = true; + continue; + default: + return false; + } + } + + if (need_reclaim) + return cluster_reclaim_range(si, ci, start, end); + + return true; +} + +static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci, + unsigned int start, unsigned char usage, + unsigned int order) { unsigned int nr_pages = 1 << order; @@ -615,6 +672,8 @@ static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_ if (ci->count == SWAPFILE_CLUSTER) { VM_BUG_ON(!(ci->flags & (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG))); + if (ci->flags & CLUSTER_FLAG_FRAG) + si->frag_cluster_nr[ci->order]--; list_del(&ci->list); ci->flags = 0; } @@ -640,7 +699,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne } while (offset <= end) { - if (cluster_scan_range(si, offset, nr_pages)) { + if (cluster_scan_range(si, ci, offset, nr_pages)) { cluster_alloc_range(si, ci, offset, usage, order); *foundp = offset; if (ci->count == SWAPFILE_CLUSTER) { @@ -668,9 +727,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o unsigned char usage) { struct percpu_cluster *cluster; - struct swap_cluster_info *ci, *n; + struct swap_cluster_info *ci; unsigned int offset, found = 0; - LIST_HEAD(fraged); new_cluster: lockdep_assert_held(&si->lock); @@ -690,25 +748,42 @@ new_cluster: } if (order < PMD_ORDER) { - 
list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) { - list_move_tail(&ci->list, &fraged); + unsigned int frags = 0; + + while (!list_empty(&si->nonfull_clusters[order])) { + ci = list_first_entry(&si->nonfull_clusters[order], + struct swap_cluster_info, list); + list_move_tail(&ci->list, &si->frag_clusters[order]); ci->flags = CLUSTER_FLAG_FRAG; + si->frag_cluster_nr[order]++; offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); + frags++; if (found) break; } if (!found) { - list_for_each_entry_safe(ci, n, &si->frag_clusters[order], list) { + /* + * Nonfull clusters are moved to frag tail if we reached + * here, count them too, don't over scan the frag list. + */ + while (frags < si->frag_cluster_nr[order]) { + ci = list_first_entry(&si->frag_clusters[order], + struct swap_cluster_info, list); + /* + * Rotate the frag list to iterate, they were all failing + * high order allocation or moved here due to per-CPU usage, + * this help keeping usable cluster ahead. + */ + list_move_tail(&ci->list, &si->frag_clusters[order]); offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); + frags++; if (found) break; } } - - list_splice_tail(&fraged, &si->frag_clusters[order]); } if (found) @@ -729,25 +804,28 @@ new_cluster: /* Order 0 stealing from higher order */ for (int o = 1; o < SWAP_NR_ORDERS; o++) { - if (!list_empty(&si->frag_clusters[o])) { + /* + * Clusters here have at least one usable slots and can't fail order 0 + * allocation, but reclaim may drop si->lock and race with another user. + */ + while (!list_empty(&si->frag_clusters[o])) { ci = list_first_entry(&si->frag_clusters[o], struct swap_cluster_info, list); - offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, - 0, usage); - VM_BUG_ON(!found); - goto done; - } - - if (!list_empty(&si->nonfull_clusters[o])) { - ci = list_first_entry(&si->nonfull_clusters[o], struct swap_cluster_info, - list); offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, 0, usage); - VM_BUG_ON(!found); - goto done; + if (found) + goto done; + } + + while (!list_empty(&si->nonfull_clusters[o])) { + ci = list_first_entry(&si->nonfull_clusters[o], + struct swap_cluster_info, list); + offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), + &found, 0, usage); + if (found) + goto done; } } - done: cluster->next[order] = offset; return found; @@ -3042,6 +3120,7 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, for (i = 0; i < SWAP_NR_ORDERS; i++) { INIT_LIST_HEAD(&p->nonfull_clusters[i]); INIT_LIST_HEAD(&p->frag_clusters[i]); + p->frag_cluster_nr[i] = 0; } for (i = 0; i < swap_header->info.nr_badpages; i++) { @@ -3085,7 +3164,6 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, if (!cluster_info) return nr_extents; - /* * Reduce false cache line sharing between cluster_info and * sharing same address space. From 2cacbdfdee65b18f9952620e762eab043d71b564 Mon Sep 17 00:00:00 2001 From: Kairui Song Date: Tue, 30 Jul 2024 23:49:21 -0700 Subject: [PATCH 159/416] mm: swap: add a adaptive full cluster cache reclaim Link all full cluster with one full list, and reclaim from it when the allocation have ran out of all usable clusters. There are many reason a folio can end up being in the swap cache while having no swap count reference. So the best way to search for such slots is still by iterating the swap clusters. 
With the list as an LRU, iterating from the oldest cluster and keeping them rotating is a doable and clean way to free up potentially unused clusters. On any allocation failure, try to reclaim and rotate only one cluster. This is adaptive for high order allocations, which can tolerate fallback, so it avoids latency and gives the full cluster list a fair chance to get reclaimed. It relieves the usage stress for the fallback order 0 allocation and for follow-up high order allocations. If the swap device is getting very full, reclaim more aggressively to ensure no OOM will happen. This ensures an order 0 heavy workload won't go OOM, as order 0 won't fail while any cluster still has free space. [ryncsn@gmail.com: fix discard of full cluster] Link: https://lkml.kernel.org/r/CAMgjq7CWwK75_2Zi5P40K08pk9iqOcuWKL6khu=x4Yg_nXaQag@mail.gmail.com Link: https://lkml.kernel.org/r/20240730-swap-allocator-v5-9-cb9c148b9297@kernel.org Signed-off-by: Kairui Song Reported-by: Barry Song <21cnbao@gmail.com> Cc: Chris Li Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kalesh Singh Cc: Ryan Roberts Cc: David Hildenbrand Cc: Kairui Song Signed-off-by: Andrew Morton --- include/linux/swap.h | 2 ++ mm/swapfile.c | 68 +++++++++++++++++++++++++++++++++++--------------- 2 files changed, 57 insertions(+), 13 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index e372122e4f45..1c8f844a9f0f 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -260,6 +260,7 @@ struct swap_cluster_info { #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ #define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */ #define CLUSTER_FLAG_FRAG 4 /* This cluster is on nonfull list */ +#define CLUSTER_FLAG_FULL 8 /* This cluster is on full list */ /* * The first page in the swap file is the swap header, which is always marked @@ -297,6 +298,7 @@ struct swap_info_struct { unsigned char *swap_map; /* vmalloc'ed array of usage counts */ struct swap_cluster_info *cluster_info; /* cluster info.
Only for SSD */ struct list_head free_clusters; /* free clusters list */ + struct list_head full_clusters; /* full clusters list */ struct list_head nonfull_clusters[SWAP_NR_ORDERS]; /* list of cluster that contains at least one free slot */ struct list_head frag_clusters[SWAP_NR_ORDERS]; diff --git a/mm/swapfile.c b/mm/swapfile.c index ed5f85504f37..b36a16131696 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -440,10 +440,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, SWAP_MAP_BAD, SWAPFILE_CLUSTER); VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE); - if (ci->flags & CLUSTER_FLAG_NONFULL) - list_move_tail(&ci->list, &si->discard_clusters); - else - list_add_tail(&ci->list, &si->discard_clusters); + list_move_tail(&ci->list, &si->discard_clusters); ci->flags = 0; schedule_work(&si->discard_work); } @@ -453,7 +450,7 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info lockdep_assert_held(&si->lock); lockdep_assert_held(&ci->lock); - if (ci->flags & CLUSTER_FLAG_NONFULL) + if (ci->flags) list_move_tail(&ci->list, &si->free_clusters); else list_add_tail(&ci->list, &si->free_clusters); @@ -480,7 +477,6 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si) SWAPFILE_CLUSTER); spin_lock(&si->lock); - spin_lock(&ci->lock); __free_cluster(si, ci); memset(si->swap_map + idx * SWAPFILE_CLUSTER, @@ -576,12 +572,9 @@ static void dec_cluster_info_page(struct swap_info_struct *p, if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE); - if (ci->flags & CLUSTER_FLAG_FRAG) { + if (ci->flags & CLUSTER_FLAG_FRAG) p->frag_cluster_nr[ci->order]--; - list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]); - } else { - list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]); - } + list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]); ci->flags = CLUSTER_FLAG_NONFULL; } } @@ -674,8 +667,8 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG))); if (ci->flags & CLUSTER_FLAG_FRAG) si->frag_cluster_nr[ci->order]--; - list_del(&ci->list); - ci->flags = 0; + list_move_tail(&ci->list, &si->full_clusters); + ci->flags = CLUSTER_FLAG_FULL; } } @@ -718,6 +711,46 @@ done: return offset; } +static void swap_reclaim_full_clusters(struct swap_info_struct *si) +{ + long to_scan = 1; + unsigned long offset, end; + struct swap_cluster_info *ci; + unsigned char *map = si->swap_map; + int nr_reclaim, total_reclaimed = 0; + + if (atomic_long_read(&nr_swap_pages) <= SWAPFILE_CLUSTER) + to_scan = si->inuse_pages / SWAPFILE_CLUSTER; + + while (!list_empty(&si->full_clusters)) { + ci = list_first_entry(&si->full_clusters, struct swap_cluster_info, list); + list_move_tail(&ci->list, &si->full_clusters); + offset = cluster_offset(si, ci); + end = min(si->max, offset + SWAPFILE_CLUSTER); + to_scan--; + + while (offset < end) { + if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) { + spin_unlock(&si->lock); + nr_reclaim = __try_to_reclaim_swap(si, offset, + TTRS_ANYWAY | TTRS_DIRECT); + spin_lock(&si->lock); + if (nr_reclaim > 0) { + offset += nr_reclaim; + total_reclaimed += nr_reclaim; + continue; + } else if (nr_reclaim < 0) { + offset += -nr_reclaim; + continue; + } + } + offset++; + } + if (to_scan <= 0 || total_reclaimed) + break; + } +} + /* * Try to get swap entries with specified order from current cpu's swap entry * pool (a cluster). 
This might involve allocating a new cluster for current CPU @@ -826,7 +859,15 @@ new_cluster: goto done; } } + done: + /* Try reclaim from full clusters if device is nearfull */ + if (vm_swap_full() && (!found || (si->pages - si->inuse_pages) < SWAPFILE_CLUSTER)) { + swap_reclaim_full_clusters(si); + if (!found && !order && si->pages != si->inuse_pages) + goto new_cluster; + } + cluster->next[order] = offset; return found; } @@ -3115,6 +3156,7 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, nr_good_pages = maxpages - 1; /* omit header page */ INIT_LIST_HEAD(&p->free_clusters); + INIT_LIST_HEAD(&p->full_clusters); INIT_LIST_HEAD(&p->discard_clusters); for (i = 0; i < SWAP_NR_ORDERS; i++) { From 0e8b67982b48ce77dce2cb19dfb006ecbc8755ea Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:40:45 +0300 Subject: [PATCH 160/416] mm: move kernel/numa.c to mm/ Patch series "mm: introduce numa_memblks", v4. Following the discussion about handling of CXL fixed memory windows on arm64 [1] I decided to bite the bullet and move numa_memblks from x86 to the generic code so they will be available on arm64/riscv and maybe on loongarch sometime later. While it could be possible to use memblock to describe CXL memory windows, it currently lacks notion of unpopulated memory ranges and numa_memblks does implement this. Another reason to make numa_memblks generic is that both arch_numa (arm64 and riscv) and loongarch use trimmed copy of x86 code although there is no fundamental reason why the same code cannot be used on all these platforms. Having numa_memblks in mm/ will make it's interaction with ACPI and FDT more consistent and I believe will reduce maintenance burden. And with generic numa_memblks it is (almost) straightforward to enable NUMA emulation on arm64 and riscv. The first 9 commits in this series are cleanups that are not strictly related to numa_memblks. Commits 10-16 slightly reorder code in x86 to allow extracting numa_memblks and NUMA emulation to the generic code. Commits 17-19 actually move the code from arch/x86/ to mm/ and commits 20-22 does some aftermath cleanups. Commit 23 updates of_numa_init() to return error of no NUMA nodes were found in the device tree. Commit 24 switches arch_numa to numa_memblks. Commit 25 enables usage of phys_to_target_node() and memory_add_physaddr_to_nid() with numa_memblks. Commit 26 moves the description for numa=fake from x86 to admin-guide. [1] https://lore.kernel.org/all/20240529171236.32002-1-Jonathan.Cameron@huawei.com/ This patch (of 26): The stub functions in kernel/numa.c belong to mm/ rather than to kernel/ Link: https://lkml.kernel.org/r/20240807064110.1003856-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20240807064110.1003856-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand Reviewed-by: Jonathan Cameron Tested-by: Zi Yan # for x86_64 and arm64 Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. 
Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- kernel/Makefile | 1 - mm/Makefile | 1 + {kernel => mm}/numa.c | 0 3 files changed, 1 insertion(+), 1 deletion(-) rename {kernel => mm}/numa.c (100%) diff --git a/kernel/Makefile b/kernel/Makefile index 3c13240dfc9f..87866b037fbe 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -116,7 +116,6 @@ obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o obj-$(CONFIG_HAVE_STATIC_CALL) += static_call.o obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call_inline.o obj-$(CONFIG_CFI_CLANG) += cfi.o -obj-$(CONFIG_NUMA) += numa.o obj-$(CONFIG_PERF_EVENTS) += events/ diff --git a/mm/Makefile b/mm/Makefile index 24b918260c7f..31fad7df64ad 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -117,6 +117,7 @@ obj-$(CONFIG_ZSMALLOC) += zsmalloc.o obj-$(CONFIG_Z3FOLD) += z3fold.o obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o obj-$(CONFIG_CMA) += cma.o +obj-$(CONFIG_NUMA) += numa.o obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o diff --git a/kernel/numa.c b/mm/numa.c similarity index 100% rename from kernel/numa.c rename to mm/numa.c From bc5c8ad3cbcb2a3986fd95f59ea0b34dd8e33f06 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:40:46 +0300 Subject: [PATCH 161/416] MIPS: sgi-ip27: make NODE_DATA() the same as on all other architectures sgi-ip27 is the only system that defines NODE_DATA() differently than the rest of NUMA machines. Add node_data array of struct pglist pointers that will point to __node_data[node]->pglist and redefine NODE_DATA() to use node_data array. This will allow pulling declaration of node_data to the generic mm code in the next commit. Link: https://lkml.kernel.org/r/20240807064110.1003856-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. 
Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/mips/include/asm/mach-ip27/mmzone.h | 5 ++++- arch/mips/sgi-ip27/ip27-memory.c | 5 ++++- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/mips/include/asm/mach-ip27/mmzone.h b/arch/mips/include/asm/mach-ip27/mmzone.h index 08c36e50a860..629c3f290203 100644 --- a/arch/mips/include/asm/mach-ip27/mmzone.h +++ b/arch/mips/include/asm/mach-ip27/mmzone.h @@ -22,7 +22,10 @@ struct node_data { extern struct node_data *__node_data[]; -#define NODE_DATA(n) (&__node_data[(n)]->pglist) #define hub_data(n) (&__node_data[(n)]->hub) +extern struct pglist_data *node_data[]; + +#define NODE_DATA(nid) (node_data[nid]) + #endif /* _ASM_MACH_MMZONE_H */ diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c index b8ca94cfb4fe..c30ef6958b97 100644 --- a/arch/mips/sgi-ip27/ip27-memory.c +++ b/arch/mips/sgi-ip27/ip27-memory.c @@ -34,8 +34,10 @@ #define SLOT_PFNSHIFT (SLOT_SHIFT - PAGE_SHIFT) #define PFN_NASIDSHFT (NASID_SHFT - PAGE_SHIFT) -struct node_data *__node_data[MAX_NUMNODES]; +struct pglist_data *node_data[MAX_NUMNODES]; +EXPORT_SYMBOL(node_data); +struct node_data *__node_data[MAX_NUMNODES]; EXPORT_SYMBOL(__node_data); static u64 gen_region_mask(void) @@ -361,6 +363,7 @@ static void __init node_mem_init(nasid_t node) */ __node_data[node] = __va(slot_freepfn << PAGE_SHIFT); memset(__node_data[node], 0, PAGE_SIZE); + node_data[node] = &__node_data[node]->pglist; NODE_DATA(node)->node_start_pfn = start_pfn; NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn; From 0c4450789cec1ffc8408c1fce9ba123318060b67 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:40:47 +0300 Subject: [PATCH 162/416] MIPS: sgi-ip27: ensure node_possible_map only contains valid nodes For SGI IP27 machines node_possible_map is statically set to NODE_MASK_ALL and it is not updated during NUMA initialization. Ensure that it only contains nodes present in the system. Link: https://lkml.kernel.org/r/20240807064110.1003856-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. 
Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/mips/sgi-ip27/ip27-smp.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/mips/sgi-ip27/ip27-smp.c b/arch/mips/sgi-ip27/ip27-smp.c index 5d2652a1d35a..62733e049570 100644 --- a/arch/mips/sgi-ip27/ip27-smp.c +++ b/arch/mips/sgi-ip27/ip27-smp.c @@ -70,11 +70,13 @@ void cpu_node_probe(void) gda_t *gdap = GDA; nodes_clear(node_online_map); + nodes_clear(node_possible_map); for (i = 0; i < MAX_NUMNODES; i++) { nasid_t nasid = gdap->g_nasidtable[i]; if (nasid == INVALID_NASID) break; node_set_online(nasid); + node_set(nasid, node_possible_map); highest = node_scan_cpus(nasid, highest); } From 6c701269ab7fe72dc8948dc92ba48fcdb93df8c9 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:40:48 +0300 Subject: [PATCH 163/416] MIPS: sgi-ip27: drop HAVE_ARCH_NODEDATA_EXTENSION Commit f8f9f21c7848 ("MIPS: Fix build error for loongson64 and sgi-ip27") added HAVE_ARCH_NODEDATA_EXTENSION to sgi-ip27 to silence a compilation error that happened because sgi-ip27 didn't define array of pg_data_t as node_data like most other architectures did. After addition of node_data array that matches other architectures and after ensuring that offline nodes do not appear on node_possible_map, it is safe to drop arch_alloc_nodedata() and HAVE_ARCH_NODEDATA_EXTENSION from sgi-ip27. Link: https://lkml.kernel.org/r/20240807064110.1003856-5-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/mips/Kconfig | 1 - arch/mips/sgi-ip27/ip27-memory.c | 10 ---------- 2 files changed, 11 deletions(-) diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index 60077e576935..ea5f3c3c31f6 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -735,7 +735,6 @@ config SGI_IP27 select WAR_R10000_LLSC select MIPS_L1_CACHE_SHIFT_7 select NUMA - select HAVE_ARCH_NODEDATA_EXTENSION help This are the SGI Origin 200, Origin 2000 and Onyx 2 Graphics workstations. 
To compile a Linux kernel that runs on these, say Y diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c index c30ef6958b97..eb6d2fa41a8a 100644 --- a/arch/mips/sgi-ip27/ip27-memory.c +++ b/arch/mips/sgi-ip27/ip27-memory.c @@ -426,13 +426,3 @@ void __init mem_init(void) memblock_free_all(); setup_zero_pages(); /* This comes from node 0 */ } - -pg_data_t * __init arch_alloc_nodedata(int nid) -{ - return memblock_alloc(sizeof(pg_data_t), SMP_CACHE_BYTES); -} - -void arch_refresh_nodedata(int nid, pg_data_t *pgdat) -{ - __node_data[nid] = (struct node_data *)pgdat; -} From e20bac6544bca0fd33e1e20617a9b82b003a23e1 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:40:49 +0300 Subject: [PATCH 164/416] MIPS: loongson64: rename __node_data to node_data Make definition of node_data match other architectures. This will allow pulling declaration of node_data to the generic mm code in the following commit. Link: https://lkml.kernel.org/r/20240807064110.1003856-6-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Jiaxun Yang Reviewed-by: David Hildenbrand Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/mips/include/asm/mach-loongson64/mmzone.h | 4 ++-- arch/mips/loongson64/numa.c | 8 ++++---- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/mips/include/asm/mach-loongson64/mmzone.h b/arch/mips/include/asm/mach-loongson64/mmzone.h index a3d65d37b8b5..2effd5f8ed62 100644 --- a/arch/mips/include/asm/mach-loongson64/mmzone.h +++ b/arch/mips/include/asm/mach-loongson64/mmzone.h @@ -14,9 +14,9 @@ #define pa_to_nid(addr) (((addr) & 0xf00000000000) >> NODE_ADDRSPACE_SHIFT) #define nid_to_addrbase(nid) ((unsigned long)(nid) << NODE_ADDRSPACE_SHIFT) -extern struct pglist_data *__node_data[]; +extern struct pglist_data *node_data[]; -#define NODE_DATA(n) (__node_data[n]) +#define NODE_DATA(n) (node_data[n]) extern void __init prom_init_numa_memory(void); diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c index 68dafd6d3e25..b50ce28d2741 100644 --- a/arch/mips/loongson64/numa.c +++ b/arch/mips/loongson64/numa.c @@ -29,8 +29,8 @@ unsigned char __node_distances[MAX_NUMNODES][MAX_NUMNODES]; EXPORT_SYMBOL(__node_distances); -struct pglist_data *__node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(__node_data); +struct pglist_data *node_data[MAX_NUMNODES]; +EXPORT_SYMBOL(node_data); cpumask_t __node_cpumask[MAX_NUMNODES]; EXPORT_SYMBOL(__node_cpumask); @@ -107,7 +107,7 @@ static void __init node_mem_init(unsigned int node) tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT); if (tnid != node) pr_info("NODE_DATA(%d) on node %d\n", node, tnid); - __node_data[node] = nd; + node_data[node] = nd; NODE_DATA(node)->node_start_pfn = start_pfn; NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn; @@ -206,5 +206,5 @@ pg_data_t * __init arch_alloc_nodedata(int nid) void arch_refresh_nodedata(int nid, pg_data_t *pgdat) { - __node_data[nid] = pgdat; + 
node_data[nid] = pgdat; } From 3ac9999c5d6f73f17e55ea360934def7f59b2890 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:40:50 +0300 Subject: [PATCH 165/416] MIPS: loongson64: drop HAVE_ARCH_NODEDATA_EXTENSION Commit f8f9f21c7848 ("MIPS: Fix build error for loongson64 and sgi-ip27") added HAVE_ARCH_NODEDATA_EXTENSION to loongson64 to silence a compilation error that happened because loongson64 didn't define array of pg_data_t as node_data like most other architectures did. After rename of __node_data to node_data arch_alloc_nodedata() and HAVE_ARCH_NODEDATA_EXTENSION can be dropped from loongson64. Since it was the only user of HAVE_ARCH_NODEDATA_EXTENSION config option also remove this option from arch/mips/Kconfig. Link: https://lkml.kernel.org/r/20240807064110.1003856-7-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/mips/Kconfig | 4 ---- arch/mips/loongson64/numa.c | 10 ---------- 2 files changed, 14 deletions(-) diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index ea5f3c3c31f6..43da6d596e2b 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -502,7 +502,6 @@ config MACH_LOONGSON64 select USE_OF select BUILTIN_DTB select PCI_HOST_GENERIC - select HAVE_ARCH_NODEDATA_EXTENSION if NUMA help This enables the support of Loongson-2/3 family of machines. @@ -2612,9 +2611,6 @@ config NUMA config SYS_SUPPORTS_NUMA bool -config HAVE_ARCH_NODEDATA_EXTENSION - bool - config RELOCATABLE bool "Relocatable kernel" depends on SYS_SUPPORTS_RELOCATABLE diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c index b50ce28d2741..64fcfaa885b6 100644 --- a/arch/mips/loongson64/numa.c +++ b/arch/mips/loongson64/numa.c @@ -198,13 +198,3 @@ void __init prom_init_numa_memory(void) pr_info("CP0_PageGrain: CP0 5.1 (0x%x)\n", read_c0_pagegrain()); prom_meminit(); } - -pg_data_t * __init arch_alloc_nodedata(int nid) -{ - return memblock_alloc(sizeof(pg_data_t), SMP_CACHE_BYTES); -} - -void arch_refresh_nodedata(int nid, pg_data_t *pgdat) -{ - node_data[nid] = pgdat; -} From 46bcce503197d1019ee5c49ccde978e31298e35f Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:40:51 +0300 Subject: [PATCH 166/416] arch, mm: move definition of node_data to generic code Every architecture that supports NUMA defines node_data in the same way: struct pglist_data *node_data[MAX_NUMNODES]; No reason to keep multiple copies of this definition and its forward declarations, especially when such forward declaration is the only thing in include/asm/mmzone.h for many architectures. Add definition and declaration of node_data to generic code and drop architecture-specific versions. 
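To make the resulting shape concrete, here is a minimal standalone C sketch of the pattern (struct pglist_data is mocked and the values are illustrative; this is not the kernel code itself): the shared header carries one declaration plus the NODE_DATA() accessor, generic code carries the single definition, and architecture init code only fills in the pointers.

  #include <stdio.h>

  #define MAX_NUMNODES 4

  /* mocked stand-in for the real pg_data_t */
  struct pglist_data {
  	unsigned long node_start_pfn;
  	unsigned long node_spanned_pages;
  };

  /* what the shared header provides: one declaration and one accessor */
  extern struct pglist_data *node_data[];
  #define NODE_DATA(nid) (node_data[nid])

  /* what the generic code provides: the single shared definition */
  struct pglist_data *node_data[MAX_NUMNODES];

  int main(void)
  {
  	static struct pglist_data node0 = {
  		.node_start_pfn = 0,
  		.node_spanned_pages = 1UL << 18,
  	};

  	/* what per-architecture init code still does: hook up its node */
  	node_data[0] = &node0;

  	printf("node 0: %lu pages from pfn %lu\n",
  	       NODE_DATA(0)->node_spanned_pages,
  	       NODE_DATA(0)->node_start_pfn);
  	return 0;
  }

Because every consumer already goes through NODE_DATA(), moving the single definition to generic code changes no call sites.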
Link: https://lkml.kernel.org/r/20240807064110.1003856-8-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand Reviewed-by: Jonathan Cameron Acked-by: Davidlohr Bueso Tested-by: Zi Yan # for x86_64 and arm64 Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/arm64/include/asm/Kbuild | 1 + arch/arm64/include/asm/mmzone.h | 13 ------------- arch/arm64/include/asm/topology.h | 1 + arch/loongarch/include/asm/Kbuild | 1 + arch/loongarch/include/asm/mmzone.h | 16 ---------------- arch/loongarch/include/asm/topology.h | 1 + arch/loongarch/kernel/numa.c | 3 --- arch/mips/include/asm/mach-ip27/mmzone.h | 4 ---- arch/mips/include/asm/mach-loongson64/mmzone.h | 4 ---- arch/mips/loongson64/numa.c | 2 -- arch/mips/sgi-ip27/ip27-memory.c | 3 --- arch/powerpc/include/asm/mmzone.h | 6 ------ arch/powerpc/mm/numa.c | 2 -- arch/riscv/include/asm/Kbuild | 1 + arch/riscv/include/asm/mmzone.h | 13 ------------- arch/riscv/include/asm/topology.h | 4 ++++ arch/s390/include/asm/Kbuild | 1 + arch/s390/include/asm/mmzone.h | 17 ----------------- arch/s390/kernel/numa.c | 3 --- arch/sh/include/asm/mmzone.h | 3 --- arch/sh/mm/numa.c | 3 --- arch/sparc/include/asm/mmzone.h | 4 ---- arch/sparc/mm/init_64.c | 2 -- arch/x86/include/asm/Kbuild | 1 + arch/x86/include/asm/mmzone.h | 6 ------ arch/x86/include/asm/mmzone_32.h | 17 ----------------- arch/x86/include/asm/mmzone_64.h | 18 ------------------ arch/x86/mm/numa.c | 3 --- drivers/base/arch_numa.c | 2 -- include/asm-generic/mmzone.h | 5 +++++ include/linux/numa.h | 3 +++ mm/numa.c | 3 +++ 32 files changed, 22 insertions(+), 144 deletions(-) delete mode 100644 arch/arm64/include/asm/mmzone.h delete mode 100644 arch/loongarch/include/asm/mmzone.h delete mode 100644 arch/riscv/include/asm/mmzone.h delete mode 100644 arch/s390/include/asm/mmzone.h delete mode 100644 arch/x86/include/asm/mmzone.h delete mode 100644 arch/x86/include/asm/mmzone_32.h delete mode 100644 arch/x86/include/asm/mmzone_64.h create mode 100644 include/asm-generic/mmzone.h diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild index 7d7d97ad3cd5..4e350df9a02d 100644 --- a/arch/arm64/include/asm/Kbuild +++ b/arch/arm64/include/asm/Kbuild @@ -9,6 +9,7 @@ syscall-y += unistd_compat_32.h generic-y += early_ioremap.h generic-y += mcs_spinlock.h +generic-y += mmzone.h generic-y += qrwlock.h generic-y += qspinlock.h generic-y += parport.h diff --git a/arch/arm64/include/asm/mmzone.h b/arch/arm64/include/asm/mmzone.h deleted file mode 100644 index fa17e01d9ab2..000000000000 --- a/arch/arm64/include/asm/mmzone.h +++ /dev/null @@ -1,13 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef __ASM_MMZONE_H -#define __ASM_MMZONE_H - -#ifdef CONFIG_NUMA - -#include - -extern struct pglist_data *node_data[]; -#define NODE_DATA(nid) (node_data[(nid)]) - -#endif /* CONFIG_NUMA */ -#endif /* __ASM_MMZONE_H */ diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h index 0f6ef432fb84..5fc3af9f8f29 100644 --- 
a/arch/arm64/include/asm/topology.h +++ b/arch/arm64/include/asm/topology.h @@ -5,6 +5,7 @@ #include #ifdef CONFIG_NUMA +#include struct pci_bus; int pcibus_to_node(struct pci_bus *bus); diff --git a/arch/loongarch/include/asm/Kbuild b/arch/loongarch/include/asm/Kbuild index 2bb3676429c0..8fa22cc52774 100644 --- a/arch/loongarch/include/asm/Kbuild +++ b/arch/loongarch/include/asm/Kbuild @@ -9,5 +9,6 @@ generic-y += qrwlock.h generic-y += qspinlock.h generic-y += user.h generic-y += ioctl.h +generic-y += mmzone.h generic-y += statfs.h generic-y += param.h diff --git a/arch/loongarch/include/asm/mmzone.h b/arch/loongarch/include/asm/mmzone.h deleted file mode 100644 index 2b9a90727e19..000000000000 --- a/arch/loongarch/include/asm/mmzone.h +++ /dev/null @@ -1,16 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -/* - * Author: Huacai Chen (chenhuacai@loongson.cn) - * Copyright (C) 2020-2022 Loongson Technology Corporation Limited - */ -#ifndef _ASM_MMZONE_H_ -#define _ASM_MMZONE_H_ - -#include -#include - -extern struct pglist_data *node_data[]; - -#define NODE_DATA(nid) (node_data[(nid)]) - -#endif /* _ASM_MMZONE_H_ */ diff --git a/arch/loongarch/include/asm/topology.h b/arch/loongarch/include/asm/topology.h index 66128dec0bf6..50273c9187d0 100644 --- a/arch/loongarch/include/asm/topology.h +++ b/arch/loongarch/include/asm/topology.h @@ -8,6 +8,7 @@ #include #ifdef CONFIG_NUMA +#include extern cpumask_t cpus_on_node[]; diff --git a/arch/loongarch/kernel/numa.c b/arch/loongarch/kernel/numa.c index 8fe21f868f72..acada671e020 100644 --- a/arch/loongarch/kernel/numa.c +++ b/arch/loongarch/kernel/numa.c @@ -27,10 +27,7 @@ #include int numa_off; -struct pglist_data *node_data[MAX_NUMNODES]; unsigned char node_distances[MAX_NUMNODES][MAX_NUMNODES]; - -EXPORT_SYMBOL(node_data); EXPORT_SYMBOL(node_distances); static struct numa_meminfo numa_meminfo; diff --git a/arch/mips/include/asm/mach-ip27/mmzone.h b/arch/mips/include/asm/mach-ip27/mmzone.h index 629c3f290203..56959eb9cb26 100644 --- a/arch/mips/include/asm/mach-ip27/mmzone.h +++ b/arch/mips/include/asm/mach-ip27/mmzone.h @@ -24,8 +24,4 @@ extern struct node_data *__node_data[]; #define hub_data(n) (&__node_data[(n)]->hub) -extern struct pglist_data *node_data[]; - -#define NODE_DATA(nid) (node_data[nid]) - #endif /* _ASM_MACH_MMZONE_H */ diff --git a/arch/mips/include/asm/mach-loongson64/mmzone.h b/arch/mips/include/asm/mach-loongson64/mmzone.h index 2effd5f8ed62..8fb70fd3c9c4 100644 --- a/arch/mips/include/asm/mach-loongson64/mmzone.h +++ b/arch/mips/include/asm/mach-loongson64/mmzone.h @@ -14,10 +14,6 @@ #define pa_to_nid(addr) (((addr) & 0xf00000000000) >> NODE_ADDRSPACE_SHIFT) #define nid_to_addrbase(nid) ((unsigned long)(nid) << NODE_ADDRSPACE_SHIFT) -extern struct pglist_data *node_data[]; - -#define NODE_DATA(n) (node_data[n]) - extern void __init prom_init_numa_memory(void); #endif /* _ASM_MACH_MMZONE_H */ diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c index 64fcfaa885b6..d56238745744 100644 --- a/arch/mips/loongson64/numa.c +++ b/arch/mips/loongson64/numa.c @@ -29,8 +29,6 @@ unsigned char __node_distances[MAX_NUMNODES][MAX_NUMNODES]; EXPORT_SYMBOL(__node_distances); -struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(node_data); cpumask_t __node_cpumask[MAX_NUMNODES]; EXPORT_SYMBOL(__node_cpumask); diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c index eb6d2fa41a8a..1963313f55d8 100644 --- a/arch/mips/sgi-ip27/ip27-memory.c +++ b/arch/mips/sgi-ip27/ip27-memory.c @@ -34,9 
+34,6 @@ #define SLOT_PFNSHIFT (SLOT_SHIFT - PAGE_SHIFT) #define PFN_NASIDSHFT (NASID_SHFT - PAGE_SHIFT) -struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(node_data); - struct node_data *__node_data[MAX_NUMNODES]; EXPORT_SYMBOL(__node_data); diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index da827d2d0866..d99863cd6cde 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -20,12 +20,6 @@ #ifdef CONFIG_NUMA -extern struct pglist_data *node_data[]; -/* - * Return a pointer to the node data for node n. - */ -#define NODE_DATA(nid) (node_data[nid]) - /* * Following are specific to this numa platform. */ diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index aa89899f0c1a..0744a9a2944b 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -43,11 +43,9 @@ static char *cmdline __initdata; int numa_cpu_lookup_table[NR_CPUS]; cpumask_var_t node_to_cpumask_map[MAX_NUMNODES]; -struct pglist_data *node_data[MAX_NUMNODES]; EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); -EXPORT_SYMBOL(node_data); static int primary_domain_index; static int n_mem_addr_cells, n_mem_size_cells; diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild index 5c589770f2a8..1461af12da6e 100644 --- a/arch/riscv/include/asm/Kbuild +++ b/arch/riscv/include/asm/Kbuild @@ -5,6 +5,7 @@ syscall-y += syscall_table_64.h generic-y += early_ioremap.h generic-y += flat.h generic-y += kvm_para.h +generic-y += mmzone.h generic-y += parport.h generic-y += spinlock.h generic-y += spinlock_types.h diff --git a/arch/riscv/include/asm/mmzone.h b/arch/riscv/include/asm/mmzone.h deleted file mode 100644 index fa17e01d9ab2..000000000000 --- a/arch/riscv/include/asm/mmzone.h +++ /dev/null @@ -1,13 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef __ASM_MMZONE_H -#define __ASM_MMZONE_H - -#ifdef CONFIG_NUMA - -#include - -extern struct pglist_data *node_data[]; -#define NODE_DATA(nid) (node_data[(nid)]) - -#endif /* CONFIG_NUMA */ -#endif /* __ASM_MMZONE_H */ diff --git a/arch/riscv/include/asm/topology.h b/arch/riscv/include/asm/topology.h index 61183688bdd5..fe1a8bf6902d 100644 --- a/arch/riscv/include/asm/topology.h +++ b/arch/riscv/include/asm/topology.h @@ -4,6 +4,10 @@ #include +#ifdef CONFIG_NUMA +#include +#endif + /* Replace task scheduler's default frequency-invariant accounting */ #define arch_scale_freq_tick topology_scale_freq_tick #define arch_set_freq_scale topology_set_freq_scale diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild index 4b904110d27c..297bf7157968 100644 --- a/arch/s390/include/asm/Kbuild +++ b/arch/s390/include/asm/Kbuild @@ -7,3 +7,4 @@ generated-y += unistd_nr.h generic-y += asm-offsets.h generic-y += kvm_types.h generic-y += mcs_spinlock.h +generic-y += mmzone.h diff --git a/arch/s390/include/asm/mmzone.h b/arch/s390/include/asm/mmzone.h deleted file mode 100644 index 73e3e7c6976c..000000000000 --- a/arch/s390/include/asm/mmzone.h +++ /dev/null @@ -1,17 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -/* - * NUMA support for s390 - * - * Copyright IBM Corp. 
2015 - */ - -#ifndef _ASM_S390_MMZONE_H -#define _ASM_S390_MMZONE_H - -#ifdef CONFIG_NUMA - -extern struct pglist_data *node_data[]; -#define NODE_DATA(nid) (node_data[nid]) - -#endif /* CONFIG_NUMA */ -#endif /* _ASM_S390_MMZONE_H */ diff --git a/arch/s390/kernel/numa.c b/arch/s390/kernel/numa.c index 23ab9f02f278..ddc1448ea2e1 100644 --- a/arch/s390/kernel/numa.c +++ b/arch/s390/kernel/numa.c @@ -14,9 +14,6 @@ #include #include -struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(node_data); - void __init numa_setup(void) { int nid; diff --git a/arch/sh/include/asm/mmzone.h b/arch/sh/include/asm/mmzone.h index 7b8dead2723d..63f88b465e39 100644 --- a/arch/sh/include/asm/mmzone.h +++ b/arch/sh/include/asm/mmzone.h @@ -5,9 +5,6 @@ #ifdef CONFIG_NUMA #include -extern struct pglist_data *node_data[]; -#define NODE_DATA(nid) (node_data[nid]) - static inline int pfn_to_nid(unsigned long pfn) { int nid; diff --git a/arch/sh/mm/numa.c b/arch/sh/mm/numa.c index 50f0dc1744d0..9bc212b5e762 100644 --- a/arch/sh/mm/numa.c +++ b/arch/sh/mm/numa.c @@ -14,9 +14,6 @@ #include #include -struct pglist_data *node_data[MAX_NUMNODES] __read_mostly; -EXPORT_SYMBOL_GPL(node_data); - /* * On SH machines the conventional approach is to stash system RAM * in node 0, and other memory blocks in to node 1 and up, ordered by diff --git a/arch/sparc/include/asm/mmzone.h b/arch/sparc/include/asm/mmzone.h index a236d8aa893a..74eb2c71d077 100644 --- a/arch/sparc/include/asm/mmzone.h +++ b/arch/sparc/include/asm/mmzone.h @@ -6,10 +6,6 @@ #include -extern struct pglist_data *node_data[]; - -#define NODE_DATA(nid) (node_data[nid]) - extern int numa_cpu_lookup_table[]; extern cpumask_t numa_cpumask_lookup_table[]; diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index 53d7cb5bbffe..c6c7f43cb1e8 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -1115,11 +1115,9 @@ static void init_node_masks_nonnuma(void) } #ifdef CONFIG_NUMA -struct pglist_data *node_data[MAX_NUMNODES]; EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(numa_cpumask_lookup_table); -EXPORT_SYMBOL(node_data); static int scan_pio_for_cfg_handle(struct mdesc_handle *md, u64 pio, u32 cfg_handle) diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild index a192bdea69e2..6c23d1661b17 100644 --- a/arch/x86/include/asm/Kbuild +++ b/arch/x86/include/asm/Kbuild @@ -11,3 +11,4 @@ generated-y += xen-hypercalls.h generic-y += early_ioremap.h generic-y += mcs_spinlock.h +generic-y += mmzone.h diff --git a/arch/x86/include/asm/mmzone.h b/arch/x86/include/asm/mmzone.h deleted file mode 100644 index c41b41edd691..000000000000 --- a/arch/x86/include/asm/mmzone.h +++ /dev/null @@ -1,6 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifdef CONFIG_X86_32 -# include -#else -# include -#endif diff --git a/arch/x86/include/asm/mmzone_32.h b/arch/x86/include/asm/mmzone_32.h deleted file mode 100644 index 2d4515e8b7df..000000000000 --- a/arch/x86/include/asm/mmzone_32.h +++ /dev/null @@ -1,17 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -/* - * Written by Pat Gaughen (gone@us.ibm.com) Mar 2002 - * - */ - -#ifndef _ASM_X86_MMZONE_32_H -#define _ASM_X86_MMZONE_32_H - -#include - -#ifdef CONFIG_NUMA -extern struct pglist_data *node_data[]; -#define NODE_DATA(nid) (node_data[nid]) -#endif /* CONFIG_NUMA */ - -#endif /* _ASM_X86_MMZONE_32_H */ diff --git a/arch/x86/include/asm/mmzone_64.h b/arch/x86/include/asm/mmzone_64.h deleted file mode 100644 index 0c585046f744..000000000000 --- a/arch/x86/include/asm/mmzone_64.h +++ 
/dev/null @@ -1,18 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -/* K8 NUMA support */ -/* Copyright 2002,2003 by Andi Kleen, SuSE Labs */ -/* 2.5 Version loosely based on the NUMAQ Code by Pat Gaughen. */ -#ifndef _ASM_X86_MMZONE_64_H -#define _ASM_X86_MMZONE_64_H - -#ifdef CONFIG_NUMA - -#include -#include - -extern struct pglist_data *node_data[]; - -#define NODE_DATA(nid) (node_data[nid]) - -#endif -#endif /* _ASM_X86_MMZONE_64_H */ diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 6ce10e3c6228..7de725d6bb05 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -24,9 +24,6 @@ int numa_off; nodemask_t numa_nodes_parsed __initdata; -struct pglist_data *node_data[MAX_NUMNODES] __read_mostly; -EXPORT_SYMBOL(node_data); - static struct numa_meminfo numa_meminfo __initdata_or_meminfo; static struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo; diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c index 555aee3ee8e7..ceac5b59bf2b 100644 --- a/drivers/base/arch_numa.c +++ b/drivers/base/arch_numa.c @@ -15,8 +15,6 @@ #include -struct pglist_data *node_data[MAX_NUMNODES] __read_mostly; -EXPORT_SYMBOL(node_data); nodemask_t numa_nodes_parsed __initdata; static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE }; diff --git a/include/asm-generic/mmzone.h b/include/asm-generic/mmzone.h new file mode 100644 index 000000000000..2ab5193e8394 --- /dev/null +++ b/include/asm-generic/mmzone.h @@ -0,0 +1,5 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_GENERIC_MMZONE_H +#define _ASM_GENERIC_MMZONE_H + +#endif diff --git a/include/linux/numa.h b/include/linux/numa.h index eb19503604fe..e5841d4057ab 100644 --- a/include/linux/numa.h +++ b/include/linux/numa.h @@ -30,6 +30,9 @@ static inline bool numa_valid_node(int nid) #ifdef CONFIG_NUMA #include +extern struct pglist_data *node_data[]; +#define NODE_DATA(nid) (node_data[nid]) + /* Generic implementation available */ int numa_nearest_node(int node, unsigned int state); diff --git a/mm/numa.c b/mm/numa.c index 67ca6b8585c0..8c157d41c026 100644 --- a/mm/numa.c +++ b/mm/numa.c @@ -3,6 +3,9 @@ #include #include +struct pglist_data *node_data[MAX_NUMNODES]; +EXPORT_SYMBOL(node_data); + /* Stub functions: */ #ifndef memory_add_physaddr_to_nid From ec164cf1dd3df4ae58931b9b491b06a90d5c22d3 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:40:52 +0300 Subject: [PATCH 167/416] mm: drop CONFIG_HAVE_ARCH_NODEDATA_EXTENSION There are no users of HAVE_ARCH_NODEDATA_EXTENSION left, so arch_alloc_nodedata() and arch_refresh_nodedata() are not needed anymore. Replace the call to arch_alloc_nodedata() in free_area_init() with a new helper alloc_offline_node_data(), remove arch_refresh_nodedata() and cleanup include/linux/memory_hotplug.h from the associated ifdefery. Link: https://lkml.kernel.org/r/20240807064110.1003856-9-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Tested-by: Zi Yan # for x86_64 and arm64 Acked-by: Dan Williams Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: David Hildenbrand Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Cameron Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. 
Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/memory_hotplug.h | 48 ---------------------------------- include/linux/numa.h | 4 +++ mm/mm_init.c | 10 ++----- mm/numa.c | 12 +++++++++ 4 files changed, 18 insertions(+), 56 deletions(-) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index ebe876930e78..b27ddce5d324 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -16,54 +16,6 @@ struct resource; struct vmem_altmap; struct dev_pagemap; -#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION -/* - * For supporting node-hotadd, we have to allocate a new pgdat. - * - * If an arch has generic style NODE_DATA(), - * node_data[nid] = kzalloc() works well. But it depends on the architecture. - * - * In general, generic_alloc_nodedata() is used. - * - */ -extern pg_data_t *arch_alloc_nodedata(int nid); -extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat); - -#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */ - -#define arch_alloc_nodedata(nid) generic_alloc_nodedata(nid) - -#ifdef CONFIG_NUMA -/* - * XXX: node aware allocation can't work well to get new node's memory at this time. - * Because, pgdat for the new node is not allocated/initialized yet itself. - * To use new node's memory, more consideration will be necessary. - */ -#define generic_alloc_nodedata(nid) \ -({ \ - memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES); \ -}) - -extern pg_data_t *node_data[]; -static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat) -{ - node_data[nid] = pgdat; -} - -#else /* !CONFIG_NUMA */ - -/* never called */ -static inline pg_data_t *generic_alloc_nodedata(int nid) -{ - BUG(); - return NULL; -} -static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat) -{ -} -#endif /* CONFIG_NUMA */ -#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */ - #ifdef CONFIG_MEMORY_HOTPLUG struct page *pfn_to_online_page(unsigned long pfn); diff --git a/include/linux/numa.h b/include/linux/numa.h index e5841d4057ab..b41b1569781b 100644 --- a/include/linux/numa.h +++ b/include/linux/numa.h @@ -33,6 +33,8 @@ static inline bool numa_valid_node(int nid) extern struct pglist_data *node_data[]; #define NODE_DATA(nid) (node_data[nid]) +void __init alloc_offline_node_data(int nid); + /* Generic implementation available */ int numa_nearest_node(int node, unsigned int state); @@ -60,6 +62,8 @@ static inline int phys_to_target_node(u64 start) { return 0; } + +static inline void alloc_offline_node_data(int nid) {} #endif #define numa_map_to_online_node(node) numa_nearest_node(node, N_ONLINE) diff --git a/mm/mm_init.c b/mm/mm_init.c index f3106b6299d3..4ba5607aaf19 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -1835,14 +1835,8 @@ void __init free_area_init(unsigned long *max_zone_pfn) for_each_node(nid) { pg_data_t *pgdat; - if (!node_online(nid)) { - /* Allocator not initialized yet */ - pgdat = arch_alloc_nodedata(nid); - if (!pgdat) - panic("Cannot allocate %zuB for node %d.\n", - sizeof(*pgdat), nid); - arch_refresh_nodedata(nid, pgdat); - } + if (!node_online(nid)) + alloc_offline_node_data(nid); pgdat = NODE_DATA(nid); free_area_init_node(nid); diff --git a/mm/numa.c b/mm/numa.c index 8c157d41c026..64e1b7d2c1ee 100644 --- a/mm/numa.c +++ b/mm/numa.c @@ -6,6 +6,18 @@ struct pglist_data *node_data[MAX_NUMNODES]; EXPORT_SYMBOL(node_data); +void __init alloc_offline_node_data(int nid) +{ + pg_data_t *pgdat; + + pgdat = 
memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES); + if (!pgdat) + panic("Cannot allocate %zuB for node %d.\n", + sizeof(*pgdat), nid); + + node_data[nid] = pgdat; +} + /* Stub functions: */ #ifndef memory_add_physaddr_to_nid From 3515863d9f29dd476cdad875fbd558fc85b4c511 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:40:53 +0300 Subject: [PATCH 168/416] arch, mm: pull out allocation of NODE_DATA to generic code Architectures that support NUMA duplicate the code that allocates NODE_DATA on the node-local memory with slight variations in reporting of the addresses where the memory was allocated. Use x86 version as the basis for the generic alloc_node_data() function and call this function in architecture specific numa initialization. Round up node data size to SMP_CACHE_BYTES rather than to PAGE_SIZE like x86 used to do since the bootmem era when allocation granularity was PAGE_SIZE anyway. Link: https://lkml.kernel.org/r/20240807064110.1003856-10-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand Reviewed-by: Jonathan Cameron Tested-by: Zi Yan # for x86_64 and arm64 Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/loongarch/kernel/numa.c | 18 ------------------ arch/mips/loongson64/numa.c | 16 ++-------------- arch/powerpc/mm/numa.c | 20 +------------------- arch/sh/mm/init.c | 7 +------ arch/sparc/mm/init_64.c | 9 ++------- arch/x86/mm/numa.c | 34 +--------------------------------- drivers/base/arch_numa.c | 21 +-------------------- include/linux/numa.h | 1 + mm/numa.c | 27 +++++++++++++++++++++++++++ 9 files changed, 36 insertions(+), 117 deletions(-) diff --git a/arch/loongarch/kernel/numa.c b/arch/loongarch/kernel/numa.c index acada671e020..84fe7f854820 100644 --- a/arch/loongarch/kernel/numa.c +++ b/arch/loongarch/kernel/numa.c @@ -187,24 +187,6 @@ int __init numa_add_memblk(int nid, u64 start, u64 end) return numa_add_memblk_to(nid, start, end, &numa_meminfo); } -static void __init alloc_node_data(int nid) -{ - void *nd; - unsigned long nd_pa; - size_t nd_sz = roundup(sizeof(pg_data_t), PAGE_SIZE); - - nd_pa = memblock_phys_alloc_try_nid(nd_sz, SMP_CACHE_BYTES, nid); - if (!nd_pa) { - pr_err("Cannot find %zu Byte for node_data (initial node: %d)\n", nd_sz, nid); - return; - } - - nd = __va(nd_pa); - - node_data[nid] = nd; - memset(nd, 0, sizeof(pg_data_t)); -} - static void __init node_mem_init(unsigned int node) { unsigned long start_pfn, end_pfn; diff --git a/arch/mips/loongson64/numa.c b/arch/mips/loongson64/numa.c index d56238745744..8388400d052f 100644 --- a/arch/mips/loongson64/numa.c +++ b/arch/mips/loongson64/numa.c @@ -81,12 +81,8 @@ static void __init init_topology_matrix(void) static void __init node_mem_init(unsigned int node) { - struct pglist_data *nd; unsigned long node_addrspace_offset; unsigned long start_pfn, end_pfn; - unsigned long nd_pa; - int tnid; - const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES); node_addrspace_offset = 
nid_to_addrbase(node); pr_info("Node%d's addrspace_offset is 0x%lx\n", @@ -96,16 +92,8 @@ static void __init node_mem_init(unsigned int node) pr_info("Node%d: start_pfn=0x%lx, end_pfn=0x%lx\n", node, start_pfn, end_pfn); - nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, node); - if (!nd_pa) - panic("Cannot allocate %zu bytes for node %d data\n", - nd_size, node); - nd = __va(nd_pa); - memset(nd, 0, sizeof(struct pglist_data)); - tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT); - if (tnid != node) - pr_info("NODE_DATA(%d) on node %d\n", node, tnid); - node_data[node] = nd; + alloc_node_data(node); + NODE_DATA(node)->node_start_pfn = start_pfn; NODE_DATA(node)->node_spanned_pages = end_pfn - start_pfn; diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 0744a9a2944b..3c1da08304d0 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1093,27 +1093,9 @@ void __init dump_numa_cpu_topology(void) static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn) { u64 spanned_pages = end_pfn - start_pfn; - const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES); - u64 nd_pa; - void *nd; - int tnid; - nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid); - if (!nd_pa) - panic("Cannot allocate %zu bytes for node %d data\n", - nd_size, nid); + alloc_node_data(nid); - nd = __va(nd_pa); - - /* report and initialize */ - pr_info(" NODE_DATA [mem %#010Lx-%#010Lx]\n", - nd_pa, nd_pa + nd_size - 1); - tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT); - if (tnid != nid) - pr_info(" NODE_DATA(%d) on node %d\n", nid, tnid); - - node_data[nid] = nd; - memset(NODE_DATA(nid), 0, sizeof(pg_data_t)); NODE_DATA(nid)->node_id = nid; NODE_DATA(nid)->node_start_pfn = start_pfn; NODE_DATA(nid)->node_spanned_pages = spanned_pages; diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c index d1fe90b2f5ff..2a88b0c9e70f 100644 --- a/arch/sh/mm/init.c +++ b/arch/sh/mm/init.c @@ -212,12 +212,7 @@ void __init allocate_pgdat(unsigned int nid) get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); #ifdef CONFIG_NUMA - NODE_DATA(nid) = memblock_alloc_try_nid( - sizeof(struct pglist_data), - SMP_CACHE_BYTES, MEMBLOCK_LOW_LIMIT, - MEMBLOCK_ALLOC_ACCESSIBLE, nid); - if (!NODE_DATA(nid)) - panic("Can't allocate pgdat for node %d\n", nid); + alloc_node_data(nid); #endif NODE_DATA(nid)->node_start_pfn = start_pfn; diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index c6c7f43cb1e8..21f8cbbd0581 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -1075,14 +1075,9 @@ static void __init allocate_node_data(int nid) { struct pglist_data *p; unsigned long start_pfn, end_pfn; -#ifdef CONFIG_NUMA - NODE_DATA(nid) = memblock_alloc_node(sizeof(struct pglist_data), - SMP_CACHE_BYTES, nid); - if (!NODE_DATA(nid)) { - prom_printf("Cannot allocate pglist_data for nid[%d]\n", nid); - prom_halt(); - } +#ifdef CONFIG_NUMA + alloc_node_data(nid); NODE_DATA(nid)->node_id = nid; #endif diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 7de725d6bb05..5e1dde26674b 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -191,39 +191,6 @@ int __init numa_add_memblk(int nid, u64 start, u64 end) return numa_add_memblk_to(nid, start, end, &numa_meminfo); } -/* Allocate NODE_DATA for a node on the local memory */ -static void __init alloc_node_data(int nid) -{ - const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE); - u64 nd_pa; - void *nd; - int tnid; - - /* - * Allocate node data. Try node-local memory and then any node. - * Never allocate in DMA zone. 
- */ - nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid); - if (!nd_pa) { - pr_err("Cannot find %zu bytes in any node (initial node: %d)\n", - nd_size, nid); - return; - } - nd = __va(nd_pa); - - /* report and initialize */ - printk(KERN_INFO "NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid, - nd_pa, nd_pa + nd_size - 1); - tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT); - if (tnid != nid) - printk(KERN_INFO " NODE_DATA(%d) on node %d\n", nid, tnid); - - node_data[nid] = nd; - memset(NODE_DATA(nid), 0, sizeof(pg_data_t)); - - node_set_online(nid); -} - /** * numa_cleanup_meminfo - Cleanup a numa_meminfo * @mi: numa_meminfo to clean up @@ -571,6 +538,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi) continue; alloc_node_data(nid); + node_set_online(nid); } /* Dump memblock with node info and return. */ diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c index ceac5b59bf2b..b6af7475ec44 100644 --- a/drivers/base/arch_numa.c +++ b/drivers/base/arch_numa.c @@ -216,30 +216,11 @@ int __init numa_add_memblk(int nid, u64 start, u64 end) */ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn) { - const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES); - u64 nd_pa; - void *nd; - int tnid; - if (start_pfn >= end_pfn) pr_info("Initmem setup node %d []\n", nid); - nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid); - if (!nd_pa) - panic("Cannot allocate %zu bytes for node %d data\n", - nd_size, nid); + alloc_node_data(nid); - nd = __va(nd_pa); - - /* report and initialize */ - pr_info("NODE_DATA [mem %#010Lx-%#010Lx]\n", - nd_pa, nd_pa + nd_size - 1); - tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT); - if (tnid != nid) - pr_info("NODE_DATA(%d) on node %d\n", nid, tnid); - - node_data[nid] = nd; - memset(NODE_DATA(nid), 0, sizeof(pg_data_t)); NODE_DATA(nid)->node_id = nid; NODE_DATA(nid)->node_start_pfn = start_pfn; NODE_DATA(nid)->node_spanned_pages = end_pfn - start_pfn; diff --git a/include/linux/numa.h b/include/linux/numa.h index b41b1569781b..3567e40329eb 100644 --- a/include/linux/numa.h +++ b/include/linux/numa.h @@ -33,6 +33,7 @@ static inline bool numa_valid_node(int nid) extern struct pglist_data *node_data[]; #define NODE_DATA(nid) (node_data[nid]) +void __init alloc_node_data(int nid); void __init alloc_offline_node_data(int nid); /* Generic implementation available */ diff --git a/mm/numa.c b/mm/numa.c index 64e1b7d2c1ee..1f1582dcdf4a 100644 --- a/mm/numa.c +++ b/mm/numa.c @@ -1,11 +1,38 @@ // SPDX-License-Identifier: GPL-2.0-or-later +#include #include #include struct pglist_data *node_data[MAX_NUMNODES]; EXPORT_SYMBOL(node_data); +/* Allocate NODE_DATA for a node on the local memory */ +void __init alloc_node_data(int nid) +{ + const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES); + u64 nd_pa; + void *nd; + int tnid; + + /* Allocate node data. Try node-local memory and then any node. 
*/ + nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid); + if (!nd_pa) + panic("Cannot allocate %zu bytes for node %d data\n", + nd_size, nid); + nd = __va(nd_pa); + + /* report and initialize */ + pr_info("NODE_DATA(%d) allocated [mem %#010Lx-%#010Lx]\n", nid, + nd_pa, nd_pa + nd_size - 1); + tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT); + if (tnid != nid) + pr_info(" NODE_DATA(%d) on node %d\n", nid, tnid); + + node_data[nid] = nd; + memset(NODE_DATA(nid), 0, sizeof(pg_data_t)); +} + void __init alloc_offline_node_data(int nid) { pg_data_t *pgdat;
From 9916c27d1ff0964ba7ada63909e9eec17a318363 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:40:54 +0300 Subject: [PATCH 169/416] x86/numa: simplify numa_distance allocation Allocation of numa_distance uses memblock_phys_alloc_range() to limit allocation to be below the last mapped page. But NUMA initialization runs after the direct map is populated and there is also code in setup_arch() that adjusts the memblock limit to reflect how much memory is already mapped in the direct map. Simplify the allocation of numa_distance and use plain memblock_alloc(). Link: https://lkml.kernel.org/r/20240807064110.1003856-11-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Tested-by: Zi Yan # for x86_64 and arm64 Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/mm/numa.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 5e1dde26674b..edfc38803779 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -331,7 +331,6 @@ static int __init numa_alloc_distance(void) nodemask_t nodes_parsed; size_t size; int i, j, cnt = 0; - u64 phys; /* size the new table and allocate it */ nodes_parsed = numa_nodes_parsed; @@ -342,16 +341,14 @@ static int __init numa_alloc_distance(void) cnt++; size = cnt * cnt * sizeof(numa_distance[0]); - phys = memblock_phys_alloc_range(size, PAGE_SIZE, 0, - PFN_PHYS(max_pfn_mapped)); - if (!phys) { + numa_distance = memblock_alloc(size, PAGE_SIZE); + if (!numa_distance) { pr_warn("Warning: can't allocate distance table!\n"); /* don't retry until explicitly reset */ numa_distance = (void *)1LU; return -ENOMEM; } - numa_distance = __va(phys); numa_distance_cnt = cnt; /* fill with the default distances */
From 77c1d0e7c580c98b256ee81763120e2c228d90cd Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:40:55 +0300 Subject: [PATCH 170/416] x86/numa: use get_pfn_range_for_nid to verify that node spans memory Instead of looping over the numa_meminfo array to detect a node's start and end addresses, use get_pfn_range_for_nid(). This is shorter and makes it easier to lift numa_memblks to generic code.
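For reference, the check being replaced amounts to a min/max scan over the node's memory blocks; below is a simplified standalone C sketch with mocked block data and an illustrative helper name, not the kernel implementation:

  #include <stdio.h>

  /* mocked per-node memory blocks, playing the role of numa_meminfo */
  struct memblk {
  	int nid;
  	unsigned long start_pfn, end_pfn;
  };

  static const struct memblk blks[] = {
  	{ .nid = 0, .start_pfn = 0x00000, .end_pfn = 0x08000 },
  	{ .nid = 1, .start_pfn = 0x08000, .end_pfn = 0x18000 },
  	{ .nid = 0, .start_pfn = 0x18000, .end_pfn = 0x20000 },
  };

  /* illustrative helper: min/max over the blocks owned by @nid */
  static void get_pfn_range(int nid, unsigned long *start, unsigned long *end)
  {
  	*start = ~0UL;
  	*end = 0;

  	for (size_t i = 0; i < sizeof(blks) / sizeof(blks[0]); i++) {
  		if (blks[i].nid != nid)
  			continue;
  		if (blks[i].start_pfn < *start)
  			*start = blks[i].start_pfn;
  		if (blks[i].end_pfn > *end)
  			*end = blks[i].end_pfn;
  	}
  	if (*start > *end)	/* node owns no memory at all */
  		*start = *end = 0;
  }

  int main(void)
  {
  	for (int nid = 0; nid < 3; nid++) {
  		unsigned long start, end;

  		get_pfn_range(nid, &start, &end);
  		if (start >= end) {
  			printf("node %d spans no memory, skipped\n", nid);
  			continue;
  		}
  		printf("node %d: pfn range [%#lx, %#lx)\n", nid, start, end);
  	}
  	return 0;
  }

Keeping that scan behind a helper instead of open-coding it in arch/x86 is what makes the later move of numa_memblks to generic code easier.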
Link: https://lkml.kernel.org/r/20240807064110.1003856-12-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Tested-by: Zi Yan # for x86_64 and arm64 Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/mm/numa.c | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-) diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index edfc38803779..30b0ec801b02 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -521,17 +521,14 @@ static int __init numa_register_memblks(struct numa_meminfo *mi) /* Finally register nodes. */ for_each_node_mask(nid, node_possible_map) { - u64 start = PFN_PHYS(max_pfn); - u64 end = 0; + unsigned long start_pfn, end_pfn; - for (i = 0; i < mi->nr_blks; i++) { - if (nid != mi->blk[i].nid) - continue; - start = min(mi->blk[i].start, start); - end = max(mi->blk[i].end, end); - } - - if (start >= end) + /* + * Note, get_pfn_range_for_nid() depends on + * memblock_set_node() having already happened + */ + get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); + if (start_pfn >= end_pfn) continue; alloc_node_data(nid); From e4a5e5a5c50a99f9cc489926464fe07175593552 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:40:56 +0300 Subject: [PATCH 171/416] x86/numa: move FAKE_NODE_* defines to numa_emu The definitions of FAKE_NODE_MIN_SIZE and FAKE_NODE_MIN_HASH_MASK are only used by numa emulation code, make them local to arch/x86/mm/numa_emulation.c Link: https://lkml.kernel.org/r/20240807064110.1003856-13-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Jonathan Cameron Tested-by: Zi Yan # for x86_64 and arm64 Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. 
Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/include/asm/numa.h | 2 -- arch/x86/mm/numa_emulation.c | 3 +++ 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h index ef2844d69173..2dab1ada96cf 100644 --- a/arch/x86/include/asm/numa.h +++ b/arch/x86/include/asm/numa.h @@ -71,8 +71,6 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable); #endif #ifdef CONFIG_NUMA_EMU -#define FAKE_NODE_MIN_SIZE ((u64)32 << 20) -#define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1UL)) int numa_emu_cmdline(char *str); #else /* CONFIG_NUMA_EMU */ static inline int numa_emu_cmdline(char *str) diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c index 9a9305367fdd..1ce22e315b80 100644 --- a/arch/x86/mm/numa_emulation.c +++ b/arch/x86/mm/numa_emulation.c @@ -10,6 +10,9 @@ #include "numa_internal.h" +#define FAKE_NODE_MIN_SIZE ((u64)32 << 20) +#define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1UL)) + static int emu_nid_to_phys[MAX_NUMNODES]; static char *emu_cmdline __initdata; From e3c1299c3282490800b732093768e556fff3a2e2 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:40:57 +0300 Subject: [PATCH 172/416] x86/numa_emu: simplify allocation of phys_dist By the time numa_emulation() is called, all physical memory is already mapped in the direct map and there is no need to define limits for memblock allocation. Replace memblock_phys_alloc_range() with memblock_alloc(). Link: https://lkml.kernel.org/r/20240807064110.1003856-14-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Jonathan Cameron Tested-by: Zi Yan # for x86_64 and arm64 Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. 
Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/mm/numa_emulation.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c index 1ce22e315b80..439804e21962 100644 --- a/arch/x86/mm/numa_emulation.c +++ b/arch/x86/mm/numa_emulation.c @@ -448,15 +448,11 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt) /* copy the physical distance table */ if (numa_dist_cnt) { - u64 phys; - - phys = memblock_phys_alloc_range(phys_size, PAGE_SIZE, 0, - PFN_PHYS(max_pfn_mapped)); - if (!phys) { + phys_dist = memblock_alloc(phys_size, PAGE_SIZE); + if (!phys_dist) { pr_warn("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n"); goto no_emu; } - phys_dist = __va(phys); for (i = 0; i < numa_dist_cnt; i++) for (j = 0; j < numa_dist_cnt; j++) From 55e74bcca735ab51c3fa3e9571489dc72e993fa4 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:40:58 +0300 Subject: [PATCH 173/416] x86/numa_emu: split __apicid_to_node update to a helper function This is required to make numa emulation code architecture independent so that it can be moved to generic code in following commits. Link: https://lkml.kernel.org/r/20240807064110.1003856-15-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Jonathan Cameron Tested-by: Zi Yan # for x86_64 and arm64 Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/include/asm/numa.h | 2 ++ arch/x86/mm/numa.c | 22 ++++++++++++++++++++++ arch/x86/mm/numa_emulation.c | 14 +------------- 3 files changed, 25 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h index 2dab1ada96cf..7017d540894a 100644 --- a/arch/x86/include/asm/numa.h +++ b/arch/x86/include/asm/numa.h @@ -72,6 +72,8 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable); #ifdef CONFIG_NUMA_EMU int numa_emu_cmdline(char *str); +void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys, + unsigned int nr_emu_nids); #else /* CONFIG_NUMA_EMU */ static inline int numa_emu_cmdline(char *str) { diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 30b0ec801b02..ea3fc2d866e2 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -852,6 +852,28 @@ EXPORT_SYMBOL(cpumask_of_node); #endif /* !CONFIG_DEBUG_PER_CPU_MAPS */ +#ifdef CONFIG_NUMA_EMU +void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys, + unsigned int nr_emu_nids) +{ + int i, j; + + /* + * Transform __apicid_to_node table to use emulated nids by + * reverse-mapping phys_nid. The maps should always exist but fall + * back to zero just in case. 
+ */ + for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) { + if (__apicid_to_node[i] == NUMA_NO_NODE) + continue; + for (j = 0; j < nr_emu_nids; j++) + if (__apicid_to_node[i] == emu_nid_to_phys[j]) + break; + __apicid_to_node[i] = j < nr_emu_nids ? j : 0; + } +} +#endif /* CONFIG_NUMA_EMU */ + #ifdef CONFIG_NUMA_KEEP_MEMINFO static int meminfo_to_nid(struct numa_meminfo *mi, u64 start) { diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c index 439804e21962..f2746e52ab93 100644 --- a/arch/x86/mm/numa_emulation.c +++ b/arch/x86/mm/numa_emulation.c @@ -476,19 +476,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt) ei.blk[i].nid != NUMA_NO_NODE) node_set(ei.blk[i].nid, numa_nodes_parsed); - /* - * Transform __apicid_to_node table to use emulated nids by - * reverse-mapping phys_nid. The maps should always exist but fall - * back to zero just in case. - */ - for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) { - if (__apicid_to_node[i] == NUMA_NO_NODE) - continue; - for (j = 0; j < ARRAY_SIZE(emu_nid_to_phys); j++) - if (__apicid_to_node[i] == emu_nid_to_phys[j]) - break; - __apicid_to_node[i] = j < ARRAY_SIZE(emu_nid_to_phys) ? j : 0; - } + numa_emu_update_cpu_to_node(emu_nid_to_phys, ARRAY_SIZE(emu_nid_to_phys)); /* make sure all emulated nodes are mapped to a physical node */ for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++) From e52d5873d13aeb54c5baa6e1c388f6cbf69f97f4 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:40:59 +0300 Subject: [PATCH 174/416] x86/numa_emu: use a helper function to get MAX_DMA32_PFN This is required to make numa emulation code architecture independent so that it can be moved to generic code in following commits. Link: https://lkml.kernel.org/r/20240807064110.1003856-16-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Jonathan Cameron Tested-by: Zi Yan # for x86_64 and arm64 Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/include/asm/numa.h | 1 + arch/x86/mm/numa.c | 5 +++++ arch/x86/mm/numa_emulation.c | 4 ++-- 3 files changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h index 7017d540894a..b22c85c1ef18 100644 --- a/arch/x86/include/asm/numa.h +++ b/arch/x86/include/asm/numa.h @@ -74,6 +74,7 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable); int numa_emu_cmdline(char *str); void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys, unsigned int nr_emu_nids); +u64 __init numa_emu_dma_end(void); #else /* CONFIG_NUMA_EMU */ static inline int numa_emu_cmdline(char *str) { diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index ea3fc2d866e2..2acf8d17b9b8 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -872,6 +872,11 @@ void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys, __apicid_to_node[i] = j < nr_emu_nids ? 
j : 0; } } + +u64 __init numa_emu_dma_end(void) +{ + return PFN_PHYS(MAX_DMA32_PFN); +} #endif /* CONFIG_NUMA_EMU */ #ifdef CONFIG_NUMA_KEEP_MEMINFO diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c index f2746e52ab93..fb4814497446 100644 --- a/arch/x86/mm/numa_emulation.c +++ b/arch/x86/mm/numa_emulation.c @@ -128,7 +128,7 @@ static int __init split_nodes_interleave(struct numa_meminfo *ei, */ while (!nodes_empty(physnode_mask)) { for_each_node_mask(i, physnode_mask) { - u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN); + u64 dma32_end = numa_emu_dma_end(); u64 start, limit, end; int phys_blk; @@ -275,7 +275,7 @@ static int __init split_nodes_size_interleave_uniform(struct numa_meminfo *ei, */ while (!nodes_empty(physnode_mask)) { for_each_node_mask(i, physnode_mask) { - u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN); + u64 dma32_end = numa_emu_dma_end(); u64 start, limit, end; int phys_blk; From 7a7152857d962ae742dfe3eccc64d6b706f91c40 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:41:00 +0300 Subject: [PATCH 175/416] x86/numa: numa_{add,remove}_cpu: make cpu parameter unsigned CPU id cannot be negative. Making it unsigned also aligns with declarations in include/asm-generic/numa.h used by arm64 and riscv and allows sharing numa emulation code with these architectures. Link: https://lkml.kernel.org/r/20240807064110.1003856-17-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Jonathan Cameron Tested-by: Zi Yan # for x86_64 and arm64 Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. 
Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/include/asm/numa.h | 10 +++++----- arch/x86/mm/numa.c | 10 +++++----- arch/x86/mm/numa_emulation.c | 10 +++++----- 3 files changed, 15 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h index b22c85c1ef18..6fa5ea925aac 100644 --- a/arch/x86/include/asm/numa.h +++ b/arch/x86/include/asm/numa.h @@ -54,20 +54,20 @@ static inline int numa_cpu_node(int cpu) extern void numa_set_node(int cpu, int node); extern void numa_clear_node(int cpu); extern void __init init_cpu_to_node(void); -extern void numa_add_cpu(int cpu); -extern void numa_remove_cpu(int cpu); +extern void numa_add_cpu(unsigned int cpu); +extern void numa_remove_cpu(unsigned int cpu); extern void init_gi_nodes(void); #else /* CONFIG_NUMA */ static inline void numa_set_node(int cpu, int node) { } static inline void numa_clear_node(int cpu) { } static inline void init_cpu_to_node(void) { } -static inline void numa_add_cpu(int cpu) { } -static inline void numa_remove_cpu(int cpu) { } +static inline void numa_add_cpu(unsigned int cpu) { } +static inline void numa_remove_cpu(unsigned int cpu) { } static inline void init_gi_nodes(void) { } #endif /* CONFIG_NUMA */ #ifdef CONFIG_DEBUG_PER_CPU_MAPS -void debug_cpumask_set_cpu(int cpu, int node, bool enable); +void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable); #endif #ifdef CONFIG_NUMA_EMU diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 2acf8d17b9b8..bf56f667fe0f 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -741,12 +741,12 @@ void __init init_cpu_to_node(void) #ifndef CONFIG_DEBUG_PER_CPU_MAPS # ifndef CONFIG_NUMA_EMU -void numa_add_cpu(int cpu) +void numa_add_cpu(unsigned int cpu) { cpumask_set_cpu(cpu, node_to_cpumask_map[early_cpu_to_node(cpu)]); } -void numa_remove_cpu(int cpu) +void numa_remove_cpu(unsigned int cpu) { cpumask_clear_cpu(cpu, node_to_cpumask_map[early_cpu_to_node(cpu)]); } @@ -784,7 +784,7 @@ int early_cpu_to_node(int cpu) return per_cpu(x86_cpu_to_node_map, cpu); } -void debug_cpumask_set_cpu(int cpu, int node, bool enable) +void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable) { struct cpumask *mask; @@ -816,12 +816,12 @@ static void numa_set_cpumask(int cpu, bool enable) debug_cpumask_set_cpu(cpu, early_cpu_to_node(cpu), enable); } -void numa_add_cpu(int cpu) +void numa_add_cpu(unsigned int cpu) { numa_set_cpumask(cpu, true); } -void numa_remove_cpu(int cpu) +void numa_remove_cpu(unsigned int cpu) { numa_set_cpumask(cpu, false); } diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c index fb4814497446..235f8a4eb2fa 100644 --- a/arch/x86/mm/numa_emulation.c +++ b/arch/x86/mm/numa_emulation.c @@ -514,7 +514,7 @@ no_emu: } #ifndef CONFIG_DEBUG_PER_CPU_MAPS -void numa_add_cpu(int cpu) +void numa_add_cpu(unsigned int cpu) { int physnid, nid; @@ -532,7 +532,7 @@ void numa_add_cpu(int cpu) cpumask_set_cpu(cpu, node_to_cpumask_map[nid]); } -void numa_remove_cpu(int cpu) +void numa_remove_cpu(unsigned int cpu) { int i; @@ -540,7 +540,7 @@ void numa_remove_cpu(int cpu) cpumask_clear_cpu(cpu, node_to_cpumask_map[i]); } #else /* !CONFIG_DEBUG_PER_CPU_MAPS */ -static void numa_set_cpumask(int cpu, bool enable) +static void numa_set_cpumask(unsigned int cpu, bool enable) { int nid, physnid; @@ -560,12 +560,12 @@ static void numa_set_cpumask(int cpu, bool enable) } } -void numa_add_cpu(int 
cpu) +void numa_add_cpu(unsigned int cpu) { numa_set_cpumask(cpu, true); } -void numa_remove_cpu(int cpu) +void numa_remove_cpu(unsigned int cpu) { numa_set_cpumask(cpu, false); } From 87482708210ff3333adab4dc5f1a491793159d43 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:41:01 +0300 Subject: [PATCH 176/416] mm: introduce numa_memblks Move code dealing with numa_memblks from arch/x86 to mm/ and add Kconfig options to let x86 select it in its Kconfig. This code will be later reused by arch_numa. No functional changes. Link: https://lkml.kernel.org/r/20240807064110.1003856-18-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Tested-by: Zi Yan # for x86_64 and arm64 Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/Kconfig | 1 + arch/x86/include/asm/numa.h | 3 - arch/x86/mm/amdtopology.c | 1 + arch/x86/mm/numa.c | 372 +-------------------------------- arch/x86/mm/numa_emulation.c | 1 + arch/x86/mm/numa_internal.h | 15 +- drivers/acpi/numa/srat.c | 1 + drivers/of/of_numa.c | 1 + include/linux/numa_memblks.h | 35 ++++ mm/Kconfig | 3 + mm/Makefile | 1 + mm/numa_memblks.c | 385 +++++++++++++++++++++++++++++++++++ 12 files changed, 436 insertions(+), 383 deletions(-) create mode 100644 include/linux/numa_memblks.h create mode 100644 mm/numa_memblks.c diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 007bab9f2a0e..74afb59c6603 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -296,6 +296,7 @@ config X86 select NEED_PER_CPU_EMBED_FIRST_CHUNK select NEED_PER_CPU_PAGE_FIRST_CHUNK select NEED_SG_DMA_LENGTH + select NUMA_MEMBLKS if NUMA select PCI_DOMAINS if PCI select PCI_LOCKLESS_CONFIG if PCI select PERF_EVENTS diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h index 6fa5ea925aac..6e9a50bf03d4 100644 --- a/arch/x86/include/asm/numa.h +++ b/arch/x86/include/asm/numa.h @@ -10,8 +10,6 @@ #ifdef CONFIG_NUMA -#define NR_NODE_MEMBLKS (MAX_NUMNODES*2) - extern int numa_off; /* @@ -25,7 +23,6 @@ extern int numa_off; extern s16 __apicid_to_node[MAX_LOCAL_APIC]; extern nodemask_t numa_nodes_parsed __initdata; -extern int __init numa_add_memblk(int nodeid, u64 start, u64 end); extern void __init numa_set_distance(int from, int to, int distance); static inline void set_apicid_to_node(int apicid, s16 node) diff --git a/arch/x86/mm/amdtopology.c b/arch/x86/mm/amdtopology.c index 9332b36a1091..628833afee37 100644 --- a/arch/x86/mm/amdtopology.c +++ b/arch/x86/mm/amdtopology.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index bf56f667fe0f..0bada905f409 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include @@ -22,10 +23,6 @@ #include "numa_internal.h" int numa_off; -nodemask_t numa_nodes_parsed __initdata; - -static struct numa_meminfo numa_meminfo __initdata_or_meminfo; -static 
struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo; static int numa_distance_cnt; static u8 *numa_distance; @@ -121,194 +118,6 @@ void __init setup_node_to_cpumask_map(void) pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids); } -static int __init numa_add_memblk_to(int nid, u64 start, u64 end, - struct numa_meminfo *mi) -{ - /* ignore zero length blks */ - if (start == end) - return 0; - - /* whine about and ignore invalid blks */ - if (start > end || nid < 0 || nid >= MAX_NUMNODES) { - pr_warn("Warning: invalid memblk node %d [mem %#010Lx-%#010Lx]\n", - nid, start, end - 1); - return 0; - } - - if (mi->nr_blks >= NR_NODE_MEMBLKS) { - pr_err("too many memblk ranges\n"); - return -EINVAL; - } - - mi->blk[mi->nr_blks].start = start; - mi->blk[mi->nr_blks].end = end; - mi->blk[mi->nr_blks].nid = nid; - mi->nr_blks++; - return 0; -} - -/** - * numa_remove_memblk_from - Remove one numa_memblk from a numa_meminfo - * @idx: Index of memblk to remove - * @mi: numa_meminfo to remove memblk from - * - * Remove @idx'th numa_memblk from @mi by shifting @mi->blk[] and - * decrementing @mi->nr_blks. - */ -void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi) -{ - mi->nr_blks--; - memmove(&mi->blk[idx], &mi->blk[idx + 1], - (mi->nr_blks - idx) * sizeof(mi->blk[0])); -} - -/** - * numa_move_tail_memblk - Move a numa_memblk from one numa_meminfo to another - * @dst: numa_meminfo to append block to - * @idx: Index of memblk to remove - * @src: numa_meminfo to remove memblk from - */ -static void __init numa_move_tail_memblk(struct numa_meminfo *dst, int idx, - struct numa_meminfo *src) -{ - dst->blk[dst->nr_blks++] = src->blk[idx]; - numa_remove_memblk_from(idx, src); -} - -/** - * numa_add_memblk - Add one numa_memblk to numa_meminfo - * @nid: NUMA node ID of the new memblk - * @start: Start address of the new memblk - * @end: End address of the new memblk - * - * Add a new memblk to the default numa_meminfo. - * - * RETURNS: - * 0 on success, -errno on failure. - */ -int __init numa_add_memblk(int nid, u64 start, u64 end) -{ - return numa_add_memblk_to(nid, start, end, &numa_meminfo); -} - -/** - * numa_cleanup_meminfo - Cleanup a numa_meminfo - * @mi: numa_meminfo to clean up - * - * Sanitize @mi by merging and removing unnecessary memblks. Also check for - * conflicts and clear unused memblks. - * - * RETURNS: - * 0 on success, -errno on failure. - */ -int __init numa_cleanup_meminfo(struct numa_meminfo *mi) -{ - const u64 low = 0; - const u64 high = PFN_PHYS(max_pfn); - int i, j, k; - - /* first, trim all entries */ - for (i = 0; i < mi->nr_blks; i++) { - struct numa_memblk *bi = &mi->blk[i]; - - /* move / save reserved memory ranges */ - if (!memblock_overlaps_region(&memblock.memory, - bi->start, bi->end - bi->start)) { - numa_move_tail_memblk(&numa_reserved_meminfo, i--, mi); - continue; - } - - /* make sure all non-reserved blocks are inside the limits */ - bi->start = max(bi->start, low); - - /* preserve info for non-RAM areas above 'max_pfn': */ - if (bi->end > high) { - numa_add_memblk_to(bi->nid, high, bi->end, - &numa_reserved_meminfo); - bi->end = high; - } - - /* and there's no empty block */ - if (bi->start >= bi->end) - numa_remove_memblk_from(i--, mi); - } - - /* merge neighboring / overlapping entries */ - for (i = 0; i < mi->nr_blks; i++) { - struct numa_memblk *bi = &mi->blk[i]; - - for (j = i + 1; j < mi->nr_blks; j++) { - struct numa_memblk *bj = &mi->blk[j]; - u64 start, end; - - /* - * See whether there are overlapping blocks. 
Whine - * about but allow overlaps of the same nid. They - * will be merged below. - */ - if (bi->end > bj->start && bi->start < bj->end) { - if (bi->nid != bj->nid) { - pr_err("node %d [mem %#010Lx-%#010Lx] overlaps with node %d [mem %#010Lx-%#010Lx]\n", - bi->nid, bi->start, bi->end - 1, - bj->nid, bj->start, bj->end - 1); - return -EINVAL; - } - pr_warn("Warning: node %d [mem %#010Lx-%#010Lx] overlaps with itself [mem %#010Lx-%#010Lx]\n", - bi->nid, bi->start, bi->end - 1, - bj->start, bj->end - 1); - } - - /* - * Join together blocks on the same node, holes - * between which don't overlap with memory on other - * nodes. - */ - if (bi->nid != bj->nid) - continue; - start = min(bi->start, bj->start); - end = max(bi->end, bj->end); - for (k = 0; k < mi->nr_blks; k++) { - struct numa_memblk *bk = &mi->blk[k]; - - if (bi->nid == bk->nid) - continue; - if (start < bk->end && end > bk->start) - break; - } - if (k < mi->nr_blks) - continue; - printk(KERN_INFO "NUMA: Node %d [mem %#010Lx-%#010Lx] + [mem %#010Lx-%#010Lx] -> [mem %#010Lx-%#010Lx]\n", - bi->nid, bi->start, bi->end - 1, bj->start, - bj->end - 1, start, end - 1); - bi->start = start; - bi->end = end; - numa_remove_memblk_from(j--, mi); - } - } - - /* clear unused ones */ - for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) { - mi->blk[i].start = mi->blk[i].end = 0; - mi->blk[i].nid = NUMA_NO_NODE; - } - - return 0; -} - -/* - * Set nodes, which have memory in @mi, in *@nodemask. - */ -static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask, - const struct numa_meminfo *mi) -{ - int i; - - for (i = 0; i < ARRAY_SIZE(mi->blk); i++) - if (mi->blk[i].start != mi->blk[i].end && - mi->blk[i].nid != NUMA_NO_NODE) - node_set(mi->blk[i].nid, *nodemask); -} - /** * numa_reset_distance - Reset NUMA distance table * @@ -410,111 +219,13 @@ int __node_distance(int from, int to) } EXPORT_SYMBOL(__node_distance); -/* - * Mark all currently memblock-reserved physical memory (which covers the - * kernel's own memory ranges) as hot-unswappable. - */ -static void __init numa_clear_kernel_node_hotplug(void) -{ - nodemask_t reserved_nodemask = NODE_MASK_NONE; - struct memblock_region *mb_region; - int i; - - /* - * We have to do some preprocessing of memblock regions, to - * make them suitable for reservation. - * - * At this time, all memory regions reserved by memblock are - * used by the kernel, but those regions are not split up - * along node boundaries yet, and don't necessarily have their - * node ID set yet either. - * - * So iterate over all memory known to the x86 architecture, - * and use those ranges to set the nid in memblock.reserved. - * This will split up the memblock regions along node - * boundaries and will set the node IDs as well. - */ - for (i = 0; i < numa_meminfo.nr_blks; i++) { - struct numa_memblk *mb = numa_meminfo.blk + i; - int ret; - - ret = memblock_set_node(mb->start, mb->end - mb->start, &memblock.reserved, mb->nid); - WARN_ON_ONCE(ret); - } - - /* - * Now go over all reserved memblock regions, to construct a - * node mask of all kernel reserved memory areas. - * - * [ Note, when booting with mem=nn[kMG] or in a kdump kernel, - * numa_meminfo might not include all memblock.reserved - * memory ranges, because quirks such as trim_snb_memory() - * reserve specific pages for Sandy Bridge graphics. 
] - */ - for_each_reserved_mem_region(mb_region) { - int nid = memblock_get_region_node(mb_region); - - if (nid != NUMA_NO_NODE) - node_set(nid, reserved_nodemask); - } - - /* - * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory - * belonging to the reserved node mask. - * - * Note that this will include memory regions that reside - * on nodes that contain kernel memory - entire nodes - * become hot-unpluggable: - */ - for (i = 0; i < numa_meminfo.nr_blks; i++) { - struct numa_memblk *mb = numa_meminfo.blk + i; - - if (!node_isset(mb->nid, reserved_nodemask)) - continue; - - memblock_clear_hotplug(mb->start, mb->end - mb->start); - } -} - static int __init numa_register_memblks(struct numa_meminfo *mi) { - int i, nid; + int nid, err; - /* Account for nodes with cpus and no memory */ - node_possible_map = numa_nodes_parsed; - numa_nodemask_from_meminfo(&node_possible_map, mi); - if (WARN_ON(nodes_empty(node_possible_map))) - return -EINVAL; - - for (i = 0; i < mi->nr_blks; i++) { - struct numa_memblk *mb = &mi->blk[i]; - memblock_set_node(mb->start, mb->end - mb->start, - &memblock.memory, mb->nid); - } - - /* - * At very early time, the kernel have to use some memory such as - * loading the kernel image. We cannot prevent this anyway. So any - * node the kernel resides in should be un-hotpluggable. - * - * And when we come here, alloc node data won't fail. - */ - numa_clear_kernel_node_hotplug(); - - /* - * If sections array is gonna be used for pfn -> nid mapping, check - * whether its granularity is fine enough. - */ - if (IS_ENABLED(NODE_NOT_IN_PAGE_FLAGS)) { - unsigned long pfn_align = node_map_pfn_alignment(); - - if (pfn_align && pfn_align < PAGES_PER_SECTION) { - pr_warn("Node alignment %LuMB < min %LuMB, rejecting NUMA config\n", - PFN_PHYS(pfn_align) >> 20, - PFN_PHYS(PAGES_PER_SECTION) >> 20); - return -EINVAL; - } - } + err = numa_register_meminfo(mi); + if (err) + return err; if (!memblock_validate_numa_coverage(SZ_1M)) return -EINVAL; @@ -916,76 +627,3 @@ int memory_add_physaddr_to_nid(u64 start) EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); #endif - -static int __init cmp_memblk(const void *a, const void *b) -{ - const struct numa_memblk *ma = *(const struct numa_memblk **)a; - const struct numa_memblk *mb = *(const struct numa_memblk **)b; - - return (ma->start > mb->start) - (ma->start < mb->start); -} - -static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata; - -/** - * numa_fill_memblks - Fill gaps in numa_meminfo memblks - * @start: address to begin fill - * @end: address to end fill - * - * Find and extend numa_meminfo memblks to cover the physical - * address range @start-@end - * - * RETURNS: - * 0 : Success - * NUMA_NO_MEMBLK : No memblks exist in address range @start-@end - */ - -int __init numa_fill_memblks(u64 start, u64 end) -{ - struct numa_memblk **blk = &numa_memblk_list[0]; - struct numa_meminfo *mi = &numa_meminfo; - int count = 0; - u64 prev_end; - - /* - * Create a list of pointers to numa_meminfo memblks that - * overlap start, end. The list is used to make in-place - * changes that fill out the numa_meminfo memblks. 
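 * A hypothetical illustration of the end result (ranges invented): with
 * memblks [1G, 2G) and [3G, 4G) on file and a fill range of [512M, 5G),
 * the first block is stretched down to 512M, the last block's end is
 * raised to 5G, and the interior hole is closed by pulling the second
 * block's start back to 2G, leaving [512M, 2G) and [2G, 5G) with no gap.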
- */ - for (int i = 0; i < mi->nr_blks; i++) { - struct numa_memblk *bi = &mi->blk[i]; - - if (memblock_addrs_overlap(start, end - start, bi->start, - bi->end - bi->start)) { - blk[count] = &mi->blk[i]; - count++; - } - } - if (!count) - return NUMA_NO_MEMBLK; - - /* Sort the list of pointers in memblk->start order */ - sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL); - - /* Make sure the first/last memblks include start/end */ - blk[0]->start = min(blk[0]->start, start); - blk[count - 1]->end = max(blk[count - 1]->end, end); - - /* - * Fill any gaps by tracking the previous memblks - * end address and backfilling to it if needed. - */ - prev_end = blk[0]->end; - for (int i = 1; i < count; i++) { - struct numa_memblk *curr = blk[i]; - - if (prev_end >= curr->start) { - if (prev_end < curr->end) - prev_end = curr->end; - } else { - curr->start = prev_end; - prev_end = curr->end; - } - } - return 0; -} diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c index 235f8a4eb2fa..33610026b7a3 100644 --- a/arch/x86/mm/numa_emulation.c +++ b/arch/x86/mm/numa_emulation.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include "numa_internal.h" diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h index 86860f279662..a51229a2f5af 100644 --- a/arch/x86/mm/numa_internal.h +++ b/arch/x86/mm/numa_internal.h @@ -5,23 +5,12 @@ #include #include -struct numa_memblk { - u64 start; - u64 end; - int nid; -}; - -struct numa_meminfo { - int nr_blks; - struct numa_memblk blk[NR_NODE_MEMBLKS]; -}; - -void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi); -int __init numa_cleanup_meminfo(struct numa_meminfo *mi); void __init numa_reset_distance(void); void __init x86_numa_init(void); +struct numa_meminfo; + #ifdef CONFIG_NUMA_EMU void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt); diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c index 44f91f2c6c5d..bec0dcd1f9c3 100644 --- a/drivers/acpi/numa/srat.c +++ b/drivers/acpi/numa/srat.c @@ -17,6 +17,7 @@ #include #include #include +#include static nodemask_t nodes_found_map = NODE_MASK_NONE; diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c index 5949829a1b00..838747e319a2 100644 --- a/drivers/of/of_numa.c +++ b/drivers/of/of_numa.c @@ -10,6 +10,7 @@ #include #include #include +#include #include diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h new file mode 100644 index 000000000000..6981cf97d2c9 --- /dev/null +++ b/include/linux/numa_memblks.h @@ -0,0 +1,35 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __NUMA_MEMBLKS_H +#define __NUMA_MEMBLKS_H + +#ifdef CONFIG_NUMA_MEMBLKS +#include + +#define NR_NODE_MEMBLKS (MAX_NUMNODES * 2) + +struct numa_memblk { + u64 start; + u64 end; + int nid; +}; + +struct numa_meminfo { + int nr_blks; + struct numa_memblk blk[NR_NODE_MEMBLKS]; +}; + +extern struct numa_meminfo numa_meminfo __initdata_or_meminfo; +extern struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo; + +int __init numa_add_memblk(int nodeid, u64 start, u64 end); +void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi); + +int __init numa_cleanup_meminfo(struct numa_meminfo *mi); +int __init numa_register_meminfo(struct numa_meminfo *mi); + +void __init numa_nodemask_from_meminfo(nodemask_t *nodemask, + const struct numa_meminfo *mi); + +#endif /* CONFIG_NUMA_MEMBLKS */ + +#endif /* __NUMA_MEMBLKS_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 7b716ac80272..67700a127982 100644 --- 
a/mm/Kconfig +++ b/mm/Kconfig @@ -1267,6 +1267,9 @@ config IOMMU_MM_DATA config EXECMEM bool +config NUMA_MEMBLKS + bool + source "mm/damon/Kconfig" endmenu diff --git a/mm/Makefile b/mm/Makefile index 31fad7df64ad..0778dce4aef6 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -118,6 +118,7 @@ obj-$(CONFIG_Z3FOLD) += z3fold.o obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o obj-$(CONFIG_CMA) += cma.o obj-$(CONFIG_NUMA) += numa.o +obj-$(CONFIG_NUMA_MEMBLKS) += numa_memblks.o obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c new file mode 100644 index 000000000000..72f191a94c66 --- /dev/null +++ b/mm/numa_memblks.c @@ -0,0 +1,385 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include +#include +#include +#include + +nodemask_t numa_nodes_parsed __initdata; + +struct numa_meminfo numa_meminfo __initdata_or_meminfo; +struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo; + +static int __init numa_add_memblk_to(int nid, u64 start, u64 end, + struct numa_meminfo *mi) +{ + /* ignore zero length blks */ + if (start == end) + return 0; + + /* whine about and ignore invalid blks */ + if (start > end || nid < 0 || nid >= MAX_NUMNODES) { + pr_warn("Warning: invalid memblk node %d [mem %#010Lx-%#010Lx]\n", + nid, start, end - 1); + return 0; + } + + if (mi->nr_blks >= NR_NODE_MEMBLKS) { + pr_err("too many memblk ranges\n"); + return -EINVAL; + } + + mi->blk[mi->nr_blks].start = start; + mi->blk[mi->nr_blks].end = end; + mi->blk[mi->nr_blks].nid = nid; + mi->nr_blks++; + return 0; +} + +/** + * numa_remove_memblk_from - Remove one numa_memblk from a numa_meminfo + * @idx: Index of memblk to remove + * @mi: numa_meminfo to remove memblk from + * + * Remove @idx'th numa_memblk from @mi by shifting @mi->blk[] and + * decrementing @mi->nr_blks. + */ +void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi) +{ + mi->nr_blks--; + memmove(&mi->blk[idx], &mi->blk[idx + 1], + (mi->nr_blks - idx) * sizeof(mi->blk[0])); +} + +/** + * numa_move_tail_memblk - Move a numa_memblk from one numa_meminfo to another + * @dst: numa_meminfo to append block to + * @idx: Index of memblk to remove + * @src: numa_meminfo to remove memblk from + */ +static void __init numa_move_tail_memblk(struct numa_meminfo *dst, int idx, + struct numa_meminfo *src) +{ + dst->blk[dst->nr_blks++] = src->blk[idx]; + numa_remove_memblk_from(idx, src); +} + +/** + * numa_add_memblk - Add one numa_memblk to numa_meminfo + * @nid: NUMA node ID of the new memblk + * @start: Start address of the new memblk + * @end: End address of the new memblk + * + * Add a new memblk to the default numa_meminfo. + * + * RETURNS: + * 0 on success, -errno on failure. + */ +int __init numa_add_memblk(int nid, u64 start, u64 end) +{ + return numa_add_memblk_to(nid, start, end, &numa_meminfo); +} + +/** + * numa_cleanup_meminfo - Cleanup a numa_meminfo + * @mi: numa_meminfo to clean up + * + * Sanitize @mi by merging and removing unnecessary memblks. Also check for + * conflicts and clear unused memblks. + * + * RETURNS: + * 0 on success, -errno on failure. 
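 * A hypothetical example of the merge step (addresses invented): if node 0
 * registered [0, 1G) and [1G, 2G) while node 1 registered [2G, 3G), the two
 * node 0 blocks are merged into a single [0, 2G) entry and the node 1 block
 * is left untouched; blocks are never merged across nids, and a merge is
 * skipped if the joined span would overlap another node's memory.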
+ */ +int __init numa_cleanup_meminfo(struct numa_meminfo *mi) +{ + const u64 low = 0; + const u64 high = PFN_PHYS(max_pfn); + int i, j, k; + + /* first, trim all entries */ + for (i = 0; i < mi->nr_blks; i++) { + struct numa_memblk *bi = &mi->blk[i]; + + /* move / save reserved memory ranges */ + if (!memblock_overlaps_region(&memblock.memory, + bi->start, bi->end - bi->start)) { + numa_move_tail_memblk(&numa_reserved_meminfo, i--, mi); + continue; + } + + /* make sure all non-reserved blocks are inside the limits */ + bi->start = max(bi->start, low); + + /* preserve info for non-RAM areas above 'max_pfn': */ + if (bi->end > high) { + numa_add_memblk_to(bi->nid, high, bi->end, + &numa_reserved_meminfo); + bi->end = high; + } + + /* and there's no empty block */ + if (bi->start >= bi->end) + numa_remove_memblk_from(i--, mi); + } + + /* merge neighboring / overlapping entries */ + for (i = 0; i < mi->nr_blks; i++) { + struct numa_memblk *bi = &mi->blk[i]; + + for (j = i + 1; j < mi->nr_blks; j++) { + struct numa_memblk *bj = &mi->blk[j]; + u64 start, end; + + /* + * See whether there are overlapping blocks. Whine + * about but allow overlaps of the same nid. They + * will be merged below. + */ + if (bi->end > bj->start && bi->start < bj->end) { + if (bi->nid != bj->nid) { + pr_err("node %d [mem %#010Lx-%#010Lx] overlaps with node %d [mem %#010Lx-%#010Lx]\n", + bi->nid, bi->start, bi->end - 1, + bj->nid, bj->start, bj->end - 1); + return -EINVAL; + } + pr_warn("Warning: node %d [mem %#010Lx-%#010Lx] overlaps with itself [mem %#010Lx-%#010Lx]\n", + bi->nid, bi->start, bi->end - 1, + bj->start, bj->end - 1); + } + + /* + * Join together blocks on the same node, holes + * between which don't overlap with memory on other + * nodes. + */ + if (bi->nid != bj->nid) + continue; + start = min(bi->start, bj->start); + end = max(bi->end, bj->end); + for (k = 0; k < mi->nr_blks; k++) { + struct numa_memblk *bk = &mi->blk[k]; + + if (bi->nid == bk->nid) + continue; + if (start < bk->end && end > bk->start) + break; + } + if (k < mi->nr_blks) + continue; + pr_info("NUMA: Node %d [mem %#010Lx-%#010Lx] + [mem %#010Lx-%#010Lx] -> [mem %#010Lx-%#010Lx]\n", + bi->nid, bi->start, bi->end - 1, bj->start, + bj->end - 1, start, end - 1); + bi->start = start; + bi->end = end; + numa_remove_memblk_from(j--, mi); + } + } + + /* clear unused ones */ + for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) { + mi->blk[i].start = mi->blk[i].end = 0; + mi->blk[i].nid = NUMA_NO_NODE; + } + + return 0; +} + +/* + * Set nodes, which have memory in @mi, in *@nodemask. + */ +void __init numa_nodemask_from_meminfo(nodemask_t *nodemask, + const struct numa_meminfo *mi) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(mi->blk); i++) + if (mi->blk[i].start != mi->blk[i].end && + mi->blk[i].nid != NUMA_NO_NODE) + node_set(mi->blk[i].nid, *nodemask); +} + +/* + * Mark all currently memblock-reserved physical memory (which covers the + * kernel's own memory ranges) as hot-unswappable. + */ +static void __init numa_clear_kernel_node_hotplug(void) +{ + nodemask_t reserved_nodemask = NODE_MASK_NONE; + struct memblock_region *mb_region; + int i; + + /* + * We have to do some preprocessing of memblock regions, to + * make them suitable for reservation. + * + * At this time, all memory regions reserved by memblock are + * used by the kernel, but those regions are not split up + * along node boundaries yet, and don't necessarily have their + * node ID set yet either. 
+ * + * So iterate over all parsed memory blocks and use those ranges to + * set the nid in memblock.reserved. This will split up the + * memblock regions along node boundaries and will set the node IDs + * as well. + */ + for (i = 0; i < numa_meminfo.nr_blks; i++) { + struct numa_memblk *mb = numa_meminfo.blk + i; + int ret; + + ret = memblock_set_node(mb->start, mb->end - mb->start, + &memblock.reserved, mb->nid); + WARN_ON_ONCE(ret); + } + + /* + * Now go over all reserved memblock regions, to construct a + * node mask of all kernel reserved memory areas. + * + * [ Note, when booting with mem=nn[kMG] or in a kdump kernel, + * numa_meminfo might not include all memblock.reserved + * memory ranges, because quirks such as trim_snb_memory() + * reserve specific pages for Sandy Bridge graphics. ] + */ + for_each_reserved_mem_region(mb_region) { + int nid = memblock_get_region_node(mb_region); + + if (nid != MAX_NUMNODES) + node_set(nid, reserved_nodemask); + } + + /* + * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory + * belonging to the reserved node mask. + * + * Note that this will include memory regions that reside + * on nodes that contain kernel memory - entire nodes + * become hot-unpluggable: + */ + for (i = 0; i < numa_meminfo.nr_blks; i++) { + struct numa_memblk *mb = numa_meminfo.blk + i; + + if (!node_isset(mb->nid, reserved_nodemask)) + continue; + + memblock_clear_hotplug(mb->start, mb->end - mb->start); + } +} + +int __init numa_register_meminfo(struct numa_meminfo *mi) +{ + int i; + + /* Account for nodes with cpus and no memory */ + node_possible_map = numa_nodes_parsed; + numa_nodemask_from_meminfo(&node_possible_map, mi); + if (WARN_ON(nodes_empty(node_possible_map))) + return -EINVAL; + + for (i = 0; i < mi->nr_blks; i++) { + struct numa_memblk *mb = &mi->blk[i]; + + memblock_set_node(mb->start, mb->end - mb->start, + &memblock.memory, mb->nid); + } + + /* + * At very early time, the kernel have to use some memory such as + * loading the kernel image. We cannot prevent this anyway. So any + * node the kernel resides in should be un-hotpluggable. + * + * And when we come here, alloc node data won't fail. + */ + numa_clear_kernel_node_hotplug(); + + /* + * If sections array is gonna be used for pfn -> nid mapping, check + * whether its granularity is fine enough. 
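 * As a worked example (assuming x86_64 defaults with 4 KiB pages, which the
 * patch itself does not spell out): PAGES_PER_SECTION is 1 << 15, so one
 * sparsemem section spans 128 MiB; if node_map_pfn_alignment() reports node
 * boundaries aligned to only 64 MiB worth of pages, a single section could
 * straddle two nodes and the section-based pfn -> nid lookup would be
 * ambiguous, hence the configuration is rejected.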
+ */ + if (IS_ENABLED(NODE_NOT_IN_PAGE_FLAGS)) { + unsigned long pfn_align = node_map_pfn_alignment(); + + if (pfn_align && pfn_align < PAGES_PER_SECTION) { + pr_warn("Node alignment %LuMB < min %LuMB, rejecting NUMA config\n", + PFN_PHYS(pfn_align) >> 20, + PFN_PHYS(PAGES_PER_SECTION) >> 20); + return -EINVAL; + } + } + + return 0; +} + +static int __init cmp_memblk(const void *a, const void *b) +{ + const struct numa_memblk *ma = *(const struct numa_memblk **)a; + const struct numa_memblk *mb = *(const struct numa_memblk **)b; + + return (ma->start > mb->start) - (ma->start < mb->start); +} + +static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata; + +/** + * numa_fill_memblks - Fill gaps in numa_meminfo memblks + * @start: address to begin fill + * @end: address to end fill + * + * Find and extend numa_meminfo memblks to cover the physical + * address range @start-@end + * + * RETURNS: + * 0 : Success + * NUMA_NO_MEMBLK : No memblks exist in address range @start-@end + */ + +int __init numa_fill_memblks(u64 start, u64 end) +{ + struct numa_memblk **blk = &numa_memblk_list[0]; + struct numa_meminfo *mi = &numa_meminfo; + int count = 0; + u64 prev_end; + + /* + * Create a list of pointers to numa_meminfo memblks that + * overlap start, end. The list is used to make in-place + * changes that fill out the numa_meminfo memblks. + */ + for (int i = 0; i < mi->nr_blks; i++) { + struct numa_memblk *bi = &mi->blk[i]; + + if (memblock_addrs_overlap(start, end - start, bi->start, + bi->end - bi->start)) { + blk[count] = &mi->blk[i]; + count++; + } + } + if (!count) + return NUMA_NO_MEMBLK; + + /* Sort the list of pointers in memblk->start order */ + sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL); + + /* Make sure the first/last memblks include start/end */ + blk[0]->start = min(blk[0]->start, start); + blk[count - 1]->end = max(blk[count - 1]->end, end); + + /* + * Fill any gaps by tracking the previous memblks + * end address and backfilling to it if needed. + */ + prev_end = blk[0]->end; + for (int i = 1; i < count; i++) { + struct numa_memblk *curr = blk[i]; + + if (prev_end >= curr->start) { + if (prev_end < curr->end) + prev_end = curr->end; + } else { + curr->start = prev_end; + prev_end = curr->end; + } + } + return 0; +} From 75f9d4cc4eb5e15dd5ce843e33de7f6346a0d4fa Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:41:02 +0300 Subject: [PATCH 177/416] mm: move numa_distance and related code from x86 to numa_memblks Move code dealing with numa_distance array from arch/x86 to mm/numa_memblks.c This code will be later reused by arch_numa. No functional changes. Link: https://lkml.kernel.org/r/20240807064110.1003856-19-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Tested-by: Zi Yan # for x86_64 and arm64 Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. 
Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/include/asm/numa.h | 2 - arch/x86/mm/numa.c | 104 ----------------------------------- arch/x86/mm/numa_internal.h | 2 - include/linux/numa_memblks.h | 4 ++ mm/numa_memblks.c | 104 +++++++++++++++++++++++++++++++++++ 5 files changed, 108 insertions(+), 108 deletions(-) diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h index 6e9a50bf03d4..203100500f24 100644 --- a/arch/x86/include/asm/numa.h +++ b/arch/x86/include/asm/numa.h @@ -23,8 +23,6 @@ extern int numa_off; extern s16 __apicid_to_node[MAX_LOCAL_APIC]; extern nodemask_t numa_nodes_parsed __initdata; -extern void __init numa_set_distance(int from, int to, int distance); - static inline void set_apicid_to_node(int apicid, s16 node) { __apicid_to_node[apicid] = node; diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 0bada905f409..095502095503 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -24,9 +24,6 @@ int numa_off; -static int numa_distance_cnt; -static u8 *numa_distance; - static __init int numa_setup(char *opt) { if (!opt) @@ -118,107 +115,6 @@ void __init setup_node_to_cpumask_map(void) pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids); } -/** - * numa_reset_distance - Reset NUMA distance table - * - * The current table is freed. The next numa_set_distance() call will - * create a new one. - */ -void __init numa_reset_distance(void) -{ - size_t size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]); - - /* numa_distance could be 1LU marking allocation failure, test cnt */ - if (numa_distance_cnt) - memblock_free(numa_distance, size); - numa_distance_cnt = 0; - numa_distance = NULL; /* enable table creation */ -} - -static int __init numa_alloc_distance(void) -{ - nodemask_t nodes_parsed; - size_t size; - int i, j, cnt = 0; - - /* size the new table and allocate it */ - nodes_parsed = numa_nodes_parsed; - numa_nodemask_from_meminfo(&nodes_parsed, &numa_meminfo); - - for_each_node_mask(i, nodes_parsed) - cnt = i; - cnt++; - size = cnt * cnt * sizeof(numa_distance[0]); - - numa_distance = memblock_alloc(size, PAGE_SIZE); - if (!numa_distance) { - pr_warn("Warning: can't allocate distance table!\n"); - /* don't retry until explicitly reset */ - numa_distance = (void *)1LU; - return -ENOMEM; - } - - numa_distance_cnt = cnt; - - /* fill with the default distances */ - for (i = 0; i < cnt; i++) - for (j = 0; j < cnt; j++) - numa_distance[i * cnt + j] = i == j ? - LOCAL_DISTANCE : REMOTE_DISTANCE; - printk(KERN_DEBUG "NUMA: Initialized distance table, cnt=%d\n", cnt); - - return 0; -} - -/** - * numa_set_distance - Set NUMA distance from one NUMA to another - * @from: the 'from' node to set distance - * @to: the 'to' node to set distance - * @distance: NUMA distance - * - * Set the distance from node @from to @to to @distance. If distance table - * doesn't exist, one which is large enough to accommodate all the currently - * known nodes will be created. - * - * If such table cannot be allocated, a warning is printed and further - * calls are ignored until the distance table is reset with - * numa_reset_distance(). - * - * If @from or @to is higher than the highest known node or lower than zero - * at the time of table creation or @distance doesn't make sense, the call - * is ignored. - * This is to allow simplification of specific NUMA config implementations. 
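 * (Layout note, unchanged by the move to mm/numa_memblks.c: distances are
 * kept in a flat numa_distance_cnt * numa_distance_cnt array of u8, and the
 * distance from node F to node T lives at index F * numa_distance_cnt + T,
 * which is exactly what __node_distance() computes.)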
- */ -void __init numa_set_distance(int from, int to, int distance) -{ - if (!numa_distance && numa_alloc_distance() < 0) - return; - - if (from >= numa_distance_cnt || to >= numa_distance_cnt || - from < 0 || to < 0) { - pr_warn_once("Warning: node ids are out of bound, from=%d to=%d distance=%d\n", - from, to, distance); - return; - } - - if ((u8)distance != distance || - (from == to && distance != LOCAL_DISTANCE)) { - pr_warn_once("Warning: invalid distance parameter, from=%d to=%d distance=%d\n", - from, to, distance); - return; - } - - numa_distance[from * numa_distance_cnt + to] = distance; -} - -int __node_distance(int from, int to) -{ - if (from >= numa_distance_cnt || to >= numa_distance_cnt) - return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE; - return numa_distance[from * numa_distance_cnt + to]; -} -EXPORT_SYMBOL(__node_distance); - static int __init numa_register_memblks(struct numa_meminfo *mi) { int nid, err; diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h index a51229a2f5af..249e3aaeadce 100644 --- a/arch/x86/mm/numa_internal.h +++ b/arch/x86/mm/numa_internal.h @@ -5,8 +5,6 @@ #include #include -void __init numa_reset_distance(void); - void __init x86_numa_init(void); struct numa_meminfo; diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h index 6981cf97d2c9..968a590535ac 100644 --- a/include/linux/numa_memblks.h +++ b/include/linux/numa_memblks.h @@ -7,6 +7,10 @@ #define NR_NODE_MEMBLKS (MAX_NUMNODES * 2) +extern int numa_distance_cnt; +void __init numa_set_distance(int from, int to, int distance); +void __init numa_reset_distance(void); + struct numa_memblk { u64 start; u64 end; diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c index 72f191a94c66..e3c3519725d4 100644 --- a/mm/numa_memblks.c +++ b/mm/numa_memblks.c @@ -7,11 +7,115 @@ #include #include +int numa_distance_cnt; +static u8 *numa_distance; + nodemask_t numa_nodes_parsed __initdata; struct numa_meminfo numa_meminfo __initdata_or_meminfo; struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo; +/** + * numa_reset_distance - Reset NUMA distance table + * + * The current table is freed. The next numa_set_distance() call will + * create a new one. + */ +void __init numa_reset_distance(void) +{ + size_t size = numa_distance_cnt * numa_distance_cnt * sizeof(numa_distance[0]); + + /* numa_distance could be 1LU marking allocation failure, test cnt */ + if (numa_distance_cnt) + memblock_free(numa_distance, size); + numa_distance_cnt = 0; + numa_distance = NULL; /* enable table creation */ +} + +static int __init numa_alloc_distance(void) +{ + nodemask_t nodes_parsed; + size_t size; + int i, j, cnt = 0; + + /* size the new table and allocate it */ + nodes_parsed = numa_nodes_parsed; + numa_nodemask_from_meminfo(&nodes_parsed, &numa_meminfo); + + for_each_node_mask(i, nodes_parsed) + cnt = i; + cnt++; + size = cnt * cnt * sizeof(numa_distance[0]); + + numa_distance = memblock_alloc(size, PAGE_SIZE); + if (!numa_distance) { + pr_warn("Warning: can't allocate distance table!\n"); + /* don't retry until explicitly reset */ + numa_distance = (void *)1LU; + return -ENOMEM; + } + + numa_distance_cnt = cnt; + + /* fill with the default distances */ + for (i = 0; i < cnt; i++) + for (j = 0; j < cnt; j++) + numa_distance[i * cnt + j] = i == j ? 
+ LOCAL_DISTANCE : REMOTE_DISTANCE; + printk(KERN_DEBUG "NUMA: Initialized distance table, cnt=%d\n", cnt); + + return 0; +} + +/** + * numa_set_distance - Set NUMA distance from one NUMA to another + * @from: the 'from' node to set distance + * @to: the 'to' node to set distance + * @distance: NUMA distance + * + * Set the distance from node @from to @to to @distance. If distance table + * doesn't exist, one which is large enough to accommodate all the currently + * known nodes will be created. + * + * If such table cannot be allocated, a warning is printed and further + * calls are ignored until the distance table is reset with + * numa_reset_distance(). + * + * If @from or @to is higher than the highest known node or lower than zero + * at the time of table creation or @distance doesn't make sense, the call + * is ignored. + * This is to allow simplification of specific NUMA config implementations. + */ +void __init numa_set_distance(int from, int to, int distance) +{ + if (!numa_distance && numa_alloc_distance() < 0) + return; + + if (from >= numa_distance_cnt || to >= numa_distance_cnt || + from < 0 || to < 0) { + pr_warn_once("Warning: node ids are out of bound, from=%d to=%d distance=%d\n", + from, to, distance); + return; + } + + if ((u8)distance != distance || + (from == to && distance != LOCAL_DISTANCE)) { + pr_warn_once("Warning: invalid distance parameter, from=%d to=%d distance=%d\n", + from, to, distance); + return; + } + + numa_distance[from * numa_distance_cnt + to] = distance; +} + +int __node_distance(int from, int to) +{ + if (from >= numa_distance_cnt || to >= numa_distance_cnt) + return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE; + return numa_distance[from * numa_distance_cnt + to]; +} +EXPORT_SYMBOL(__node_distance); + static int __init numa_add_memblk_to(int nid, u64 start, u64 end, struct numa_meminfo *mi) { From b0c4e27c687177aa36048110a12cba94bf291452 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:41:03 +0300 Subject: [PATCH 178/416] mm: introduce numa_emulation Move numa_emulation code from arch/x86 to mm/numa_emulation.c This code will be later reused by arch_numa. No functional changes. Link: https://lkml.kernel.org/r/20240807064110.1003856-20-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Tested-by: Zi Yan # for x86_64 and arm64 Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: David Hildenbrand Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. 
Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/Kconfig | 8 -------- arch/x86/include/asm/numa.h | 12 ------------ arch/x86/mm/Makefile | 1 - arch/x86/mm/numa_internal.h | 11 ----------- include/linux/numa_memblks.h | 17 +++++++++++++++++ mm/Kconfig | 8 ++++++++ mm/Makefile | 1 + {arch/x86/mm => mm}/numa_emulation.c | 4 +--- 8 files changed, 27 insertions(+), 35 deletions(-) rename {arch/x86/mm => mm}/numa_emulation.c (99%) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 74afb59c6603..acd9745bf2ae 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1600,14 +1600,6 @@ config X86_64_ACPI_NUMA help Enable ACPI SRAT based node topology detection. -config NUMA_EMU - bool "NUMA emulation" - depends on NUMA - help - Enable NUMA emulation. A flat machine will be split - into virtual nodes when booted with "numa=fake=N", where N is the - number of nodes. This is only useful for debugging. - config NODES_SHIFT int "Maximum NUMA Nodes (as a power of 2)" if !MAXSMP range 1 10 diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h index 203100500f24..5469d7a7c40f 100644 --- a/arch/x86/include/asm/numa.h +++ b/arch/x86/include/asm/numa.h @@ -65,16 +65,4 @@ static inline void init_gi_nodes(void) { } void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable); #endif -#ifdef CONFIG_NUMA_EMU -int numa_emu_cmdline(char *str); -void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys, - unsigned int nr_emu_nids); -u64 __init numa_emu_dma_end(void); -#else /* CONFIG_NUMA_EMU */ -static inline int numa_emu_cmdline(char *str) -{ - return -EINVAL; -} -#endif /* CONFIG_NUMA_EMU */ - #endif /* _ASM_X86_NUMA_H */ diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index 8d3a00e5c528..690fbf48e853 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -57,7 +57,6 @@ obj-$(CONFIG_MMIOTRACE_TEST) += testmmiotrace.o obj-$(CONFIG_NUMA) += numa.o numa_$(BITS).o obj-$(CONFIG_AMD_NUMA) += amdtopology.o obj-$(CONFIG_ACPI_NUMA) += srat.o -obj-$(CONFIG_NUMA_EMU) += numa_emulation.o obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h index 249e3aaeadce..11e1ff370c10 100644 --- a/arch/x86/mm/numa_internal.h +++ b/arch/x86/mm/numa_internal.h @@ -7,15 +7,4 @@ void __init x86_numa_init(void); -struct numa_meminfo; - -#ifdef CONFIG_NUMA_EMU -void __init numa_emulation(struct numa_meminfo *numa_meminfo, - int numa_dist_cnt); -#else -static inline void numa_emulation(struct numa_meminfo *numa_meminfo, - int numa_dist_cnt) -{ } -#endif - #endif /* __X86_MM_NUMA_INTERNAL_H */ diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h index 968a590535ac..f81f98678074 100644 --- a/include/linux/numa_memblks.h +++ b/include/linux/numa_memblks.h @@ -34,6 +34,23 @@ int __init numa_register_meminfo(struct numa_meminfo *mi); void __init numa_nodemask_from_meminfo(nodemask_t *nodemask, const struct numa_meminfo *mi); +#ifdef CONFIG_NUMA_EMU +int numa_emu_cmdline(char *str); +void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys, + unsigned int nr_emu_nids); +u64 __init numa_emu_dma_end(void); +void __init numa_emulation(struct numa_meminfo *numa_meminfo, + int numa_dist_cnt); +#else +static inline void numa_emulation(struct numa_meminfo *numa_meminfo, + int numa_dist_cnt) +{ } +static inline int numa_emu_cmdline(char *str) +{ + return 
-EINVAL; +} +#endif /* CONFIG_NUMA_EMU */ + #endif /* CONFIG_NUMA_MEMBLKS */ #endif /* __NUMA_MEMBLKS_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 67700a127982..3936fe4d26d9 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1270,6 +1270,14 @@ config EXECMEM config NUMA_MEMBLKS bool +config NUMA_EMU + bool "NUMA emulation" + depends on NUMA_MEMBLKS + help + Enable NUMA emulation. A flat machine will be split + into virtual nodes when booted with "numa=fake=N", where N is the + number of nodes. This is only useful for debugging. + source "mm/damon/Kconfig" endmenu diff --git a/mm/Makefile b/mm/Makefile index 0778dce4aef6..d5639b036166 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -119,6 +119,7 @@ obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o obj-$(CONFIG_CMA) += cma.o obj-$(CONFIG_NUMA) += numa.o obj-$(CONFIG_NUMA_MEMBLKS) += numa_memblks.o +obj-$(CONFIG_NUMA_EMU) += numa_emulation.o obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o diff --git a/arch/x86/mm/numa_emulation.c b/mm/numa_emulation.c similarity index 99% rename from arch/x86/mm/numa_emulation.c rename to mm/numa_emulation.c index 33610026b7a3..031fb9961bf7 100644 --- a/arch/x86/mm/numa_emulation.c +++ b/mm/numa_emulation.c @@ -7,9 +7,7 @@ #include #include #include -#include - -#include "numa_internal.h" +#include #define FAKE_NODE_MIN_SIZE ((u64)32 << 20) #define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1UL)) From 692d73d2f0f75e58828ab05872b3789ae1f486ef Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:41:04 +0300 Subject: [PATCH 179/416] mm: numa_memblks: introduce numa_memblks_init Move most of x86::numa_init() to numa_memblks so that the latter will be more self-contained. With this numa_memblk data structures should not be exposed to the architecture specific code. Link: https://lkml.kernel.org/r/20240807064110.1003856-21-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Tested-by: Zi Yan # for x86_64 and arm64 Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. 
Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/mm/numa.c | 40 ++++------------------------------- include/linux/numa_memblks.h | 3 +++ mm/numa_memblks.c | 41 ++++++++++++++++++++++++++++++++++++ 3 files changed, 48 insertions(+), 36 deletions(-) diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 095502095503..d23287611449 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -115,13 +115,9 @@ void __init setup_node_to_cpumask_map(void) pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids); } -static int __init numa_register_memblks(struct numa_meminfo *mi) +static int __init numa_register_nodes(void) { - int nid, err; - - err = numa_register_meminfo(mi); - if (err) - return err; + int nid; if (!memblock_validate_numa_coverage(SZ_1M)) return -EINVAL; @@ -175,39 +171,11 @@ static int __init numa_init(int (*init_func)(void)) for (i = 0; i < MAX_LOCAL_APIC; i++) set_apicid_to_node(i, NUMA_NO_NODE); - nodes_clear(numa_nodes_parsed); - nodes_clear(node_possible_map); - nodes_clear(node_online_map); - memset(&numa_meminfo, 0, sizeof(numa_meminfo)); - WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory, - NUMA_NO_NODE)); - WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved, - NUMA_NO_NODE)); - /* In case that parsing SRAT failed. */ - WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX)); - numa_reset_distance(); - - ret = init_func(); + ret = numa_memblks_init(init_func, /* memblock_force_top_down */ true); if (ret < 0) return ret; - /* - * We reset memblock back to the top-down direction - * here because if we configured ACPI_NUMA, we have - * parsed SRAT in init_func(). It is ok to have the - * reset here even if we did't configure ACPI_NUMA - * or acpi numa init fails and fallbacks to dummy - * numa init. - */ - memblock_set_bottom_up(false); - - ret = numa_cleanup_meminfo(&numa_meminfo); - if (ret < 0) - return ret; - - numa_emulation(&numa_meminfo, numa_distance_cnt); - - ret = numa_register_memblks(&numa_meminfo); + ret = numa_register_nodes(); if (ret < 0) return ret; diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h index f81f98678074..07381320848f 100644 --- a/include/linux/numa_memblks.h +++ b/include/linux/numa_memblks.h @@ -34,6 +34,9 @@ int __init numa_register_meminfo(struct numa_meminfo *mi); void __init numa_nodemask_from_meminfo(nodemask_t *nodemask, const struct numa_meminfo *mi); +int __init numa_memblks_init(int (*init_func)(void), + bool memblock_force_top_down); + #ifdef CONFIG_NUMA_EMU int numa_emu_cmdline(char *str); void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys, diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c index e3c3519725d4..7749b6f6b250 100644 --- a/mm/numa_memblks.c +++ b/mm/numa_memblks.c @@ -415,6 +415,47 @@ int __init numa_register_meminfo(struct numa_meminfo *mi) return 0; } +int __init numa_memblks_init(int (*init_func)(void), + bool memblock_force_top_down) +{ + int ret; + + nodes_clear(numa_nodes_parsed); + nodes_clear(node_possible_map); + nodes_clear(node_online_map); + memset(&numa_meminfo, 0, sizeof(numa_meminfo)); + WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory, + NUMA_NO_NODE)); + WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved, + NUMA_NO_NODE)); + /* In case that parsing SRAT failed. 
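 * (Presumably an earlier, failed SRAT parse may already have marked ranges
 * MEMBLOCK_HOTPLUG; clearing the flag wholesale here keeps a later fallback
 * to dummy NUMA init from treating boot-essential memory as hot-removable.)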
*/ + WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX)); + numa_reset_distance(); + + ret = init_func(); + if (ret < 0) + return ret; + + /* + * We reset memblock back to the top-down direction + * here because if we configured ACPI_NUMA, we have + * parsed SRAT in init_func(). It is ok to have the + * reset here even if we did't configure ACPI_NUMA + * or acpi numa init fails and fallbacks to dummy + * numa init. + */ + if (memblock_force_top_down) + memblock_set_bottom_up(false); + + ret = numa_cleanup_meminfo(&numa_meminfo); + if (ret < 0) + return ret; + + numa_emulation(&numa_meminfo, numa_distance_cnt); + + return numa_register_meminfo(&numa_meminfo); +} + static int __init cmp_memblk(const void *a, const void *b) { const struct numa_memblk *ma = *(const struct numa_memblk **)a; From 317ef4598bdc48c0196adc02ed4edaf82af279f2 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:41:05 +0300 Subject: [PATCH 180/416] mm: numa_memblks: make several functions and variables static Make functions and variables that are exclusively used by numa_memblks static. Move numa_nodemask_from_meminfo() before its callers to avoid forward declaration. Link: https://lkml.kernel.org/r/20240807064110.1003856-22-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Tested-by: Zi Yan # for x86_64 and arm64 Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. 
Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/numa_memblks.h | 8 -------- mm/numa_memblks.c | 36 ++++++++++++++++++------------------ 2 files changed, 18 insertions(+), 26 deletions(-) diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h index 07381320848f..5c6e12ad0b7a 100644 --- a/include/linux/numa_memblks.h +++ b/include/linux/numa_memblks.h @@ -7,7 +7,6 @@ #define NR_NODE_MEMBLKS (MAX_NUMNODES * 2) -extern int numa_distance_cnt; void __init numa_set_distance(int from, int to, int distance); void __init numa_reset_distance(void); @@ -22,17 +21,10 @@ struct numa_meminfo { struct numa_memblk blk[NR_NODE_MEMBLKS]; }; -extern struct numa_meminfo numa_meminfo __initdata_or_meminfo; -extern struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo; - int __init numa_add_memblk(int nodeid, u64 start, u64 end); void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi); int __init numa_cleanup_meminfo(struct numa_meminfo *mi); -int __init numa_register_meminfo(struct numa_meminfo *mi); - -void __init numa_nodemask_from_meminfo(nodemask_t *nodemask, - const struct numa_meminfo *mi); int __init numa_memblks_init(int (*init_func)(void), bool memblock_force_top_down); diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c index 7749b6f6b250..e97665a5e8ce 100644 --- a/mm/numa_memblks.c +++ b/mm/numa_memblks.c @@ -7,13 +7,27 @@ #include #include -int numa_distance_cnt; +static int numa_distance_cnt; static u8 *numa_distance; nodemask_t numa_nodes_parsed __initdata; -struct numa_meminfo numa_meminfo __initdata_or_meminfo; -struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo; +static struct numa_meminfo numa_meminfo __initdata_or_meminfo; +static struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo; + +/* + * Set nodes, which have memory in @mi, in *@nodemask. + */ +static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask, + const struct numa_meminfo *mi) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(mi->blk); i++) + if (mi->blk[i].start != mi->blk[i].end && + mi->blk[i].nid != NUMA_NO_NODE) + node_set(mi->blk[i].nid, *nodemask); +} /** * numa_reset_distance - Reset NUMA distance table @@ -290,20 +304,6 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi) return 0; } -/* - * Set nodes, which have memory in @mi, in *@nodemask. - */ -void __init numa_nodemask_from_meminfo(nodemask_t *nodemask, - const struct numa_meminfo *mi) -{ - int i; - - for (i = 0; i < ARRAY_SIZE(mi->blk); i++) - if (mi->blk[i].start != mi->blk[i].end && - mi->blk[i].nid != NUMA_NO_NODE) - node_set(mi->blk[i].nid, *nodemask); -} - /* * Mark all currently memblock-reserved physical memory (which covers the * kernel's own memory ranges) as hot-unswappable. @@ -371,7 +371,7 @@ static void __init numa_clear_kernel_node_hotplug(void) } } -int __init numa_register_meminfo(struct numa_meminfo *mi) +static int __init numa_register_meminfo(struct numa_meminfo *mi) { int i; From f7feea289f9ae3a8fb56e9daa3832949bf742c53 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:41:06 +0300 Subject: [PATCH 181/416] mm: numa_memblks: use memblock_{start,end}_of_DRAM() when sanitizing meminfo numa_cleanup_meminfo() moves blocks outside system RAM to numa_reserved_meminfo and it uses 0 and PFN_PHYS(max_pfn) to determine the memory boundaries. 
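memblock_start_of_DRAM() and memblock_end_of_DRAM(), which this change switches to, are in essence thin accessors over the first and last memblock.memory regions. A simplified sketch of their behaviour (illustration only, not the verbatim mm/memblock.c implementation):

	phys_addr_t memblock_start_of_DRAM(void)
	{
		return memblock.memory.regions[0].base;
	}

	phys_addr_t memblock_end_of_DRAM(void)
	{
		int idx = memblock.memory.cnt - 1;

		return memblock.memory.regions[idx].base +
		       memblock.memory.regions[idx].size;
	}

Because both values are derived from memblock itself rather than from 0 and the arch-maintained max_pfn, they stay meaningful for any architecture that adopts numa_memblks.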
Replace the memory range boundaries with more portable memblock_start_of_DRAM() and memblock_end_of_DRAM(). Link: https://lkml.kernel.org/r/20240807064110.1003856-23-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Tested-by: Zi Yan # for x86_64 and arm64 Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- mm/numa_memblks.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c index e97665a5e8ce..e4358ad92233 100644 --- a/mm/numa_memblks.c +++ b/mm/numa_memblks.c @@ -212,8 +212,8 @@ int __init numa_add_memblk(int nid, u64 start, u64 end) */ int __init numa_cleanup_meminfo(struct numa_meminfo *mi) { - const u64 low = 0; - const u64 high = PFN_PHYS(max_pfn); + const u64 low = memblock_start_of_DRAM(); + const u64 high = memblock_end_of_DRAM(); int i, j, k; /* first, trim all entries */ From 7e488677a54aecc78deee2f7fc464fd603b7dcfe Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:41:07 +0300 Subject: [PATCH 182/416] of, numa: return -EINVAL when no numa-node-id is found Currently of_numa_parse_memory_nodes() returns 0 if no "memory" node in device tree contains "numa-node-id" property. This makes of_numa_init() to return "success" despite no NUMA nodes were actually parsed and set up. arch_numa workarounds this by returning an error if numa_nodes_parsed is empty. numa_memblks however would WARN() in such case and since it will be used by arch_numa shortly, such warning is not desirable. Make sure of_numa_init() returns -EINVAL when no NUMA node information was found in the device tree. Link: https://lkml.kernel.org/r/20240807064110.1003856-24-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. 
Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- drivers/of/of_numa.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c index 838747e319a2..2ec20886d176 100644 --- a/drivers/of/of_numa.c +++ b/drivers/of/of_numa.c @@ -45,7 +45,7 @@ static int __init of_numa_parse_memory_nodes(void) struct device_node *np = NULL; struct resource rsrc; u32 nid; - int i, r; + int i, r = -EINVAL; for_each_node_by_type(np, "memory") { r = of_property_read_u32(np, "numa-node-id", &nid); @@ -72,7 +72,7 @@ static int __init of_numa_parse_memory_nodes(void) } } - return 0; + return r; } static int __init of_numa_parse_distance_map_v1(struct device_node *map) From 767507654c22578ea0b51d181211b2e7714ea7cd Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:41:08 +0300 Subject: [PATCH 183/416] arch_numa: switch over to numa_memblks Until now arch_numa was directly translating firmware NUMA information to memblock. Using numa_memblks as an intermediate step has a few advantages: * alignment with more battle tested x86 implementation * availability of NUMA emulation * maintaining node information for not yet populated memory Adjust a few places in numa_memblks to compile with 32-bit phys_addr_t and replace current functionality related to numa_add_memblk() and __node_distance() in arch_numa with the implementation based on numa_memblks and add functions required by numa_emulation. [rppt@kernel.org: fix section mismatch] Link: https://lkml.kernel.org/r/ZrO6cExVz1He_yPn@kernel.org [rppt@kernel.org: PFN_PHYS() translation is unnecessary here] Link: https://lkml.kernel.org/r/Zs2T5wkSYO9MGcab@kernel.org Link: https://lkml.kernel.org/r/20240807064110.1003856-25-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Tested-by: Zi Yan # for x86_64 and arm64 Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- drivers/base/Kconfig | 1 + drivers/base/arch_numa.c | 201 +++++++++++-------------------------- include/asm-generic/numa.h | 8 +- mm/numa_memblks.c | 17 ++-- 4 files changed, 76 insertions(+), 151 deletions(-) diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index 2b8fd6bb7da0..064eb52ff7e2 100644 --- a/drivers/base/Kconfig +++ b/drivers/base/Kconfig @@ -226,6 +226,7 @@ config GENERIC_ARCH_TOPOLOGY config GENERIC_ARCH_NUMA bool + select NUMA_MEMBLKS help Enable support for generic NUMA implementation. Currently, RISC-V and ARM64 use it. diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c index b6af7475ec44..e18701676426 100644 --- a/drivers/base/arch_numa.c +++ b/drivers/base/arch_numa.c @@ -12,14 +12,12 @@ #include #include #include +#include #include -nodemask_t numa_nodes_parsed __initdata; static int cpu_to_node_map[NR_CPUS] = { [0 ... 
NR_CPUS-1] = NUMA_NO_NODE }; -static int numa_distance_cnt; -static u8 *numa_distance; bool numa_off; static __init int numa_parse_early_param(char *opt) @@ -28,6 +26,8 @@ static __init int numa_parse_early_param(char *opt) return -EINVAL; if (str_has_prefix(opt, "off")) numa_off = true; + if (!strncmp(opt, "fake=", 5)) + return numa_emu_cmdline(opt + 5); return 0; } @@ -59,6 +59,7 @@ EXPORT_SYMBOL(cpumask_of_node); #endif +#ifndef CONFIG_NUMA_EMU static void numa_update_cpu(unsigned int cpu, bool remove) { int nid = cpu_to_node(cpu); @@ -81,6 +82,7 @@ void numa_remove_cpu(unsigned int cpu) { numa_update_cpu(cpu, true); } +#endif void numa_clear_node(unsigned int cpu) { @@ -142,7 +144,7 @@ void __init early_map_cpu_to_node(unsigned int cpu, int nid) unsigned long __per_cpu_offset[NR_CPUS] __read_mostly; EXPORT_SYMBOL(__per_cpu_offset); -int __init early_cpu_to_node(int cpu) +int early_cpu_to_node(int cpu) { return cpu_to_node_map[cpu]; } @@ -187,30 +189,6 @@ void __init setup_per_cpu_areas(void) } #endif -/** - * numa_add_memblk() - Set node id to memblk - * @nid: NUMA node ID of the new memblk - * @start: Start address of the new memblk - * @end: End address of the new memblk - * - * RETURNS: - * 0 on success, -errno on failure. - */ -int __init numa_add_memblk(int nid, u64 start, u64 end) -{ - int ret; - - ret = memblock_set_node(start, (end - start), &memblock.memory, nid); - if (ret < 0) { - pr_err("memblock [0x%llx - 0x%llx] failed to add on node %d\n", - start, (end - 1), nid); - return ret; - } - - node_set(nid, numa_nodes_parsed); - return ret; -} - /* * Initialize NODE_DATA for a node on the local memory */ @@ -226,116 +204,9 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn) NODE_DATA(nid)->node_spanned_pages = end_pfn - start_pfn; } -/* - * numa_free_distance - * - * The current table is freed. - */ -void __init numa_free_distance(void) -{ - size_t size; - - if (!numa_distance) - return; - - size = numa_distance_cnt * numa_distance_cnt * - sizeof(numa_distance[0]); - - memblock_free(numa_distance, size); - numa_distance_cnt = 0; - numa_distance = NULL; -} - -/* - * Create a new NUMA distance table. - */ -static int __init numa_alloc_distance(void) -{ - size_t size; - int i, j; - - size = nr_node_ids * nr_node_ids * sizeof(numa_distance[0]); - numa_distance = memblock_alloc(size, PAGE_SIZE); - if (WARN_ON(!numa_distance)) - return -ENOMEM; - - numa_distance_cnt = nr_node_ids; - - /* fill with the default distances */ - for (i = 0; i < numa_distance_cnt; i++) - for (j = 0; j < numa_distance_cnt; j++) - numa_distance[i * numa_distance_cnt + j] = i == j ? - LOCAL_DISTANCE : REMOTE_DISTANCE; - - pr_debug("Initialized distance table, cnt=%d\n", numa_distance_cnt); - - return 0; -} - -/** - * numa_set_distance() - Set inter node NUMA distance from node to node. - * @from: the 'from' node to set distance - * @to: the 'to' node to set distance - * @distance: NUMA distance - * - * Set the distance from node @from to @to to @distance. - * If distance table doesn't exist, a warning is printed. - * - * If @from or @to is higher than the highest known node or lower than zero - * or @distance doesn't make sense, the call is ignored. 
- */ -void __init numa_set_distance(int from, int to, int distance) -{ - if (!numa_distance) { - pr_warn_once("Warning: distance table not allocated yet\n"); - return; - } - - if (from >= numa_distance_cnt || to >= numa_distance_cnt || - from < 0 || to < 0) { - pr_warn_once("Warning: node ids are out of bound, from=%d to=%d distance=%d\n", - from, to, distance); - return; - } - - if ((u8)distance != distance || - (from == to && distance != LOCAL_DISTANCE)) { - pr_warn_once("Warning: invalid distance parameter, from=%d to=%d distance=%d\n", - from, to, distance); - return; - } - - numa_distance[from * numa_distance_cnt + to] = distance; -} - -/* - * Return NUMA distance @from to @to - */ -int __node_distance(int from, int to) -{ - if (from >= numa_distance_cnt || to >= numa_distance_cnt) - return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE; - return numa_distance[from * numa_distance_cnt + to]; -} -EXPORT_SYMBOL(__node_distance); - static int __init numa_register_nodes(void) { int nid; - struct memblock_region *mblk; - - /* Check that valid nid is set to memblks */ - for_each_mem_region(mblk) { - int mblk_nid = memblock_get_region_node(mblk); - phys_addr_t start = mblk->base; - phys_addr_t end = mblk->base + mblk->size - 1; - - if (mblk_nid == NUMA_NO_NODE || mblk_nid >= MAX_NUMNODES) { - pr_warn("Warning: invalid memblk node %d [mem %pap-%pap]\n", - mblk_nid, &start, &end); - return -EINVAL; - } - } /* Finally register nodes. */ for_each_node_mask(nid, numa_nodes_parsed) { @@ -360,11 +231,7 @@ static int __init numa_init(int (*init_func)(void)) nodes_clear(node_possible_map); nodes_clear(node_online_map); - ret = numa_alloc_distance(); - if (ret < 0) - return ret; - - ret = init_func(); + ret = numa_memblks_init(init_func, /* memblock_force_top_down */ false); if (ret < 0) goto out_free_distance; @@ -382,7 +249,7 @@ static int __init numa_init(int (*init_func)(void)) return 0; out_free_distance: - numa_free_distance(); + numa_reset_distance(); return ret; } @@ -412,6 +279,7 @@ static int __init dummy_numa_init(void) pr_err("NUMA init failed\n"); return ret; } + node_set(0, numa_nodes_parsed); numa_off = true; return 0; @@ -454,3 +322,54 @@ void __init arch_numa_init(void) numa_init(dummy_numa_init); } + +#ifdef CONFIG_NUMA_EMU +void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys, + unsigned int nr_emu_nids) +{ + int i, j; + + /* + * Transform cpu_to_node_map table to use emulated nids by + * reverse-mapping phys_nid. The maps should always exist but fall + * back to zero just in case. + */ + for (i = 0; i < ARRAY_SIZE(cpu_to_node_map); i++) { + if (cpu_to_node_map[i] == NUMA_NO_NODE) + continue; + for (j = 0; j < nr_emu_nids; j++) + if (cpu_to_node_map[i] == emu_nid_to_phys[j]) + break; + cpu_to_node_map[i] = j < nr_emu_nids ? j : 0; + } +} + +u64 __init numa_emu_dma_end(void) +{ + return memblock_start_of_DRAM() + SZ_4G; +} + +void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable) +{ + struct cpumask *mask; + + if (node == NUMA_NO_NODE) + return; + + mask = node_to_cpumask_map[node]; + if (!cpumask_available(mask)) { + pr_err("node_to_cpumask_map[%i] NULL\n", node); + dump_stack(); + return; + } + + if (enable) + cpumask_set_cpu(cpu, mask); + else + cpumask_clear_cpu(cpu, mask); + + pr_debug("%s cpu %d node %d: mask now %*pbl\n", + enable ? 
"numa_add_cpu" : "numa_remove_cpu", + cpu, node, cpumask_pr_args(mask)); +} +#endif /* CONFIG_NUMA_EMU */ diff --git a/include/asm-generic/numa.h b/include/asm-generic/numa.h index c32e0cf23c90..e063d6487f66 100644 --- a/include/asm-generic/numa.h +++ b/include/asm-generic/numa.h @@ -32,10 +32,8 @@ static inline const struct cpumask *cpumask_of_node(int node) void __init arch_numa_init(void); int __init numa_add_memblk(int nodeid, u64 start, u64 end); -void __init numa_set_distance(int from, int to, int distance); -void __init numa_free_distance(void); void __init early_map_cpu_to_node(unsigned int cpu, int nid); -int __init early_cpu_to_node(int cpu); +int early_cpu_to_node(int cpu); void numa_store_cpu_info(unsigned int cpu); void numa_add_cpu(unsigned int cpu); void numa_remove_cpu(unsigned int cpu); @@ -51,4 +49,8 @@ static inline int early_cpu_to_node(int cpu) { return 0; } #endif /* CONFIG_NUMA */ +#ifdef CONFIG_NUMA_EMU +void debug_cpumask_set_cpu(unsigned int cpu, int node, bool enable); +#endif + #endif /* __ASM_GENERIC_NUMA_H */ diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c index e4358ad92233..c4037faa438b 100644 --- a/mm/numa_memblks.c +++ b/mm/numa_memblks.c @@ -405,9 +405,12 @@ static int __init numa_register_meminfo(struct numa_meminfo *mi) unsigned long pfn_align = node_map_pfn_alignment(); if (pfn_align && pfn_align < PAGES_PER_SECTION) { - pr_warn("Node alignment %LuMB < min %LuMB, rejecting NUMA config\n", - PFN_PHYS(pfn_align) >> 20, - PFN_PHYS(PAGES_PER_SECTION) >> 20); + unsigned long node_align_mb = PFN_PHYS(pfn_align) >> 20; + + unsigned long sect_align_mb = PFN_PHYS(PAGES_PER_SECTION) >> 20; + + pr_warn("Node alignment %luMB < min %luMB, rejecting NUMA config\n", + node_align_mb, sect_align_mb); return -EINVAL; } } @@ -418,18 +421,18 @@ static int __init numa_register_meminfo(struct numa_meminfo *mi) int __init numa_memblks_init(int (*init_func)(void), bool memblock_force_top_down) { + phys_addr_t max_addr = (phys_addr_t)ULLONG_MAX; int ret; nodes_clear(numa_nodes_parsed); nodes_clear(node_possible_map); nodes_clear(node_online_map); memset(&numa_meminfo, 0, sizeof(numa_meminfo)); - WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory, - NUMA_NO_NODE)); - WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved, + WARN_ON(memblock_set_node(0, max_addr, &memblock.memory, NUMA_NO_NODE)); + WARN_ON(memblock_set_node(0, max_addr, &memblock.reserved, NUMA_NO_NODE)); /* In case that parsing SRAT failed. */ - WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX)); + WARN_ON(memblock_clear_hotplug(0, max_addr)); numa_reset_distance(); ret = init_func(); From 1b5695b02444660297cc54cab44f123ff28de2cc Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:41:09 +0300 Subject: [PATCH 184/416] mm: make range-to-target_node lookup facility a part of numa_memblks The x86 implementation of range-to-target_node lookup (i.e. phys_to_target_node() and memory_add_physaddr_to_nid()) relies on numa_memblks. Since numa_memblks are now part of the generic code, move these functions from x86 to mm/numa_memblks.c and select CONFIG_NUMA_KEEP_MEMINFO when CONFIG_NUMA_MEMBLKS=y for dax and cxl. 
[rppt@kernel.org: fix build] Link: https://lkml.kernel.org/r/ZtVfSt_zloPdDqVB@kernel.org Link: https://lkml.kernel.org/r/20240807064110.1003856-26-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Jonathan Cameron Tested-by: Zi Yan # for x86_64 and arm64 Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Reviewed-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- arch/x86/include/asm/sparsemem.h | 9 -------- arch/x86/mm/numa.c | 38 -------------------------------- drivers/cxl/Kconfig | 2 +- drivers/dax/Kconfig | 2 +- include/linux/numa_memblks.h | 7 ++++++ mm/numa.c | 1 + mm/numa_memblks.c | 38 ++++++++++++++++++++++++++++++++ 7 files changed, 48 insertions(+), 49 deletions(-) diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h index 64df897c0ee3..3918c7a434f5 100644 --- a/arch/x86/include/asm/sparsemem.h +++ b/arch/x86/include/asm/sparsemem.h @@ -31,13 +31,4 @@ #endif /* CONFIG_SPARSEMEM */ -#ifndef __ASSEMBLY__ -#ifdef CONFIG_NUMA_KEEP_MEMINFO -extern int phys_to_target_node(phys_addr_t start); -#define phys_to_target_node phys_to_target_node -extern int memory_add_physaddr_to_nid(u64 start); -#define memory_add_physaddr_to_nid memory_add_physaddr_to_nid -#endif -#endif /* __ASSEMBLY__ */ - #endif /* _ASM_X86_SPARSEMEM_H */ diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index d23287611449..64e5cdb2460a 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -453,41 +453,3 @@ u64 __init numa_emu_dma_end(void) return PFN_PHYS(MAX_DMA32_PFN); } #endif /* CONFIG_NUMA_EMU */ - -#ifdef CONFIG_NUMA_KEEP_MEMINFO -static int meminfo_to_nid(struct numa_meminfo *mi, u64 start) -{ - int i; - - for (i = 0; i < mi->nr_blks; i++) - if (mi->blk[i].start <= start && mi->blk[i].end > start) - return mi->blk[i].nid; - return NUMA_NO_NODE; -} - -int phys_to_target_node(phys_addr_t start) -{ - int nid = meminfo_to_nid(&numa_meminfo, start); - - /* - * Prefer online nodes, but if reserved memory might be - * hot-added continue the search with reserved ranges. 
- */ - if (nid != NUMA_NO_NODE) - return nid; - - return meminfo_to_nid(&numa_reserved_meminfo, start); -} -EXPORT_SYMBOL_GPL(phys_to_target_node); - -int memory_add_physaddr_to_nid(u64 start) -{ - int nid = meminfo_to_nid(&numa_meminfo, start); - - if (nid == NUMA_NO_NODE) - nid = numa_meminfo.blk[0].nid; - return nid; -} -EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); - -#endif diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig index 99b5c25be079..29c192f20082 100644 --- a/drivers/cxl/Kconfig +++ b/drivers/cxl/Kconfig @@ -6,7 +6,7 @@ menuconfig CXL_BUS select FW_UPLOAD select PCI_DOE select FIRMWARE_TABLE - select NUMA_KEEP_MEMINFO if (NUMA && X86) + select NUMA_KEEP_MEMINFO if NUMA_MEMBLKS help CXL is a bus that is electrically compatible with PCI Express, but layers three protocols on that signalling (CXL.io, CXL.cache, and diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig index a88744244149..d656e4c0eb84 100644 --- a/drivers/dax/Kconfig +++ b/drivers/dax/Kconfig @@ -30,7 +30,7 @@ config DEV_DAX_PMEM config DEV_DAX_HMEM tristate "HMEM DAX: direct access to 'specific purpose' memory" depends on EFI_SOFT_RESERVE - select NUMA_KEEP_MEMINFO if (NUMA && X86) + select NUMA_KEEP_MEMINFO if NUMA_MEMBLKS default DEV_DAX help EFI 2.8 platforms, and others, may advertise 'specific purpose' diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h index 5c6e12ad0b7a..cfad6ce7e1bd 100644 --- a/include/linux/numa_memblks.h +++ b/include/linux/numa_memblks.h @@ -46,6 +46,13 @@ static inline int numa_emu_cmdline(char *str) } #endif /* CONFIG_NUMA_EMU */ +#ifdef CONFIG_NUMA_KEEP_MEMINFO +extern int phys_to_target_node(u64 start); +#define phys_to_target_node phys_to_target_node +extern int memory_add_physaddr_to_nid(u64 start); +#define memory_add_physaddr_to_nid memory_add_physaddr_to_nid +#endif /* CONFIG_NUMA_KEEP_MEMINFO */ + #endif /* CONFIG_NUMA_MEMBLKS */ #endif /* __NUMA_MEMBLKS_H */ diff --git a/mm/numa.c b/mm/numa.c index 1f1582dcdf4a..e2eec07707d1 100644 --- a/mm/numa.c +++ b/mm/numa.c @@ -3,6 +3,7 @@ #include #include #include +#include struct pglist_data *node_data[MAX_NUMNODES]; EXPORT_SYMBOL(node_data); diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c index c4037faa438b..be52b93a9c58 100644 --- a/mm/numa_memblks.c +++ b/mm/numa_memblks.c @@ -531,3 +531,41 @@ int __init numa_fill_memblks(u64 start, u64 end) } return 0; } + +#ifdef CONFIG_NUMA_KEEP_MEMINFO +static int meminfo_to_nid(struct numa_meminfo *mi, u64 start) +{ + int i; + + for (i = 0; i < mi->nr_blks; i++) + if (mi->blk[i].start <= start && mi->blk[i].end > start) + return mi->blk[i].nid; + return NUMA_NO_NODE; +} + +int phys_to_target_node(u64 start) +{ + int nid = meminfo_to_nid(&numa_meminfo, start); + + /* + * Prefer online nodes, but if reserved memory might be + * hot-added continue the search with reserved ranges. 
+ */ + if (nid != NUMA_NO_NODE) + return nid; + + return meminfo_to_nid(&numa_reserved_meminfo, start); +} +EXPORT_SYMBOL_GPL(phys_to_target_node); + +int memory_add_physaddr_to_nid(u64 start) +{ + int nid = meminfo_to_nid(&numa_meminfo, start); + + if (nid == NUMA_NO_NODE) + nid = numa_meminfo.blk[0].nid; + return nid; +} +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); + +#endif /* CONFIG_NUMA_KEEP_MEMINFO */ From 101d6470805b6b4588d8fbae04f670611a1c63e3 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 7 Aug 2024 09:41:10 +0300 Subject: [PATCH 185/416] docs: move numa=fake description to kernel-parameters.txt NUMA emulation can be now enabled on arm64 and riscv in addition to x86. Move description of numa=fake parameters from x86 documentation of admin-guide/kernel-parameters.txt Link: https://lkml.kernel.org/r/20240807064110.1003856-27-rppt@kernel.org Suggested-by: Zi Yan Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Jonathan Cameron Tested-by: Jonathan Cameron [arm64 + CXL via QEMU] Acked-by: Dan Williams Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Arnd Bergmann Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christophe Leroy Cc: Dave Hansen Cc: Davidlohr Bueso Cc: David S. Miller Cc: Greg Kroah-Hartman Cc: Heiko Carstens Cc: Huacai Chen Cc: Ingo Molnar Cc: Jiaxun Yang Cc: John Paul Adrian Glaubitz Cc: Jonathan Corbet Cc: Michael Ellerman Cc: Palmer Dabbelt Cc: Rafael J. Wysocki Cc: Rob Herring (Arm) Cc: Samuel Holland Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- Documentation/admin-guide/kernel-parameters.txt | 15 +++++++++++++++ Documentation/arch/x86/x86_64/boot-options.rst | 12 ------------ 2 files changed, 15 insertions(+), 12 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 998076fd4882..8dd0aefea01e 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4123,6 +4123,21 @@ Disable NUMA, Only set up a single NUMA node spanning all memory. + numa=fake=[MG] + [KNL, ARM64, RISCV, X86, EARLY] + If given as a memory unit, fills all system RAM with + nodes of size interleaved over physical nodes. + + numa=fake= + [KNL, ARM64, RISCV, X86, EARLY] + If given as an integer, fills all system RAM with N + fake nodes interleaved over physical nodes. + + numa=fake=U + [KNL, ARM64, RISCV, X86, EARLY] + If given as an integer followed by 'U', it will + divide each physical node into N emulated nodes. + numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic NUMA balancing. Allowed values are enable and disable diff --git a/Documentation/arch/x86/x86_64/boot-options.rst b/Documentation/arch/x86/x86_64/boot-options.rst index 137432d34109..98d4805f0823 100644 --- a/Documentation/arch/x86/x86_64/boot-options.rst +++ b/Documentation/arch/x86/x86_64/boot-options.rst @@ -170,18 +170,6 @@ NUMA Don't parse the HMAT table for NUMA setup, or soft-reserved memory partitioning. - numa=fake=[MG] - If given as a memory unit, fills all system RAM with nodes of - size interleaved over physical nodes. - - numa=fake= - If given as an integer, fills all system RAM with N fake nodes - interleaved over physical nodes. - - numa=fake=U - If given as an integer followed by 'U', it will divide each - physical node into N emulated nodes. 
- ACPI ==== From b85508d7de901c23a2bd5194d43c57741d2b149f Mon Sep 17 00:00:00 2001 From: Barry Song Date: Thu, 8 Aug 2024 09:58:58 +1200 Subject: [PATCH 186/416] mm: rename instances of swap_info_struct to meaningful 'si' Patch series "mm: batch free swaps for zap_pte_range()", v3. Batch free swap slots for zap_pte_range(), making munmap three times faster when the page table entries are filled with swap entries to be freed. This is likely another advantage of using mTHP. This patch (of 3): "p" means "pointer to something", rename it to a more meaningful identifier - "si". We also have a case with the name "sis", rename it to "si" as well. Link: https://lkml.kernel.org/r/20240807215859.57491-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240807215859.57491-2-21cnbao@gmail.com Signed-off-by: Barry Song Acked-by: David Hildenbrand Cc: Chris Li Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kairui Song Cc: Kalesh Singh Cc: Ryan Roberts Cc: Zhiguo Jiang Signed-off-by: Andrew Morton --- mm/swapfile.c | 334 +++++++++++++++++++++++++------------------------- 1 file changed, 167 insertions(+), 167 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index b36a16131696..66b567216432 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -532,7 +532,7 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info * * added to free cluster list and its usage counter will be increased by 1. * Only used for initialization. */ -static void inc_cluster_info_page(struct swap_info_struct *p, +static void inc_cluster_info_page(struct swap_info_struct *si, struct swap_cluster_info *cluster_info, unsigned long page_nr) { unsigned long idx = page_nr / SWAPFILE_CLUSTER; @@ -553,28 +553,28 @@ static void inc_cluster_info_page(struct swap_info_struct *p, * which means no page in the cluster is in use, we can optionally discard * the cluster and add it to free cluster list. 
*/ -static void dec_cluster_info_page(struct swap_info_struct *p, +static void dec_cluster_info_page(struct swap_info_struct *si, struct swap_cluster_info *ci, int nr_pages) { - if (!p->cluster_info) + if (!si->cluster_info) return; VM_BUG_ON(ci->count < nr_pages); VM_BUG_ON(cluster_is_free(ci)); - lockdep_assert_held(&p->lock); + lockdep_assert_held(&si->lock); lockdep_assert_held(&ci->lock); ci->count -= nr_pages; if (!ci->count) { - free_cluster(p, ci); + free_cluster(si, ci); return; } if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE); if (ci->flags & CLUSTER_FLAG_FRAG) - p->frag_cluster_nr[ci->order]--; - list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]); + si->frag_cluster_nr[ci->order]--; + list_move_tail(&ci->list, &si->nonfull_clusters[ci->order]); ci->flags = CLUSTER_FLAG_NONFULL; } } @@ -872,19 +872,19 @@ done: return found; } -static void __del_from_avail_list(struct swap_info_struct *p) +static void __del_from_avail_list(struct swap_info_struct *si) { int nid; - assert_spin_locked(&p->lock); + assert_spin_locked(&si->lock); for_each_node(nid) - plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]); + plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]); } -static void del_from_avail_list(struct swap_info_struct *p) +static void del_from_avail_list(struct swap_info_struct *si) { spin_lock(&swap_avail_lock); - __del_from_avail_list(p); + __del_from_avail_list(si); spin_unlock(&swap_avail_lock); } @@ -905,13 +905,13 @@ static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset, } } -static void add_to_avail_list(struct swap_info_struct *p) +static void add_to_avail_list(struct swap_info_struct *si) { int nid; spin_lock(&swap_avail_lock); for_each_node(nid) - plist_add(&p->avail_lists[nid], &swap_avail_heads[nid]); + plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]); spin_unlock(&swap_avail_lock); } @@ -1291,22 +1291,22 @@ noswap: static struct swap_info_struct *_swap_info_get(swp_entry_t entry) { - struct swap_info_struct *p; + struct swap_info_struct *si; unsigned long offset; if (!entry.val) goto out; - p = swp_swap_info(entry); - if (!p) + si = swp_swap_info(entry); + if (!si) goto bad_nofile; - if (data_race(!(p->flags & SWP_USED))) + if (data_race(!(si->flags & SWP_USED))) goto bad_device; offset = swp_offset(entry); - if (offset >= p->max) + if (offset >= si->max) goto bad_offset; - if (data_race(!p->swap_map[swp_offset(entry)])) + if (data_race(!si->swap_map[swp_offset(entry)])) goto bad_free; - return p; + return si; bad_free: pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val); @@ -1339,14 +1339,14 @@ static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry, return p; } -static unsigned char __swap_entry_free_locked(struct swap_info_struct *p, +static unsigned char __swap_entry_free_locked(struct swap_info_struct *si, unsigned long offset, unsigned char usage) { unsigned char count; unsigned char has_cache; - count = p->swap_map[offset]; + count = si->swap_map[offset]; has_cache = count & SWAP_HAS_CACHE; count &= ~SWAP_HAS_CACHE; @@ -1362,7 +1362,7 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p, count = 0; } else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) { if (count == COUNT_CONTINUED) { - if (swap_count_continued(p, offset, count)) + if (swap_count_continued(si, offset, count)) count = SWAP_MAP_MAX | COUNT_CONTINUED; else count = SWAP_MAP_MAX; @@ -1372,9 +1372,9 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p, 
usage = count | has_cache; if (usage) - WRITE_ONCE(p->swap_map[offset], usage); + WRITE_ONCE(si->swap_map[offset], usage); else - WRITE_ONCE(p->swap_map[offset], SWAP_HAS_CACHE); + WRITE_ONCE(si->swap_map[offset], SWAP_HAS_CACHE); return usage; } @@ -1453,16 +1453,16 @@ put_out: return NULL; } -static unsigned char __swap_entry_free(struct swap_info_struct *p, +static unsigned char __swap_entry_free(struct swap_info_struct *si, swp_entry_t entry) { struct swap_cluster_info *ci; unsigned long offset = swp_offset(entry); unsigned char usage; - ci = lock_cluster_or_swap_info(p, offset); - usage = __swap_entry_free_locked(p, offset, 1); - unlock_cluster_or_swap_info(p, ci); + ci = lock_cluster_or_swap_info(si, offset); + usage = __swap_entry_free_locked(si, offset, 1); + unlock_cluster_or_swap_info(si, ci); if (!usage) free_swap_slot(entry); @@ -1473,27 +1473,27 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p, * Drop the last HAS_CACHE flag of swap entries, caller have to * ensure all entries belong to the same cgroup. */ -static void swap_entry_range_free(struct swap_info_struct *p, swp_entry_t entry, +static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry, unsigned int nr_pages) { unsigned long offset = swp_offset(entry); - unsigned char *map = p->swap_map + offset; + unsigned char *map = si->swap_map + offset; unsigned char *map_end = map + nr_pages; struct swap_cluster_info *ci; - ci = lock_cluster(p, offset); + ci = lock_cluster(si, offset); do { VM_BUG_ON(*map != SWAP_HAS_CACHE); *map = 0; } while (++map < map_end); - dec_cluster_info_page(p, ci, nr_pages); + dec_cluster_info_page(si, ci, nr_pages); unlock_cluster(ci); mem_cgroup_uncharge_swap(entry, nr_pages); - swap_range_free(p, offset, nr_pages); + swap_range_free(si, offset, nr_pages); } -static void cluster_swap_free_nr(struct swap_info_struct *sis, +static void cluster_swap_free_nr(struct swap_info_struct *si, unsigned long offset, int nr_pages, unsigned char usage) { @@ -1501,26 +1501,26 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis, DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 }; int i, nr; - ci = lock_cluster_or_swap_info(sis, offset); + ci = lock_cluster_or_swap_info(si, offset); while (nr_pages) { nr = min(BITS_PER_LONG, nr_pages); for (i = 0; i < nr; i++) { - if (!__swap_entry_free_locked(sis, offset + i, usage)) + if (!__swap_entry_free_locked(si, offset + i, usage)) bitmap_set(to_free, i, 1); } if (!bitmap_empty(to_free, BITS_PER_LONG)) { - unlock_cluster_or_swap_info(sis, ci); + unlock_cluster_or_swap_info(si, ci); for_each_set_bit(i, to_free, BITS_PER_LONG) - free_swap_slot(swp_entry(sis->type, offset + i)); + free_swap_slot(swp_entry(si->type, offset + i)); if (nr == nr_pages) return; bitmap_clear(to_free, 0, BITS_PER_LONG); - ci = lock_cluster_or_swap_info(sis, offset); + ci = lock_cluster_or_swap_info(si, offset); } offset += nr; nr_pages -= nr; } - unlock_cluster_or_swap_info(sis, ci); + unlock_cluster_or_swap_info(si, ci); } /* @@ -1646,28 +1646,28 @@ int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry) int swp_swapcount(swp_entry_t entry) { int count, tmp_count, n; - struct swap_info_struct *p; + struct swap_info_struct *si; struct swap_cluster_info *ci; struct page *page; pgoff_t offset; unsigned char *map; - p = _swap_info_get(entry); - if (!p) + si = _swap_info_get(entry); + if (!si) return 0; offset = swp_offset(entry); - ci = lock_cluster_or_swap_info(p, offset); + ci = lock_cluster_or_swap_info(si, offset); - count = 
swap_count(p->swap_map[offset]); + count = swap_count(si->swap_map[offset]); if (!(count & COUNT_CONTINUED)) goto out; count &= ~COUNT_CONTINUED; n = SWAP_MAP_MAX + 1; - page = vmalloc_to_page(p->swap_map + offset); + page = vmalloc_to_page(si->swap_map + offset); offset &= ~PAGE_MASK; VM_BUG_ON(page_private(page) != SWP_CONTINUED); @@ -1681,7 +1681,7 @@ int swp_swapcount(swp_entry_t entry) n *= (SWAP_CONT_MAX + 1); } while (tmp_count & COUNT_CONTINUED); out: - unlock_cluster_or_swap_info(p, ci); + unlock_cluster_or_swap_info(si, ci); return count; } @@ -2539,52 +2539,52 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span) return generic_swapfile_activate(sis, swap_file, span); } -static int swap_node(struct swap_info_struct *p) +static int swap_node(struct swap_info_struct *si) { struct block_device *bdev; - if (p->bdev) - bdev = p->bdev; + if (si->bdev) + bdev = si->bdev; else - bdev = p->swap_file->f_inode->i_sb->s_bdev; + bdev = si->swap_file->f_inode->i_sb->s_bdev; return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE; } -static void setup_swap_info(struct swap_info_struct *p, int prio, +static void setup_swap_info(struct swap_info_struct *si, int prio, unsigned char *swap_map, struct swap_cluster_info *cluster_info) { int i; if (prio >= 0) - p->prio = prio; + si->prio = prio; else - p->prio = --least_priority; + si->prio = --least_priority; /* * the plist prio is negated because plist ordering is * low-to-high, while swap ordering is high-to-low */ - p->list.prio = -p->prio; + si->list.prio = -si->prio; for_each_node(i) { - if (p->prio >= 0) - p->avail_lists[i].prio = -p->prio; + if (si->prio >= 0) + si->avail_lists[i].prio = -si->prio; else { - if (swap_node(p) == i) - p->avail_lists[i].prio = 1; + if (swap_node(si) == i) + si->avail_lists[i].prio = 1; else - p->avail_lists[i].prio = -p->prio; + si->avail_lists[i].prio = -si->prio; } } - p->swap_map = swap_map; - p->cluster_info = cluster_info; + si->swap_map = swap_map; + si->cluster_info = cluster_info; } -static void _enable_swap_info(struct swap_info_struct *p) +static void _enable_swap_info(struct swap_info_struct *si) { - p->flags |= SWP_WRITEOK; - atomic_long_add(p->pages, &nr_swap_pages); - total_swap_pages += p->pages; + si->flags |= SWP_WRITEOK; + atomic_long_add(si->pages, &nr_swap_pages); + total_swap_pages += si->pages; assert_spin_locked(&swap_lock); /* @@ -2597,40 +2597,40 @@ static void _enable_swap_info(struct swap_info_struct *p) * which allocates swap pages from the highest available priority * swap_info_struct. */ - plist_add(&p->list, &swap_active_head); + plist_add(&si->list, &swap_active_head); /* add to available list iff swap device is not full */ - if (p->highest_bit) - add_to_avail_list(p); + if (si->highest_bit) + add_to_avail_list(si); } -static void enable_swap_info(struct swap_info_struct *p, int prio, +static void enable_swap_info(struct swap_info_struct *si, int prio, unsigned char *swap_map, struct swap_cluster_info *cluster_info) { spin_lock(&swap_lock); - spin_lock(&p->lock); - setup_swap_info(p, prio, swap_map, cluster_info); - spin_unlock(&p->lock); + spin_lock(&si->lock); + setup_swap_info(si, prio, swap_map, cluster_info); + spin_unlock(&si->lock); spin_unlock(&swap_lock); /* * Finished initializing swap device, now it's safe to reference it. 
*/ - percpu_ref_resurrect(&p->users); + percpu_ref_resurrect(&si->users); spin_lock(&swap_lock); - spin_lock(&p->lock); - _enable_swap_info(p); - spin_unlock(&p->lock); + spin_lock(&si->lock); + _enable_swap_info(si); + spin_unlock(&si->lock); spin_unlock(&swap_lock); } -static void reinsert_swap_info(struct swap_info_struct *p) +static void reinsert_swap_info(struct swap_info_struct *si) { spin_lock(&swap_lock); - spin_lock(&p->lock); - setup_swap_info(p, p->prio, p->swap_map, p->cluster_info); - _enable_swap_info(p); - spin_unlock(&p->lock); + spin_lock(&si->lock); + setup_swap_info(si, si->prio, si->swap_map, si->cluster_info); + _enable_swap_info(si); + spin_unlock(&si->lock); spin_unlock(&swap_lock); } @@ -3016,20 +3016,20 @@ static struct swap_info_struct *alloc_swap_info(void) return p; } -static int claim_swapfile(struct swap_info_struct *p, struct inode *inode) +static int claim_swapfile(struct swap_info_struct *si, struct inode *inode) { if (S_ISBLK(inode->i_mode)) { - p->bdev = I_BDEV(inode); + si->bdev = I_BDEV(inode); /* * Zoned block devices contain zones that have a sequential * write only restriction. Hence zoned block devices are not * suitable for swapping. Disallow them here. */ - if (bdev_is_zoned(p->bdev)) + if (bdev_is_zoned(si->bdev)) return -EINVAL; - p->flags |= SWP_BLKDEV; + si->flags |= SWP_BLKDEV; } else if (S_ISREG(inode->i_mode)) { - p->bdev = inode->i_sb->s_bdev; + si->bdev = inode->i_sb->s_bdev; } return 0; @@ -3064,7 +3064,7 @@ __weak unsigned long arch_max_swapfile_size(void) return generic_max_swapfile_size(); } -static unsigned long read_swap_header(struct swap_info_struct *p, +static unsigned long read_swap_header(struct swap_info_struct *si, union swap_header *swap_header, struct inode *inode) { @@ -3095,9 +3095,9 @@ static unsigned long read_swap_header(struct swap_info_struct *p, return 0; } - p->lowest_bit = 1; - p->cluster_next = 1; - p->cluster_nr = 0; + si->lowest_bit = 1; + si->cluster_next = 1; + si->cluster_nr = 0; maxpages = swapfile_maximum_size; last_page = swap_header->info.last_page; @@ -3115,7 +3115,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p, if ((unsigned int)maxpages == 0) maxpages = UINT_MAX; } - p->highest_bit = maxpages - 1; + si->highest_bit = maxpages - 1; if (!maxpages) return 0; @@ -3139,7 +3139,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p, #define SWAP_CLUSTER_COLS \ max_t(unsigned int, SWAP_CLUSTER_INFO_COLS, SWAP_CLUSTER_SPACE_COLS) -static int setup_swap_map_and_extents(struct swap_info_struct *p, +static int setup_swap_map_and_extents(struct swap_info_struct *si, union swap_header *swap_header, unsigned char *swap_map, struct swap_cluster_info *cluster_info, @@ -3150,19 +3150,19 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, unsigned int nr_good_pages; int nr_extents; unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER); - unsigned long col = p->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS; + unsigned long col = si->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS; unsigned long i, idx; nr_good_pages = maxpages - 1; /* omit header page */ - INIT_LIST_HEAD(&p->free_clusters); - INIT_LIST_HEAD(&p->full_clusters); - INIT_LIST_HEAD(&p->discard_clusters); + INIT_LIST_HEAD(&si->free_clusters); + INIT_LIST_HEAD(&si->full_clusters); + INIT_LIST_HEAD(&si->discard_clusters); for (i = 0; i < SWAP_NR_ORDERS; i++) { - INIT_LIST_HEAD(&p->nonfull_clusters[i]); - INIT_LIST_HEAD(&p->frag_clusters[i]); - p->frag_cluster_nr[i] = 0; + 
INIT_LIST_HEAD(&si->nonfull_clusters[i]); + INIT_LIST_HEAD(&si->frag_clusters[i]); + si->frag_cluster_nr[i] = 0; } for (i = 0; i < swap_header->info.nr_badpages; i++) { @@ -3176,13 +3176,13 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, * Haven't marked the cluster free yet, no list * operation involved */ - inc_cluster_info_page(p, cluster_info, page_nr); + inc_cluster_info_page(si, cluster_info, page_nr); } } /* Haven't marked the cluster free yet, no list operation involved */ for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) - inc_cluster_info_page(p, cluster_info, i); + inc_cluster_info_page(si, cluster_info, i); if (nr_good_pages) { swap_map[0] = SWAP_MAP_BAD; @@ -3190,13 +3190,13 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, * Not mark the cluster free yet, no list * operation involved */ - inc_cluster_info_page(p, cluster_info, 0); - p->max = maxpages; - p->pages = nr_good_pages; - nr_extents = setup_swap_extents(p, span); + inc_cluster_info_page(si, cluster_info, 0); + si->max = maxpages; + si->pages = nr_good_pages; + nr_extents = setup_swap_extents(si, span); if (nr_extents < 0) return nr_extents; - nr_good_pages = p->pages; + nr_good_pages = si->pages; } if (!nr_good_pages) { pr_warn("Empty swap-file\n"); @@ -3220,11 +3220,11 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, continue; if (ci->count) { ci->flags = CLUSTER_FLAG_NONFULL; - list_add_tail(&ci->list, &p->nonfull_clusters[0]); + list_add_tail(&ci->list, &si->nonfull_clusters[0]); continue; } ci->flags = CLUSTER_FLAG_FREE; - list_add_tail(&ci->list, &p->free_clusters); + list_add_tail(&ci->list, &si->free_clusters); } } return nr_extents; @@ -3232,7 +3232,7 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) { - struct swap_info_struct *p; + struct swap_info_struct *si; struct filename *name; struct file *swap_file = NULL; struct address_space *mapping; @@ -3258,11 +3258,11 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) if (!swap_avail_heads) return -ENOMEM; - p = alloc_swap_info(); - if (IS_ERR(p)) - return PTR_ERR(p); + si = alloc_swap_info(); + if (IS_ERR(si)) + return PTR_ERR(si); - INIT_WORK(&p->discard_work, swap_discard_work); + INIT_WORK(&si->discard_work, swap_discard_work); name = getname(specialfile); if (IS_ERR(name)) { @@ -3277,12 +3277,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) goto bad_swap; } - p->swap_file = swap_file; + si->swap_file = swap_file; mapping = swap_file->f_mapping; dentry = swap_file->f_path.dentry; inode = mapping->host; - error = claim_swapfile(p, inode); + error = claim_swapfile(si, inode); if (unlikely(error)) goto bad_swap; @@ -3310,7 +3310,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) } swap_header = kmap(page); - maxpages = read_swap_header(p, swap_header, inode); + maxpages = read_swap_header(si, swap_header, inode); if (unlikely(!maxpages)) { error = -EINVAL; goto bad_swap_unlock_inode; @@ -3323,19 +3323,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) goto bad_swap_unlock_inode; } - if (p->bdev && bdev_stable_writes(p->bdev)) - p->flags |= SWP_STABLE_WRITES; + if (si->bdev && bdev_stable_writes(si->bdev)) + si->flags |= SWP_STABLE_WRITES; - if (p->bdev && bdev_synchronous(p->bdev)) - p->flags |= SWP_SYNCHRONOUS_IO; + if (si->bdev && bdev_synchronous(si->bdev)) + si->flags |= 
SWP_SYNCHRONOUS_IO; - if (p->bdev && bdev_nonrot(p->bdev)) { + if (si->bdev && bdev_nonrot(si->bdev)) { int cpu, i; unsigned long ci, nr_cluster; - p->flags |= SWP_SOLIDSTATE; - p->cluster_next_cpu = alloc_percpu(unsigned int); - if (!p->cluster_next_cpu) { + si->flags |= SWP_SOLIDSTATE; + si->cluster_next_cpu = alloc_percpu(unsigned int); + if (!si->cluster_next_cpu) { error = -ENOMEM; goto bad_swap_unlock_inode; } @@ -3344,8 +3344,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) * SSD */ for_each_possible_cpu(cpu) { - per_cpu(*p->cluster_next_cpu, cpu) = - get_random_u32_inclusive(1, p->highest_bit); + per_cpu(*si->cluster_next_cpu, cpu) = + get_random_u32_inclusive(1, si->highest_bit); } nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER); @@ -3359,15 +3359,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) for (ci = 0; ci < nr_cluster; ci++) spin_lock_init(&((cluster_info + ci)->lock)); - p->percpu_cluster = alloc_percpu(struct percpu_cluster); - if (!p->percpu_cluster) { + si->percpu_cluster = alloc_percpu(struct percpu_cluster); + if (!si->percpu_cluster) { error = -ENOMEM; goto bad_swap_unlock_inode; } for_each_possible_cpu(cpu) { struct percpu_cluster *cluster; - cluster = per_cpu_ptr(p->percpu_cluster, cpu); + cluster = per_cpu_ptr(si->percpu_cluster, cpu); for (i = 0; i < SWAP_NR_ORDERS; i++) cluster->next[i] = SWAP_NEXT_INVALID; } @@ -3376,11 +3376,11 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) inced_nr_rotate_swap = true; } - error = swap_cgroup_swapon(p->type, maxpages); + error = swap_cgroup_swapon(si->type, maxpages); if (error) goto bad_swap_unlock_inode; - nr_extents = setup_swap_map_and_extents(p, swap_header, swap_map, + nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map, cluster_info, maxpages, &span); if (unlikely(nr_extents < 0)) { error = nr_extents; @@ -3388,14 +3388,14 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) } if ((swap_flags & SWAP_FLAG_DISCARD) && - p->bdev && bdev_max_discard_sectors(p->bdev)) { + si->bdev && bdev_max_discard_sectors(si->bdev)) { /* * When discard is enabled for swap with no particular * policy flagged, we set all swap discard flags here in * order to sustain backward compatibility with older * swapon(8) releases. */ - p->flags |= (SWP_DISCARDABLE | SWP_AREA_DISCARD | + si->flags |= (SWP_DISCARDABLE | SWP_AREA_DISCARD | SWP_PAGE_DISCARD); /* @@ -3405,24 +3405,24 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) * Now it's time to adjust the p->flags accordingly. 
*/ if (swap_flags & SWAP_FLAG_DISCARD_ONCE) - p->flags &= ~SWP_PAGE_DISCARD; + si->flags &= ~SWP_PAGE_DISCARD; else if (swap_flags & SWAP_FLAG_DISCARD_PAGES) - p->flags &= ~SWP_AREA_DISCARD; + si->flags &= ~SWP_AREA_DISCARD; /* issue a swapon-time discard if it's still required */ - if (p->flags & SWP_AREA_DISCARD) { - int err = discard_swap(p); + if (si->flags & SWP_AREA_DISCARD) { + int err = discard_swap(si); if (unlikely(err)) pr_err("swapon: discard_swap(%p): %d\n", - p, err); + si, err); } } - error = init_swap_address_space(p->type, maxpages); + error = init_swap_address_space(si->type, maxpages); if (error) goto bad_swap_unlock_inode; - error = zswap_swapon(p->type, maxpages); + error = zswap_swapon(si->type, maxpages); if (error) goto free_swap_address_space; @@ -3442,15 +3442,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) if (swap_flags & SWAP_FLAG_PREFER) prio = (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT; - enable_swap_info(p, prio, swap_map, cluster_info); + enable_swap_info(si, prio, swap_map, cluster_info); pr_info("Adding %uk swap on %s. Priority:%d extents:%d across:%lluk %s%s%s%s\n", - K(p->pages), name->name, p->prio, nr_extents, + K(si->pages), name->name, si->prio, nr_extents, K((unsigned long long)span), - (p->flags & SWP_SOLIDSTATE) ? "SS" : "", - (p->flags & SWP_DISCARDABLE) ? "D" : "", - (p->flags & SWP_AREA_DISCARD) ? "s" : "", - (p->flags & SWP_PAGE_DISCARD) ? "c" : ""); + (si->flags & SWP_SOLIDSTATE) ? "SS" : "", + (si->flags & SWP_DISCARDABLE) ? "D" : "", + (si->flags & SWP_AREA_DISCARD) ? "s" : "", + (si->flags & SWP_PAGE_DISCARD) ? "c" : ""); mutex_unlock(&swapon_mutex); atomic_inc(&proc_poll_event); @@ -3459,22 +3459,22 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) error = 0; goto out; free_swap_zswap: - zswap_swapoff(p->type); + zswap_swapoff(si->type); free_swap_address_space: - exit_swap_address_space(p->type); + exit_swap_address_space(si->type); bad_swap_unlock_inode: inode_unlock(inode); bad_swap: - free_percpu(p->percpu_cluster); - p->percpu_cluster = NULL; - free_percpu(p->cluster_next_cpu); - p->cluster_next_cpu = NULL; + free_percpu(si->percpu_cluster); + si->percpu_cluster = NULL; + free_percpu(si->cluster_next_cpu); + si->cluster_next_cpu = NULL; inode = NULL; - destroy_swap_extents(p); - swap_cgroup_swapoff(p->type); + destroy_swap_extents(si); + swap_cgroup_swapoff(si->type); spin_lock(&swap_lock); - p->swap_file = NULL; - p->flags = 0; + si->swap_file = NULL; + si->flags = 0; spin_unlock(&swap_lock); vfree(swap_map); kvfree(cluster_info); @@ -3526,23 +3526,23 @@ void si_swapinfo(struct sysinfo *val) */ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) { - struct swap_info_struct *p; + struct swap_info_struct *si; struct swap_cluster_info *ci; unsigned long offset; unsigned char count; unsigned char has_cache; int err, i; - p = swp_swap_info(entry); + si = swp_swap_info(entry); offset = swp_offset(entry); VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); VM_WARN_ON(usage == 1 && nr > 1); - ci = lock_cluster_or_swap_info(p, offset); + ci = lock_cluster_or_swap_info(si, offset); err = 0; for (i = 0; i < nr; i++) { - count = p->swap_map[offset + i]; + count = si->swap_map[offset + i]; /* * swapin_readahead() doesn't check if a swap entry is valid, so the @@ -3570,7 +3570,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) } for (i = 0; i < nr; i++) { - count = p->swap_map[offset + i]; + count = 
si->swap_map[offset + i]; has_cache = count & SWAP_HAS_CACHE; count &= ~SWAP_HAS_CACHE; @@ -3578,7 +3578,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) has_cache = SWAP_HAS_CACHE; else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) count += usage; - else if (swap_count_continued(p, offset + i, count)) + else if (swap_count_continued(si, offset + i, count)) count = COUNT_CONTINUED; else { /* @@ -3589,11 +3589,11 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) goto unlock_out; } - WRITE_ONCE(p->swap_map[offset + i], count | has_cache); + WRITE_ONCE(si->swap_map[offset + i], count | has_cache); } unlock_out: - unlock_cluster_or_swap_info(p, ci); + unlock_cluster_or_swap_info(si, ci); return err; } From bea67dcc5eea0f6a213897a59b9e644977cd07e6 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Thu, 8 Aug 2024 09:58:59 +1200 Subject: [PATCH 187/416] mm: attempt to batch free swap entries for zap_pte_range() Zhiguo reported that swap release could be a serious bottleneck during process exits[1]. With mTHP, we have the opportunity to batch free swaps. Thanks to the work of Chris and Kairui[2], I was able to achieve this optimization with minimal code changes by building on their efforts. If swap_count is 1, which is likely true as most anon memory are private, we can free all contiguous swap slots all together. Ran the below test program for measuring the bandwidth of munmap using zRAM and 64KiB mTHP: #include #include #include unsigned long long tv_to_ms(struct timeval tv) { return tv.tv_sec * 1000 + tv.tv_usec / 1000; } main() { struct timeval tv_b, tv_e; int i; #define SIZE 1024*1024*1024 void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (!p) { perror("fail to get memory"); exit(-1); } madvise(p, SIZE, MADV_HUGEPAGE); memset(p, 0x11, SIZE); /* write to get mem */ madvise(p, SIZE, MADV_PAGEOUT); gettimeofday(&tv_b, NULL); munmap(p, SIZE); gettimeofday(&tv_e, NULL); printf("munmap in bandwidth: %ld bytes/ms\n", SIZE/(tv_to_ms(tv_e) - tv_to_ms(tv_b))); } The result is as below (munmap bandwidth): mm-unstable mm-unstable-with-patch round1 21053761 63161283 round2 21053761 63161283 round3 21053761 63161283 round4 20648881 67108864 round5 20648881 67108864 munmap bandwidth becomes 3X faster. 
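Expressed as latency for the 1 GiB test region, the bandwidth figures above
mean the munmap() takes roughly 51 ms before the patch (about 21 MB/ms) and
roughly 16-17 ms with it (about 63-67 MB/ms).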
[1] https://lore.kernel.org/linux-mm/20240731133318.527-1-justinjiang@vivo.com/ [2] https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/ [v-songbaohua@oppo.com: check all swaps belong to same swap_cgroup in swap_pte_batch()] Link: https://lkml.kernel.org/r/20240815215308.55233-1-21cnbao@gmail.com [hughd@google.com: add mem_cgroup_disabled() check] Link: https://lkml.kernel.org/r/33f34a88-0130-5444-9b84-93198eeb50e7@google.com [21cnbao@gmail.com: add missing zswap_invalidate()] Link: https://lkml.kernel.org/r/20240821054921.43468-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240807215859.57491-3-21cnbao@gmail.com Signed-off-by: Barry Song Signed-off-by: Hugh Dickins Acked-by: David Hildenbrand Cc: Kairui Song Cc: Chris Li Cc: "Huang, Ying" Cc: Kalesh Singh Cc: Ryan Roberts Cc: Barry Song Cc: Yosry Ahmed Signed-off-by: Andrew Morton --- mm/internal.h | 9 ++++-- mm/swap_cgroup.c | 2 ++ mm/swapfile.c | 78 +++++++++++++++++++++++++++++++++++++++++------- 3 files changed, 76 insertions(+), 13 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index acda347620c6..cbe4849f6e73 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -15,6 +15,7 @@ #include #include #include +#include #include /* Internal core VMA manipulation functions. */ @@ -275,18 +276,22 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte) { pte_t expected_pte = pte_next_swp_offset(pte); const pte_t *end_ptep = start_ptep + max_nr; + swp_entry_t entry = pte_to_swp_entry(pte); pte_t *ptep = start_ptep + 1; + unsigned short cgroup_id; VM_WARN_ON(max_nr < 1); VM_WARN_ON(!is_swap_pte(pte)); - VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte))); + VM_WARN_ON(non_swap_entry(entry)); + cgroup_id = lookup_swap_cgroup_id(entry); while (ptep < end_ptep) { pte = ptep_get(ptep); if (!pte_same(pte, expected_pte)) break; - + if (lookup_swap_cgroup_id(pte_to_swp_entry(pte)) != cgroup_id) + break; expected_pte = pte_next_swp_offset(expected_pte); ptep++; } diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c index db6c4a26cf59..da1278f0563b 100644 --- a/mm/swap_cgroup.c +++ b/mm/swap_cgroup.c @@ -161,6 +161,8 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id, */ unsigned short lookup_swap_cgroup_id(swp_entry_t ent) { + if (mem_cgroup_disabled()) + return 0; return lookup_swap_cgroup(ent, NULL)->id; } diff --git a/mm/swapfile.c b/mm/swapfile.c index 66b567216432..5e7cbdaf725d 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -156,6 +156,25 @@ static bool swap_is_has_cache(struct swap_info_struct *si, return true; } +static bool swap_is_last_map(struct swap_info_struct *si, + unsigned long offset, int nr_pages, bool *has_cache) +{ + unsigned char *map = si->swap_map + offset; + unsigned char *map_end = map + nr_pages; + unsigned char count = *map; + + if (swap_count(count) != 1) + return false; + + while (++map < map_end) { + if (*map != count) + return false; + } + + *has_cache = !!(count & SWAP_HAS_CACHE); + return true; +} + /* * returns number of pages in the folio that backs the swap entry. If positive, * the folio was reclaimed. If negative, the folio was not reclaimed. 
If 0, no @@ -1469,6 +1488,53 @@ static unsigned char __swap_entry_free(struct swap_info_struct *si, return usage; } +static bool __swap_entries_free(struct swap_info_struct *si, + swp_entry_t entry, int nr) +{ + unsigned long offset = swp_offset(entry); + unsigned int type = swp_type(entry); + struct swap_cluster_info *ci; + bool has_cache = false; + unsigned char count; + int i; + + if (nr <= 1 || swap_count(data_race(si->swap_map[offset])) != 1) + goto fallback; + /* cross into another cluster */ + if (nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER) + goto fallback; + + ci = lock_cluster_or_swap_info(si, offset); + if (!swap_is_last_map(si, offset, nr, &has_cache)) { + unlock_cluster_or_swap_info(si, ci); + goto fallback; + } + for (i = 0; i < nr; i++) + WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); + unlock_cluster_or_swap_info(si, ci); + + if (!has_cache) { + for (i = 0; i < nr; i++) + zswap_invalidate(swp_entry(si->type, offset + i)); + spin_lock(&si->lock); + swap_entry_range_free(si, entry, nr); + spin_unlock(&si->lock); + } + return has_cache; + +fallback: + for (i = 0; i < nr; i++) { + if (data_race(si->swap_map[offset + i])) { + count = __swap_entry_free(si, swp_entry(type, offset + i)); + if (count == SWAP_HAS_CACHE) + has_cache = true; + } else { + WARN_ON_ONCE(1); + } + } + return has_cache; +} + /* * Drop the last HAS_CACHE flag of swap entries, caller have to * ensure all entries belong to the same cgroup. @@ -1792,11 +1858,9 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) { const unsigned long start_offset = swp_offset(entry); const unsigned long end_offset = start_offset + nr; - unsigned int type = swp_type(entry); struct swap_info_struct *si; bool any_only_cache = false; unsigned long offset; - unsigned char count; if (non_swap_entry(entry)) return; @@ -1811,15 +1875,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) /* * First free all entries in the range. */ - for (offset = start_offset; offset < end_offset; offset++) { - if (data_race(si->swap_map[offset])) { - count = __swap_entry_free(si, swp_entry(type, offset)); - if (count == SWAP_HAS_CACHE) - any_only_cache = true; - } else { - WARN_ON_ONCE(1); - } - } + any_only_cache = __swap_entries_free(si, entry, nr); /* * Short-circuit the below loop if none of the entries had their From 650180760be6bb448609d4d155eef3b728ace641 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Mon, 12 Aug 2024 15:42:02 +0800 Subject: [PATCH 188/416] mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting Patch series "support large folio swap-out and swap-in for shmem", v5. Shmem will support large folio allocation [1] [2] to get a better performance, however, the memory reclaim still splits the precious large folios when trying to swap-out shmem, which may lead to the memory fragmentation issue and can not take advantage of the large folio for shmeme. Moreover, the swap code already supports for swapping out large folio without split, and large folio swap-in[3] series is queued into mm-unstable branch. Hence this patch set also supports the large folio swap-out and swap-in for shmem. This patch (of 9): To support shmem large folio swap operations, add a new parameter to swap_shmem_alloc() that allows batch SWAP_MAP_SHMEM flag setting for shmem swap entries. While we are at it, using folio_nr_pages() to get the number of pages of the folio as a preparation. 
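As a rough before/after sketch of the interface change (the real caller
update in shmem_writepage() is in the diff below), assuming 'swap' is the
swap entry allocated for the folio:

	/* before: one entry marked per call */
	swap_shmem_alloc(swap);

	/* after: mark all entries covered by a (possibly large) folio at once */
	swap_shmem_alloc(swap, folio_nr_pages(folio));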
Link: https://lkml.kernel.org/r/cover.1723434324.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/99f64115d04b285e009580eb177352c57119ffd0.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Reviewed-by: Barry Song Cc: Chris Li Cc: Daniel Gomez Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kefeng Wang Cc: Lance Yang Cc: Matthew Wilcox Cc: Pankaj Raghav Cc: Ryan Roberts Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/swap.h | 4 ++-- mm/shmem.c | 6 ++++-- mm/swapfile.c | 4 ++-- 3 files changed, 8 insertions(+), 6 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 1c8f844a9f0f..248db1dd7812 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -481,7 +481,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry); extern swp_entry_t get_swap_page_of_type(int); extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order); extern int add_swap_count_continuation(swp_entry_t, gfp_t); -extern void swap_shmem_alloc(swp_entry_t); +extern void swap_shmem_alloc(swp_entry_t, int); extern int swap_duplicate(swp_entry_t); extern int swapcache_prepare(swp_entry_t entry, int nr); extern void swap_free_nr(swp_entry_t entry, int nr_pages); @@ -548,7 +548,7 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask) return 0; } -static inline void swap_shmem_alloc(swp_entry_t swp) +static inline void swap_shmem_alloc(swp_entry_t swp, int nr) { } diff --git a/mm/shmem.c b/mm/shmem.c index 4a5254bfd610..22cdc10f27ea 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1452,6 +1452,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); swp_entry_t swap; pgoff_t index; + int nr_pages; /* * Our capabilities prevent regular writeback or sync from ever calling @@ -1484,6 +1485,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) } index = folio->index; + nr_pages = folio_nr_pages(folio); /* * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC @@ -1536,8 +1538,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) if (add_to_swap_cache(folio, swap, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN, NULL) == 0) { - shmem_recalc_inode(inode, 0, 1); - swap_shmem_alloc(swap); + shmem_recalc_inode(inode, 0, nr_pages); + swap_shmem_alloc(swap, nr_pages); shmem_delete_from_page_cache(folio, swp_to_radix_entry(swap)); mutex_unlock(&shmem_swaplist_mutex); diff --git a/mm/swapfile.c b/mm/swapfile.c index 5e7cbdaf725d..8ff58be40544 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3657,9 +3657,9 @@ unlock_out: * Help swapoff by noting that swap entry belongs to shmem/tmpfs * (in which case its reference count is never incremented). */ -void swap_shmem_alloc(swp_entry_t entry) +void swap_shmem_alloc(swp_entry_t entry, int nr) { - __swap_duplicate(entry, SWAP_MAP_SHMEM, 1); + __swap_duplicate(entry, SWAP_MAP_SHMEM, nr); } /* From 50f381eccefda0cdaf7aa617587dc04cb6652085 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Mon, 12 Aug 2024 15:42:03 +0800 Subject: [PATCH 189/416] mm: shmem: extend shmem_partial_swap_usage() to support large folio swap To support shmem large folio swapout in the following patches, using xa_get_order() to get the order of the swap entry to calculate the swap usage of shmem. 
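Put differently, the accounting rule (a hedged restatement of the one-line change in the diff that follows) is that a swap value entry of order N stands for 1 << N swapped pages:

	if (xa_is_value(page))
		swapped += 1 << xa_get_order(xas.xa, xas.xa_index);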
Link: https://lkml.kernel.org/r/60b130b9fc3e422bb91293a172c2113c85e9233a.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Cc: Barry Song Cc: Chris Li Cc: Daniel Gomez Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kefeng Wang Cc: Lance Yang Cc: Matthew Wilcox Cc: Pankaj Raghav Cc: Ryan Roberts Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/shmem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/shmem.c b/mm/shmem.c index 22cdc10f27ea..02fb188d627f 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -890,7 +890,7 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping, if (xas_retry(&xas, page)) continue; if (xa_is_value(page)) - swapped++; + swapped += 1 << xa_get_order(xas.xa, xas.xa_index); if (xas.xa_index == max) break; if (need_resched()) { From 6ea0d1ccb110387244e04637f28a1d2eda54e3fb Mon Sep 17 00:00:00 2001 From: Daniel Gomez Date: Mon, 12 Aug 2024 15:42:04 +0800 Subject: [PATCH 190/416] mm: shmem: return number of pages being freed in shmem_free_swap Both shmem_free_swap callers expect the number of pages being freed. In the large folio context, this needs to support larger values other than 0 (used as 1 page being freed) and -ENOENT (used as 0 pages being freed). In preparation for large folios adoption, make the shmem_free_swap routine return the number of pages being freed. So, returning 0 in this context means 0 pages are being freed. While we are at it, change to use free_swap_and_cache_nr() to free large order swap entries (change by Baolin Wang). Link: https://lkml.kernel.org/r/9623e863c83d749d5ab407f6fdf0a8e5a3bdf052.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Daniel Gomez Signed-off-by: Baolin Wang Suggested-by: Matthew Wilcox Cc: Barry Song Cc: Chris Li Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kefeng Wang Cc: Lance Yang Cc: Pankaj Raghav Cc: Ryan Roberts Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/shmem.c | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index 02fb188d627f..d0d54939da48 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -856,18 +856,22 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap) } /* - * Remove swap entry from page cache, free the swap and its page cache. + * Remove swap entry from page cache, free the swap and its page cache. Returns + * the number of pages being freed. 0 means entry not found in XArray (0 pages + * being freed). 
*/ -static int shmem_free_swap(struct address_space *mapping, - pgoff_t index, void *radswap) +static long shmem_free_swap(struct address_space *mapping, + pgoff_t index, void *radswap) { + int order = xa_get_order(&mapping->i_pages, index); void *old; old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0); if (old != radswap) - return -ENOENT; - free_swap_and_cache(radix_to_swp_entry(radswap)); - return 0; + return 0; + free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order); + + return 1 << order; } /* @@ -1019,7 +1023,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, if (xa_is_value(folio)) { if (unfalloc) continue; - nr_swaps_freed += !shmem_free_swap(mapping, + nr_swaps_freed += shmem_free_swap(mapping, indices[i], folio); continue; } @@ -1086,14 +1090,17 @@ whole_folios: folio = fbatch.folios[i]; if (xa_is_value(folio)) { + long swaps_freed; + if (unfalloc) continue; - if (shmem_free_swap(mapping, indices[i], folio)) { + swaps_freed = shmem_free_swap(mapping, indices[i], folio); + if (!swaps_freed) { /* Swap was replaced by page: retry */ index = indices[i]; break; } - nr_swaps_freed++; + nr_swaps_freed += swaps_freed; continue; } From fb72415938d109fe8cca339a4f1423f76ba213c5 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Mon, 12 Aug 2024 15:42:05 +0800 Subject: [PATCH 191/416] mm: filemap: use xa_get_order() to get the swap entry order In the following patches, shmem will support the swap out of large folios, which means the shmem mappings may contain large order swap entries, so using xa_get_order() to get the folio order of the shmem swap entry to update the '*start' correctly. [hughd@google.com: use xa_get_order() to get the swap entry order] Link: https://lkml.kernel.org/r/c336e6e4-da7f-b714-c0f1-12df715f2611@google.com Link: https://lkml.kernel.org/r/6876d55145c1cc80e79df7884aa3a62e397b101d.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Signed-off-by: Hugh Dickins Cc: Barry Song Cc: Chris Li Cc: Daniel Gomez Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Kefeng Wang Cc: Lance Yang Cc: Matthew Wilcox Cc: Pankaj Raghav Cc: Ryan Roberts Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/filemap.c | 41 +++++++++++++++++++++++++++-------------- 1 file changed, 27 insertions(+), 14 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index fcfec2a78a30..fdaa1e5e3078 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2046,17 +2046,20 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start, if (!folio_batch_add(fbatch, folio)) break; } - rcu_read_unlock(); if (folio_batch_count(fbatch)) { - unsigned long nr = 1; + unsigned long nr; int idx = folio_batch_count(fbatch) - 1; folio = fbatch->folios[idx]; if (!xa_is_value(folio)) nr = folio_nr_pages(folio); - *start = indices[idx] + nr; + else + nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]); + *start = round_down(indices[idx] + nr, nr); } + rcu_read_unlock(); + return folio_batch_count(fbatch); } @@ -2088,10 +2091,17 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start, rcu_read_lock(); while ((folio = find_get_entry(&xas, end, XA_PRESENT))) { + unsigned long base; + unsigned long nr; + if (!xa_is_value(folio)) { - if (folio->index < *start) + nr = folio_nr_pages(folio); + base = folio->index; + /* Omit large folio which begins before the start */ + if (base < *start) goto put; - if (folio_next_index(folio) - 1 > end) + /* Omit large folio which extends beyond the end */ + if (base + nr - 1 > end) goto put; 
if (!folio_trylock(folio)) goto put; @@ -2100,7 +2110,19 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start, goto unlock; VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index), folio); + } else { + nr = 1 << xa_get_order(&mapping->i_pages, xas.xa_index); + base = xas.xa_index & ~(nr - 1); + /* Omit order>0 value which begins before the start */ + if (base < *start) + continue; + /* Omit order>0 value which extends beyond the end */ + if (base + nr - 1 > end) + break; } + + /* Update start now so that last update is correct on return */ + *start = base + nr; indices[fbatch->nr] = xas.xa_index; if (!folio_batch_add(fbatch, folio)) break; @@ -2112,15 +2134,6 @@ put: } rcu_read_unlock(); - if (folio_batch_count(fbatch)) { - unsigned long nr = 1; - int idx = folio_batch_count(fbatch) - 1; - - folio = fbatch->folios[idx]; - if (!xa_is_value(folio)) - nr = folio_nr_pages(folio); - *start = indices[idx] + nr; - } return folio_batch_count(fbatch); } From 40ff2d11bd58a3908897ccc689ed5d8ac498f173 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Mon, 12 Aug 2024 15:42:06 +0800 Subject: [PATCH 192/416] mm: shmem: use swap_free_nr() to free shmem swap entries As a preparation for supporting shmem large folio swapout, use swap_free_nr() to free some continuous swap entries of the shmem large folio when the large folio was swapped in from the swap cache. In addition, the index should also be round down to the number of pages when adding the swapin folio into the pagecache. Link: https://lkml.kernel.org/r/342207fa679fc88a447dac2e101ad79e6050fe79.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Cc: Barry Song Cc: Chris Li Cc: Daniel Gomez Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kefeng Wang Cc: Lance Yang Cc: Matthew Wilcox Cc: Pankaj Raghav Cc: Ryan Roberts Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/shmem.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index d0d54939da48..f6bab42180ea 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1961,6 +1961,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index, struct address_space *mapping = inode->i_mapping; swp_entry_t swapin_error; void *old; + int nr_pages; swapin_error = make_poisoned_swp_entry(); old = xa_cmpxchg_irq(&mapping->i_pages, index, @@ -1969,6 +1970,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index, if (old != swp_to_radix_entry(swap)) return; + nr_pages = folio_nr_pages(folio); folio_wait_writeback(folio); delete_from_swap_cache(folio); /* @@ -1976,8 +1978,8 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index, * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks) * in shmem_evict_inode(). 
*/ - shmem_recalc_inode(inode, -1, -1); - swap_free(swap); + shmem_recalc_inode(inode, -nr_pages, -nr_pages); + swap_free_nr(swap, nr_pages); } /* @@ -1996,7 +1998,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, struct swap_info_struct *si; struct folio *folio = NULL; swp_entry_t swap; - int error; + int error, nr_pages; VM_BUG_ON(!*foliop || !xa_is_value(*foliop)); swap = radix_to_swp_entry(*foliop); @@ -2043,6 +2045,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, goto failed; } folio_wait_writeback(folio); + nr_pages = folio_nr_pages(folio); /* * Some architectures may have to restore extra metadata to the @@ -2056,19 +2059,20 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, goto failed; } - error = shmem_add_to_page_cache(folio, mapping, index, + error = shmem_add_to_page_cache(folio, mapping, + round_down(index, nr_pages), swp_to_radix_entry(swap), gfp); if (error) goto failed; - shmem_recalc_inode(inode, 0, -1); + shmem_recalc_inode(inode, 0, -nr_pages); if (sgp == SGP_WRITE) folio_mark_accessed(folio); delete_from_swap_cache(folio); folio_mark_dirty(folio); - swap_free(swap); + swap_free_nr(swap, nr_pages); put_swap_device(si); *foliop = folio; From 736f0e03564729a5ba609c2c4fcb37ff1e92ede4 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Mon, 12 Aug 2024 15:42:07 +0800 Subject: [PATCH 193/416] mm: shmem: support large folio allocation for shmem_replace_folio() To support large folio swapin for shmem in the following patches, add large folio allocation for the new replacement folio in shmem_replace_folio(). Moreover, large folios occupy N consecutive entries in the swap cache instead of using multi-index entries like the page cache, therefore we should replace each of the consecutive entries in the swap cache instead of using shmem_replace_entry(). We also update the statistics and the folio reference count using the number of pages in the folio. 
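To spell out the layout assumption, a hedged sketch (illustrative, not part of the patch): the swap cache keeps one XArray slot per page, so an order-N folio occupies 2^N consecutive slots, which is why the replacement below walks the entries one by one:

	pgoff_t first = swap_cache_index(folio->swap);
	/* subpage i of the folio lives at swap cache index (first + i) */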
[baolin.wang@linux.alibaba.com: fix the gfp flag for large folio allocation] Link: https://lkml.kernel.org/r/5b1e9c5a-7f61-4d97-a8d7-41767ca04c77@linux.alibaba.com [baolin.wang@linux.alibaba.com: fix build without CONFIG_TRANSPARENT_HUGEPAGE] Link: https://lkml.kernel.org/r/8c03467c-63b2-43b4-9851-222d4188725c@linux.alibaba.com Link: https://lkml.kernel.org/r/a41138ecc857ef13e7c5ffa0174321e9e2c9970a.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Cc: Barry Song Cc: Chris Li Cc: Daniel Gomez Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kefeng Wang Cc: Lance Yang Cc: Matthew Wilcox Cc: Pankaj Raghav Cc: Ryan Roberts Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/shmem.c | 74 +++++++++++++++++++++++++++++++++--------------------- 1 file changed, 46 insertions(+), 28 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index f6bab42180ea..e63393ced001 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -155,7 +155,7 @@ static unsigned long shmem_default_max_inodes(void) static int shmem_swapin_folio(struct inode *inode, pgoff_t index, struct folio **foliop, enum sgp_type sgp, gfp_t gfp, - struct mm_struct *fault_mm, vm_fault_t *fault_type); + struct vm_area_struct *vma, vm_fault_t *fault_type); static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb) { @@ -1887,30 +1887,35 @@ static bool shmem_should_replace_folio(struct folio *folio, gfp_t gfp) } static int shmem_replace_folio(struct folio **foliop, gfp_t gfp, - struct shmem_inode_info *info, pgoff_t index) + struct shmem_inode_info *info, pgoff_t index, + struct vm_area_struct *vma) { - struct folio *old, *new; - struct address_space *swap_mapping; - swp_entry_t entry; - pgoff_t swap_index; - int error; - - old = *foliop; - entry = old->swap; - swap_index = swap_cache_index(entry); - swap_mapping = swap_address_space(entry); + struct folio *new, *old = *foliop; + swp_entry_t entry = old->swap; + struct address_space *swap_mapping = swap_address_space(entry); + pgoff_t swap_index = swap_cache_index(entry); + XA_STATE(xas, &swap_mapping->i_pages, swap_index); + int nr_pages = folio_nr_pages(old); + int error = 0, i; /* * We have arrived here because our zones are constrained, so don't * limit chance of success by further cpuset and node constraints. */ gfp &= ~GFP_CONSTRAINT_MASK; - VM_BUG_ON_FOLIO(folio_test_large(old), old); - new = shmem_alloc_folio(gfp, 0, info, index); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + if (nr_pages > 1) { + gfp_t huge_gfp = vma_thp_gfp_mask(vma); + + gfp = limit_gfp_mask(huge_gfp, gfp); + } +#endif + + new = shmem_alloc_folio(gfp, folio_order(old), info, index); if (!new) return -ENOMEM; - folio_get(new); + folio_ref_add(new, nr_pages); folio_copy(new, old); flush_dcache_folio(new); @@ -1920,18 +1925,25 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp, new->swap = entry; folio_set_swapcache(new); - /* - * Our caller will very soon move newpage out of swapcache, but it's - * a nice clean interface for us to replace oldpage by newpage there. 
- */ + /* Swap cache still stores N entries instead of a high-order entry */ xa_lock_irq(&swap_mapping->i_pages); - error = shmem_replace_entry(swap_mapping, swap_index, old, new); + for (i = 0; i < nr_pages; i++) { + void *item = xas_load(&xas); + + if (item != old) { + error = -ENOENT; + break; + } + + xas_store(&xas, new); + xas_next(&xas); + } if (!error) { mem_cgroup_replace_folio(old, new); - __lruvec_stat_mod_folio(new, NR_FILE_PAGES, 1); - __lruvec_stat_mod_folio(new, NR_SHMEM, 1); - __lruvec_stat_mod_folio(old, NR_FILE_PAGES, -1); - __lruvec_stat_mod_folio(old, NR_SHMEM, -1); + __lruvec_stat_mod_folio(new, NR_FILE_PAGES, nr_pages); + __lruvec_stat_mod_folio(new, NR_SHMEM, nr_pages); + __lruvec_stat_mod_folio(old, NR_FILE_PAGES, -nr_pages); + __lruvec_stat_mod_folio(old, NR_SHMEM, -nr_pages); } xa_unlock_irq(&swap_mapping->i_pages); @@ -1951,7 +1963,12 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp, old->private = NULL; folio_unlock(old); - folio_put_refs(old, 2); + /* + * The old folio are removed from swap cache, drop the 'nr_pages' + * reference, as well as one temporary reference getting from swap + * cache. + */ + folio_put_refs(old, nr_pages + 1); return error; } @@ -1990,10 +2007,11 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index, */ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, struct folio **foliop, enum sgp_type sgp, - gfp_t gfp, struct mm_struct *fault_mm, + gfp_t gfp, struct vm_area_struct *vma, vm_fault_t *fault_type) { struct address_space *mapping = inode->i_mapping; + struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL; struct shmem_inode_info *info = SHMEM_I(inode); struct swap_info_struct *si; struct folio *folio = NULL; @@ -2054,7 +2072,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, arch_swap_restore(folio_swap(swap, folio), folio); if (shmem_should_replace_folio(folio, gfp)) { - error = shmem_replace_folio(&folio, gfp, info, index); + error = shmem_replace_folio(&folio, gfp, info, index, vma); if (error) goto failed; } @@ -2135,7 +2153,7 @@ repeat: if (xa_is_value(folio)) { error = shmem_swapin_folio(inode, index, &folio, - sgp, gfp, fault_mm, fault_type); + sgp, gfp, vma, fault_type); if (error == -EEXIST) goto repeat; From 872339c31f3b2dd466320a5bed54abeccf0db47b Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Mon, 12 Aug 2024 15:42:08 +0800 Subject: [PATCH 194/416] mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache() To support large folio swapin/swapout for shmem in the following patches, drop the folio's reference count by the number of pages contained in the folio when a shmem folio is deleted from shmem pagecache after adding into swap cache. 
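For context, a hedged sketch of the reference-count symmetry this change restores (helper names as used in mm/shmem.c; the snippet itself is illustrative):

	folio_ref_add(folio, nr);	/* shmem_add_to_page_cache() takes nr references */
	...
	folio_put_refs(folio, nr);	/* shmem_delete_from_page_cache() must drop the same nr */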
Link: https://lkml.kernel.org/r/b371eadb27f42fc51261c51008fbb9a334985b4c.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Cc: Barry Song Cc: Chris Li Cc: Daniel Gomez Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kefeng Wang Cc: Lance Yang Cc: Matthew Wilcox Cc: Pankaj Raghav Cc: Ryan Roberts Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/shmem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/shmem.c b/mm/shmem.c index e63393ced001..598d003b7a89 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -851,7 +851,7 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap) __lruvec_stat_mod_folio(folio, NR_FILE_PAGES, -nr); __lruvec_stat_mod_folio(folio, NR_SHMEM, -nr); xa_unlock_irq(&mapping->i_pages); - folio_put(folio); + folio_put_refs(folio, nr); BUG_ON(error); } From 12885cbe88ddf6c5fc3306193a6449ac310f2331 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Mon, 12 Aug 2024 15:42:09 +0800 Subject: [PATCH 195/416] mm: shmem: split large entry if the swapin folio is not large Now the swap device can only swap-in order 0 folio, even though a large folio is swapped out. This requires us to split the large entry previously saved in the shmem pagecache to support the swap in of small folios. [hughd@google.com: fix warnings from kmalloc_fix_flags()] Link: https://lkml.kernel.org/r/e2a2ba5d-864c-50aa-7579-97cba1c7dd0c@google.com [baolin.wang@linux.alibaba.com: drop the 'new_order' parameter] Link: https://lkml.kernel.org/r/39c71ccf-669b-4d9f-923c-f6b9c4ceb8df@linux.alibaba.com Link: https://lkml.kernel.org/r/4a0f12f27c54a62eb4d9ca1265fed3a62531a63e.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Signed-off-by: Hugh Dickins Cc: Barry Song Cc: Chris Li Cc: Daniel Gomez Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Kefeng Wang Cc: Lance Yang Cc: Matthew Wilcox Cc: Pankaj Raghav Cc: Ryan Roberts Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/shmem.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 103 insertions(+) diff --git a/mm/shmem.c b/mm/shmem.c index 598d003b7a89..4af6f3ea9c4f 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1999,6 +1999,84 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index, swap_free_nr(swap, nr_pages); } +static int shmem_split_large_entry(struct inode *inode, pgoff_t index, + swp_entry_t swap, gfp_t gfp) +{ + struct address_space *mapping = inode->i_mapping; + XA_STATE_ORDER(xas, &mapping->i_pages, index, 0); + void *alloced_shadow = NULL; + int alloced_order = 0, i; + + /* Convert user data gfp flags to xarray node gfp flags */ + gfp &= GFP_RECLAIM_MASK; + + for (;;) { + int order = -1, split_order = 0; + void *old = NULL; + + xas_lock_irq(&xas); + old = xas_load(&xas); + if (!xa_is_value(old) || swp_to_radix_entry(swap) != old) { + xas_set_err(&xas, -EEXIST); + goto unlock; + } + + order = xas_get_order(&xas); + + /* Swap entry may have changed before we re-acquire the lock */ + if (alloced_order && + (old != alloced_shadow || order != alloced_order)) { + xas_destroy(&xas); + alloced_order = 0; + } + + /* Try to split large swap entry in pagecache */ + if (order > 0) { + if (!alloced_order) { + split_order = order; + goto unlock; + } + xas_split(&xas, old, order); + + /* + * Re-set the swap entry after splitting, and the swap + * offset of the original large entry must be continuous. 
+ */ + for (i = 0; i < 1 << order; i++) { + pgoff_t aligned_index = round_down(index, 1 << order); + swp_entry_t tmp; + + tmp = swp_entry(swp_type(swap), swp_offset(swap) + i); + __xa_store(&mapping->i_pages, aligned_index + i, + swp_to_radix_entry(tmp), 0); + } + } + +unlock: + xas_unlock_irq(&xas); + + /* split needed, alloc here and retry. */ + if (split_order) { + xas_split_alloc(&xas, old, split_order, gfp); + if (xas_error(&xas)) + goto error; + alloced_shadow = old; + alloced_order = split_order; + xas_reset(&xas); + continue; + } + + if (!xas_nomem(&xas, gfp)) + break; + } + +error: + if (xas_error(&xas)) + return xas_error(&xas); + + return alloced_order; +} + /* * Swap in the folio pointed to by *foliop. * Caller has to make sure that *foliop contains a valid swapped folio. @@ -2036,12 +2114,37 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, /* Look it up and read it in.. */ folio = swap_cache_get_folio(swap, NULL, 0); if (!folio) { + int split_order; + /* Or update major stats only when swapin succeeds?? */ if (fault_type) { *fault_type |= VM_FAULT_MAJOR; count_vm_event(PGMAJFAULT); count_memcg_event_mm(fault_mm, PGMAJFAULT); } + + /* + * Now swap device can only swap in order 0 folio, then we + * should split the large swap entry stored in the pagecache + * if necessary. + */ + split_order = shmem_split_large_entry(inode, index, swap, gfp); + if (split_order < 0) { + error = split_order; + goto failed; + } + + /* + * If the large swap entry has already been split, it is + * necessary to recalculate the new swap entry based on + * the old order alignment. + */ + if (split_order > 0) { + pgoff_t offset = index - round_down(index, 1 << split_order); + + swap = swp_entry(swp_type(swap), swp_offset(swap) + offset); + } + /* Here we actually start the io */ folio = shmem_swapin_cluster(swap, gfp, info, index); if (!folio) { From 809bc86517cc408b5b8cb8e08e69096639432bc8 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Mon, 12 Aug 2024 15:42:10 +0800 Subject: [PATCH 196/416] mm: shmem: support large folio swap out Shmem will support large folio allocation [1] [2] to get better performance; however, memory reclaim still splits the precious large folios when trying to swap out shmem, which may lead to memory fragmentation and cannot take advantage of the large folio for shmem. Moreover, the swap code already supports swapping out large folios without splitting, hence this patch set supports large folio swap out for shmem. Note the i915_gem_shmem driver's folios still need to be split when swapping, thus add a new flag 'split_large_folio' for writeback_control to indicate splitting the large folio. 
[1] https://lore.kernel.org/all/cover.1717495894.git.baolin.wang@linux.alibaba.com/ [2] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/ [hughd@google.com: shmem_writepage() split folio at EOF before swapout] Link: https://lkml.kernel.org/r/aef55f8d-6040-692d-65e3-16150cce4440@google.com [baolin.wang@linux.alibaba.com: remove the wbc->split_large_folio per Hugh] Link: https://lkml.kernel.org/r/1236a002daa301b3b9ba73d6c0fab348427cf295.1724833399.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/d80c21abd20e1b0f5ca66b330f074060fb2f082d.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Signed-off-by: Hugh Dickins Cc: Barry Song Cc: Chris Li Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Kefeng Wang Cc: Lance Yang Cc: Matthew Wilcox Cc: Pankaj Raghav Cc: Ryan Roberts Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/writeback.h | 3 +++ mm/shmem.c | 28 ++++++++++++++++++++++------ mm/vmscan.c | 30 +++++++++++++++++++++++------- 3 files changed, 48 insertions(+), 13 deletions(-) diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 1a54676d843a..51278327b3c6 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -79,6 +79,9 @@ struct writeback_control { */ struct swap_iocb **swap_plug; + /* Target list for splitting a large folio */ + struct list_head *list; + /* internal fields used by the ->writepages implementation: */ struct folio_batch fbatch; pgoff_t index; diff --git a/mm/shmem.c b/mm/shmem.c index 4af6f3ea9c4f..1448a7a9a282 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -795,7 +795,6 @@ static int shmem_add_to_page_cache(struct folio *folio, VM_BUG_ON_FOLIO(index != round_down(index, nr), folio); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio); - VM_BUG_ON(expected && folio_test_large(folio)); folio_ref_add(folio, nr); folio->mapping = mapping; @@ -1460,6 +1459,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) swp_entry_t swap; pgoff_t index; int nr_pages; + bool split = false; /* * Our capabilities prevent regular writeback or sync from ever calling @@ -1478,14 +1478,26 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) goto redirty; /* - * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "always" or - * "force", drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages, - * and its shmem_writeback() needs them to be split when swapping. + * If CONFIG_THP_SWAP is not enabled, the large folio should be + * split when swapping. + * + * And shrinkage of pages beyond i_size does not split swap, so + * swapout of a large folio crossing i_size needs to split too + * (unless fallocate has been used to preallocate beyond EOF). 
*/ if (folio_test_large(folio)) { + index = shmem_fallocend(inode, + DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE)); + if ((index > folio->index && index < folio_next_index(folio)) || + !IS_ENABLED(CONFIG_THP_SWAP)) + split = true; + } + + if (split) { +try_split: /* Ensure the subpages are still dirty */ folio_test_set_dirty(folio); - if (split_huge_page(page) < 0) + if (split_huge_page_to_list_to_order(page, wbc->list, 0)) goto redirty; folio = page_folio(page); folio_clear_dirty(folio); @@ -1527,8 +1539,12 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) } swap = folio_alloc_swap(folio); - if (!swap.val) + if (!swap.val) { + if (nr_pages > 1) + goto try_split; + goto redirty; + } /* * Add inode to shmem_unuse()'s list of swapped-out inodes, diff --git a/mm/vmscan.c b/mm/vmscan.c index 243b62b9f89d..1b6542aaf81e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -628,7 +628,7 @@ typedef enum { * Calls ->writepage(). */ static pageout_t pageout(struct folio *folio, struct address_space *mapping, - struct swap_iocb **plug) + struct swap_iocb **plug, struct list_head *folio_list) { /* * If the folio is dirty, only perform writeback if that write @@ -676,6 +676,14 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping, .swap_plug = plug, }; + /* + * The large shmem folio can be split if CONFIG_THP_SWAP is + * not enabled or contiguous swap entries are failed to + * allocate. + */ + if (shmem_mapping(mapping) && folio_test_large(folio)) + wbc.list = folio_list; + folio_set_reclaim(folio); res = mapping->a_ops->writepage(&folio->page, &wbc); if (res < 0) @@ -1257,11 +1265,6 @@ retry: goto activate_locked_split; } } - } else if (folio_test_swapbacked(folio) && - folio_test_large(folio)) { - /* Split shmem folio */ - if (split_folio_to_list(folio, folio_list)) - goto keep_locked; } /* @@ -1362,12 +1365,25 @@ retry: * starts and then write it out here. */ try_to_unmap_flush_dirty(); - switch (pageout(folio, mapping, &plug)) { + switch (pageout(folio, mapping, &plug, folio_list)) { case PAGE_KEEP: goto keep_locked; case PAGE_ACTIVATE: + /* + * If shmem folio is split when writeback to swap, + * the tail pages will make their own pass through + * this function and be accounted then. + */ + if (nr_pages > 1 && !folio_test_large(folio)) { + sc->nr_scanned -= (nr_pages - 1); + nr_pages = 1; + } goto activate_locked; case PAGE_SUCCESS: + if (nr_pages > 1 && !folio_test_large(folio)) { + sc->nr_scanned -= (nr_pages - 1); + nr_pages = 1; + } stat->nr_pageout += nr_pages; if (folio_test_writeback(folio)) From 78788c3ede90727ffb7b17287468a08b4e78ee3d Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Wed, 14 Aug 2024 00:40:27 +0200 Subject: [PATCH 197/416] kasan: simplify and clarify Makefile When KASAN support was being added to the Linux kernel, GCC did not yet support all of the KASAN-related compiler options. Thus, the KASAN Makefile had to probe the compiler for supported options. Nowadays, the Linux kernel GCC version requirement is 5.1+, and thus we don't need the probing of the -fasan-shadow-offset parameter: it exists in all 5.1+ GCCs. Simplify the KASAN Makefile to drop CFLAGS_KASAN_MINIMAL. Also add a few more comments and unify the indentation. 
[andreyknvl@gmail.com: comments fixes per Miguel] Link: https://lkml.kernel.org/r/20240814161052.10374-1-andrey.konovalov@linux.dev Link: https://lkml.kernel.org/r/20240813224027.84503-1-andrey.konovalov@linux.dev Signed-off-by: Andrey Konovalov Reviewed-by: Miguel Ojeda Acked-by: Marco Elver Cc: Alexander Potapenko Cc: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Matthew Maurer Signed-off-by: Andrew Morton --- scripts/Makefile.kasan | 43 +++++++++++++++++++++--------------------- 1 file changed, 22 insertions(+), 21 deletions(-) diff --git a/scripts/Makefile.kasan b/scripts/Makefile.kasan index 390658a2d5b7..aab4154af00a 100644 --- a/scripts/Makefile.kasan +++ b/scripts/Makefile.kasan @@ -22,30 +22,31 @@ endif ifdef CONFIG_KASAN_GENERIC ifdef CONFIG_KASAN_INLINE + # When the number of memory accesses in a function is less than this + # call threshold number, the compiler will use inline instrumentation. + # 10000 is chosen offhand as a sufficiently large number to make all + # kernel functions to be instrumented inline. call_threshold := 10000 else call_threshold := 0 endif -CFLAGS_KASAN_MINIMAL := -fsanitize=kernel-address +# First, enable -fsanitize=kernel-address together with providing the shadow +# mapping offset, as for GCC, -fasan-shadow-offset fails without -fsanitize +# (GCC accepts the shadow mapping offset via -fasan-shadow-offset instead of +# a --param like the other KASAN parameters). +# Instead of ifdef-checking the compiler, rely on cc-option. +CFLAGS_KASAN := $(call cc-option, -fsanitize=kernel-address \ + -fasan-shadow-offset=$(KASAN_SHADOW_OFFSET), \ + $(call cc-option, -fsanitize=kernel-address \ + -mllvm -asan-mapping-offset=$(KASAN_SHADOW_OFFSET))) -# -fasan-shadow-offset fails without -fsanitize -CFLAGS_KASAN_SHADOW := $(call cc-option, -fsanitize=kernel-address \ - -fasan-shadow-offset=$(KASAN_SHADOW_OFFSET), \ - $(call cc-option, -fsanitize=kernel-address \ - -mllvm -asan-mapping-offset=$(KASAN_SHADOW_OFFSET))) - -ifeq ($(strip $(CFLAGS_KASAN_SHADOW)),) - CFLAGS_KASAN := $(CFLAGS_KASAN_MINIMAL) -else - # Now add all the compiler specific options that are valid standalone - CFLAGS_KASAN := $(CFLAGS_KASAN_SHADOW) \ - $(call cc-param,asan-globals=1) \ - $(call cc-param,asan-instrumentation-with-call-threshold=$(call_threshold)) \ - $(call cc-param,asan-instrument-allocas=1) -endif - -CFLAGS_KASAN += $(call cc-param,asan-stack=$(stack_enable)) +# Now, add other parameters enabled similarly in both GCC and Clang. +# As some of them are not supported by older compilers, use cc-param. +CFLAGS_KASAN += $(call cc-param,asan-instrumentation-with-call-threshold=$(call_threshold)) \ + $(call cc-param,asan-stack=$(stack_enable)) \ + $(call cc-param,asan-instrument-allocas=1) \ + $(call cc-param,asan-globals=1) # Instrument memcpy/memset/memmove calls by using instrumented __asan_mem*() # instead. With compilers that don't support this option, compiler-inserted @@ -57,9 +58,9 @@ endif # CONFIG_KASAN_GENERIC ifdef CONFIG_KASAN_SW_TAGS ifdef CONFIG_KASAN_INLINE - instrumentation_flags := $(call cc-param,hwasan-mapping-offset=$(KASAN_SHADOW_OFFSET)) + instrumentation_flags := $(call cc-param,hwasan-mapping-offset=$(KASAN_SHADOW_OFFSET)) else - instrumentation_flags := $(call cc-param,hwasan-instrument-with-calls=1) + instrumentation_flags := $(call cc-param,hwasan-instrument-with-calls=1) endif CFLAGS_KASAN := -fsanitize=kernel-hwaddress \ @@ -70,7 +71,7 @@ CFLAGS_KASAN := -fsanitize=kernel-hwaddress \ # Instrument memcpy/memset/memmove calls by using instrumented __hwasan_mem*(). 
ifeq ($(call clang-min-version, 150000)$(call gcc-min-version, 130000),y) -CFLAGS_KASAN += $(call cc-param,hwasan-kernel-mem-intrinsic-prefix=1) + CFLAGS_KASAN += $(call cc-param,hwasan-kernel-mem-intrinsic-prefix=1) endif endif # CONFIG_KASAN_SW_TAGS From f77f0c7514789577125c1b2df145703161736359 Mon Sep 17 00:00:00 2001 From: Kaiyang Zhao Date: Wed, 14 Aug 2024 17:42:27 +0000 Subject: [PATCH 198/416] mm,memcg: provide per-cgroup counters for NUMA balancing operations The ability to observe the demotion and promotion decisions made by the kernel on a per-cgroup basis is important for monitoring and tuning containerized workloads on machines equipped with tiered memory. Different containers in the system may experience drastically different memory tiering actions that cannot be distinguished from the global counters alone. For example, a container running a workload that has a much hotter memory accesses will likely see more promotions and fewer demotions, potentially depriving a colocated container of top tier memory to such an extent that its performance degrades unacceptably. For another example, some containers may exhibit longer periods between data reuse, causing much more numa_hint_faults than numa_pages_migrated. In this case, tuning hot_threshold_ms may be appropriate, but the signal can easily be lost if only global counters are available. In the long term, we hope to introduce per-cgroup control of promotion and demotion actions to implement memory placement policies in tiering. This patch set adds seven counters to memory.stat in a cgroup: numa_pages_migrated, numa_pte_updates, numa_hint_faults, pgdemote_kswapd, pgdemote_khugepaged, pgdemote_direct and pgpromote_success. pgdemote_* and pgpromote_success are also available in memory.numa_stat. count_memcg_events_mm() is added to count multiple event occurrences at once, and get_mem_cgroup_from_folio() is added because we need to get a reference to the memcg of a folio before it's migrated to track numa_pages_migrated. The accounting of PGDEMOTE_* is moved to shrink_inactive_list() before being changed to per-cgroup. [kaiyang2@cs.cmu.edu: add documentation of the memcg counters in cgroup-v2.rst] Link: https://lkml.kernel.org/r/20240814235122.252309-1-kaiyang2@cs.cmu.edu Link: https://lkml.kernel.org/r/20240814174227.30639-1-kaiyang2@cs.cmu.edu Signed-off-by: Kaiyang Zhao Cc: David Rientjes Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Shakeel Butt Cc: Wei Xu Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v2.rst | 19 +++++++++++ include/linux/memcontrol.h | 24 +++++++++++-- include/linux/vmstat.h | 1 + mm/memcontrol.c | 45 +++++++++++++++++++++++++ mm/memory.c | 3 ++ mm/mempolicy.c | 4 ++- mm/migrate.c | 7 ++-- mm/vmscan.c | 8 ++--- 8 files changed, 101 insertions(+), 10 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index f0499884124d..d3344218010c 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1617,6 +1617,25 @@ The following nested keys are defined. Usually because failed to allocate some continuous swap space for the huge page. + numa_pages_migrated (npn) + Number of pages migrated by NUMA balancing. + + numa_pte_updates (npn) + Number of pages whose page table entries are modified by + NUMA balancing to produce NUMA hinting faults on access. + + numa_hint_faults (npn) + Number of NUMA hinting faults. + + pgdemote_kswapd + Number of pages demoted by kswapd. 
+ + pgdemote_direct + Number of pages demoted directly. + + pgdemote_khugepaged + Number of pages demoted by khugepaged. + memory.numa_stat A read-only nested-keyed file which exists on non-root cgroups. diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index ed170399179a..fe05fdb92779 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -784,6 +784,8 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm); struct mem_cgroup *get_mem_cgroup_from_current(void); +struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio); + struct lruvec *folio_lruvec_lock(struct folio *folio); struct lruvec *folio_lruvec_lock_irq(struct folio *folio); struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio, @@ -1028,8 +1030,8 @@ static inline void count_memcg_folio_events(struct folio *folio, count_memcg_events(memcg, idx, nr); } -static inline void count_memcg_event_mm(struct mm_struct *mm, - enum vm_event_item idx) +static inline void count_memcg_events_mm(struct mm_struct *mm, + enum vm_event_item idx, unsigned long count) { struct mem_cgroup *memcg; @@ -1039,10 +1041,16 @@ static inline void count_memcg_event_mm(struct mm_struct *mm, rcu_read_lock(); memcg = mem_cgroup_from_task(rcu_dereference(mm->owner)); if (likely(memcg)) - count_memcg_events(memcg, idx, 1); + count_memcg_events(memcg, idx, count); rcu_read_unlock(); } +static inline void count_memcg_event_mm(struct mm_struct *mm, + enum vm_event_item idx) +{ + count_memcg_events_mm(mm, idx, 1); +} + static inline void memcg_memory_event(struct mem_cgroup *memcg, enum memcg_memory_event event) { @@ -1262,6 +1270,11 @@ static inline struct mem_cgroup *get_mem_cgroup_from_current(void) return NULL; } +static inline struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio) +{ + return NULL; +} + static inline struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css) { @@ -1484,6 +1497,11 @@ static inline void count_memcg_folio_events(struct folio *folio, { } +static inline void count_memcg_events_mm(struct mm_struct *mm, + enum vm_event_item idx, unsigned long count) +{ +} + static inline void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 9eb77c9007e6..d2761bf8ff32 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -32,6 +32,7 @@ struct reclaim_stat { unsigned nr_ref_keep; unsigned nr_unmap_fail; unsigned nr_lazyfree_fail; + unsigned nr_demoted; }; /* Stat data for system wide items */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1decbd38216e..0370ce03ce4c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -304,6 +304,12 @@ static const unsigned int memcg_node_stat_items[] = { #ifdef CONFIG_SWAP NR_SWAPCACHE, #endif +#ifdef CONFIG_NUMA_BALANCING + PGPROMOTE_SUCCESS, +#endif + PGDEMOTE_KSWAPD, + PGDEMOTE_DIRECT, + PGDEMOTE_KHUGEPAGED, }; static const unsigned int memcg_stat_items[] = { @@ -436,6 +442,11 @@ static const unsigned int memcg_vm_event_stat[] = { THP_SWPOUT, THP_SWPOUT_FALLBACK, #endif +#ifdef CONFIG_NUMA_BALANCING + NUMA_PAGE_MIGRATE, + NUMA_PTE_UPDATES, + NUMA_HINT_FAULTS, +#endif }; #define NR_MEMCG_EVENTS ARRAY_SIZE(memcg_vm_event_stat) @@ -935,6 +946,24 @@ again: return memcg; } +/** + * get_mem_cgroup_from_folio - Obtain a reference on a given folio's memcg. + * @folio: folio from which memcg should be extracted. 
+ */ +struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio) +{ + struct mem_cgroup *memcg = folio_memcg(folio); + + if (mem_cgroup_disabled()) + return NULL; + + rcu_read_lock(); + if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css))) + memcg = root_mem_cgroup; + rcu_read_unlock(); + return memcg; +} + /** * mem_cgroup_iter - iterate over memory cgroup hierarchy * @root: hierarchy root @@ -1340,6 +1369,13 @@ static const struct memory_stat memory_stats[] = { { "workingset_restore_anon", WORKINGSET_RESTORE_ANON }, { "workingset_restore_file", WORKINGSET_RESTORE_FILE }, { "workingset_nodereclaim", WORKINGSET_NODERECLAIM }, + + { "pgdemote_kswapd", PGDEMOTE_KSWAPD }, + { "pgdemote_direct", PGDEMOTE_DIRECT }, + { "pgdemote_khugepaged", PGDEMOTE_KHUGEPAGED }, +#ifdef CONFIG_NUMA_BALANCING + { "pgpromote_success", PGPROMOTE_SUCCESS }, +#endif }; /* The actual unit of the state item, not the same as the output unit */ @@ -1364,6 +1400,9 @@ static int memcg_page_state_output_unit(int item) /* * Workingset state is actually in pages, but we export it to userspace * as a scalar count of events, so special case it here. + * + * Demotion and promotion activities are exported in pages, consistent + * with their global counterparts. */ switch (item) { case WORKINGSET_REFAULT_ANON: @@ -1373,6 +1412,12 @@ static int memcg_page_state_output_unit(int item) case WORKINGSET_RESTORE_ANON: case WORKINGSET_RESTORE_FILE: case WORKINGSET_NODERECLAIM: + case PGDEMOTE_KSWAPD: + case PGDEMOTE_DIRECT: + case PGDEMOTE_KHUGEPAGED: +#ifdef CONFIG_NUMA_BALANCING + case PGPROMOTE_SUCCESS: +#endif return 1; default: return memcg_page_state_unit(item); diff --git a/mm/memory.c b/mm/memory.c index d1c741a39630..c31ea300cdf6 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5236,6 +5236,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf, vma_set_access_pid_bit(vma); count_vm_numa_event(NUMA_HINT_FAULTS); +#ifdef CONFIG_NUMA_BALANCING + count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1); +#endif if (folio_nid(folio) == numa_node_id()) { count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL); *flags |= TNF_FAULT_LOCAL; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index b3b5f376471f..b646fab3e45e 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -676,8 +676,10 @@ unsigned long change_prot_numa(struct vm_area_struct *vma, tlb_gather_mmu(&tlb, vma->vm_mm); nr_updated = change_protection(&tlb, vma, addr, end, MM_CP_PROT_NUMA); - if (nr_updated > 0) + if (nr_updated > 0) { count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated); + count_memcg_events_mm(vma->vm_mm, NUMA_PTE_UPDATES, nr_updated); + } tlb_finish_mmu(&tlb); diff --git a/mm/migrate.c b/mm/migrate.c index 66a5f73ebfdf..76cfc6c42eb3 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2614,6 +2614,8 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma, int nr_remaining; unsigned int nr_succeeded; LIST_HEAD(migratepages); + struct mem_cgroup *memcg = get_mem_cgroup_from_folio(folio); + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); list_add(&folio->lru, &migratepages); nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio, @@ -2623,12 +2625,13 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma, putback_movable_pages(&migratepages); if (nr_succeeded) { count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded); + count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded); if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && !node_is_toptier(folio_nid(folio)) && 
node_is_toptier(node)) - mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, - nr_succeeded); + mod_lruvec_state(lruvec, PGPROMOTE_SUCCESS, nr_succeeded); } + mem_cgroup_put(memcg); BUG_ON(!list_empty(&migratepages)); return nr_remaining ? -EAGAIN : 0; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 1b6542aaf81e..1b1fad0c1e11 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1016,9 +1016,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios, (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, &nr_succeeded); - mod_node_page_state(pgdat, PGDEMOTE_KSWAPD + reclaimer_offset(), - nr_succeeded); - return nr_succeeded; } @@ -1516,7 +1513,8 @@ keep: /* 'folio_list' is always empty here */ /* Migrate folios selected for demotion */ - nr_reclaimed += demote_folio_list(&demote_folios, pgdat); + stat->nr_demoted = demote_folio_list(&demote_folios, pgdat); + nr_reclaimed += stat->nr_demoted; /* Folios that could not be demoted are still in @demote_folios */ if (!list_empty(&demote_folios)) { /* Folios which weren't demoted go back on @folio_list */ @@ -1982,6 +1980,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, spin_lock_irq(&lruvec->lru_lock); move_folios_to_lru(lruvec, &folio_list); + __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(), + stat.nr_demoted); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); item = PGSTEAL_KSWAPD + reclaimer_offset(); if (!cgroup_reclaim(sc)) From e98337d11bbdfa3e3f0fb99aa93e40f97549e0cd Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Tue, 13 Aug 2024 21:54:49 -0600 Subject: [PATCH 199/416] mm/contig_alloc: support __GFP_COMP Patch series "mm/hugetlb: alloc/free gigantic folios", v2. Use __GFP_COMP for gigantic folios can greatly reduce not only the amount of code but also the allocation and free time. Approximate LOC to mm/hugetlb.c: +60, -240 Allocate and free 500 1GB hugeTLB memory without HVO by: time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages Before After Alloc ~13s ~10s Free ~15s <1s The above magnitude generally holds for multiple x86 and arm64 CPU models. Perf profile before: Alloc - 99.99% alloc_pool_huge_folio - __alloc_fresh_hugetlb_folio - 83.23% alloc_contig_pages_noprof - 47.46% alloc_contig_range_noprof - 20.96% isolate_freepages_range 16.10% split_page - 14.10% start_isolate_page_range - 12.02% undo_isolate_page_range Free - update_and_free_pages_bulk - 87.71% free_contig_range - 76.02% free_unref_page - 41.30% free_unref_page_commit - 32.58% free_pcppages_bulk - 24.75% __free_one_page 13.96% _raw_spin_trylock 12.27% __update_and_free_hugetlb_folio Perf profile after: Alloc - 99.99% alloc_pool_huge_folio alloc_gigantic_folio - alloc_contig_pages_noprof - 59.15% alloc_contig_range_noprof - 20.72% start_isolate_page_range 20.64% prep_new_page - 17.13% undo_isolate_page_range Free - update_and_free_pages_bulk - __folio_put - __free_pages_ok 7.46% free_tail_page_prepare - 1.97% free_one_page 1.86% __free_one_page This patch (of 3): Support __GFP_COMP in alloc_contig_range(). When the flag is set, upon success the function returns a large folio prepared by prep_new_page(), rather than a range of order-0 pages prepared by split_free_pages() (which is renamed from split_map_pages()). alloc_contig_range() can be used to allocate folios larger than MAX_PAGE_ORDER, e.g., gigantic hugeTLB folios. So on the free path, free_one_page() needs to handle that by split_large_buddy(). 
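A minimal usage sketch of the folio_alloc_gigantic() helper added below (illustrative only, not part of this patch; assumes 4K base pages, so a 1GB folio is order 18):

	struct folio *folio;

	/* order must be non-zero and gfp must include __GFP_COMP, else the helper WARNs */
	folio = folio_alloc_gigantic(18, GFP_KERNEL | __GFP_COMP | __GFP_NOWARN,
				     numa_node_id(), NULL);
	if (folio)
		folio_put(folio);	/* paired with folio_put(), not free_contig_range() */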
[akpm@linux-foundation.org: fix folio_alloc_gigantic_noprof() WARN expression, per Yu Liao] Link: https://lkml.kernel.org/r/20240814035451.773331-1-yuzhao@google.com Link: https://lkml.kernel.org/r/20240814035451.773331-2-yuzhao@google.com Signed-off-by: Yu Zhao Acked-by: Zi Yan Cc: Matthew Wilcox (Oracle) Cc: Muchun Song Cc: Frank van der Linden Signed-off-by: Andrew Morton --- include/linux/gfp.h | 23 +++++++++ mm/compaction.c | 41 ++-------------- mm/page_alloc.c | 111 +++++++++++++++++++++++++++++++------------- 3 files changed, 108 insertions(+), 67 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index f53f76e0b17e..03ba9563c6db 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -446,4 +446,27 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_ #endif void free_contig_range(unsigned long pfn, unsigned long nr_pages); +#ifdef CONFIG_CONTIG_ALLOC +static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp, + int nid, nodemask_t *node) +{ + struct page *page; + + if (WARN_ON(!order || !(gfp & __GFP_COMP))) + return NULL; + + page = alloc_contig_pages_noprof(1 << order, gfp, nid, node); + + return page ? page_folio(page) : NULL; +} +#else +static inline struct folio *folio_alloc_gigantic_noprof(int order, gfp_t gfp, + int nid, nodemask_t *node) +{ + return NULL; +} +#endif +/* This should be paired with folio_put() rather than free_contig_range(). */ +#define folio_alloc_gigantic(...) alloc_hooks(folio_alloc_gigantic_noprof(__VA_ARGS__)) + #endif /* __LINUX_GFP_H */ diff --git a/mm/compaction.c b/mm/compaction.c index eb95e9b435d0..d1041fbce679 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -86,33 +86,6 @@ static struct page *mark_allocated_noprof(struct page *page, unsigned int order, } #define mark_allocated(...) alloc_hooks(mark_allocated_noprof(__VA_ARGS__)) -static void split_map_pages(struct list_head *freepages) -{ - unsigned int i, order; - struct page *page, *next; - LIST_HEAD(tmp_list); - - for (order = 0; order < NR_PAGE_ORDERS; order++) { - list_for_each_entry_safe(page, next, &freepages[order], lru) { - unsigned int nr_pages; - - list_del(&page->lru); - - nr_pages = 1 << order; - - mark_allocated(page, order, __GFP_MOVABLE); - if (order) - split_page(page, order); - - for (i = 0; i < nr_pages; i++) { - list_add(&page->lru, &tmp_list); - page++; - } - } - list_splice_init(&tmp_list, &freepages[0]); - } -} - static unsigned long release_free_list(struct list_head *freepages) { int order; @@ -742,11 +715,11 @@ isolate_fail: * * Non-free pages, invalid PFNs, or zone boundaries within the * [start_pfn, end_pfn) range are considered errors, cause function to - * undo its actions and return zero. + * undo its actions and return zero. cc->freepages[] are empty. * * Otherwise, function returns one-past-the-last PFN of isolated page * (which may be greater then end_pfn if end fell in a middle of - * a free page). + * a free page). cc->freepages[] contain free pages isolated. 
*/ unsigned long isolate_freepages_range(struct compact_control *cc, @@ -754,10 +727,9 @@ isolate_freepages_range(struct compact_control *cc, { unsigned long isolated, pfn, block_start_pfn, block_end_pfn; int order; - struct list_head tmp_freepages[NR_PAGE_ORDERS]; for (order = 0; order < NR_PAGE_ORDERS; order++) - INIT_LIST_HEAD(&tmp_freepages[order]); + INIT_LIST_HEAD(&cc->freepages[order]); pfn = start_pfn; block_start_pfn = pageblock_start_pfn(pfn); @@ -788,7 +760,7 @@ isolate_freepages_range(struct compact_control *cc, break; isolated = isolate_freepages_block(cc, &isolate_start_pfn, - block_end_pfn, tmp_freepages, 0, true); + block_end_pfn, cc->freepages, 0, true); /* * In strict mode, isolate_freepages_block() returns 0 if @@ -807,13 +779,10 @@ isolate_freepages_range(struct compact_control *cc, if (pfn < end_pfn) { /* Loop terminated early, cleanup. */ - release_free_list(tmp_freepages); + release_free_list(cc->freepages); return 0; } - /* __isolate_free_page() does not map the pages */ - split_map_pages(tmp_freepages); - /* We don't use freelists for anything. */ return pfn; } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 56a93805561a..da603080214c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1196,16 +1196,36 @@ static void free_pcppages_bulk(struct zone *zone, int count, spin_unlock_irqrestore(&zone->lock, flags); } +/* Split a multi-block free page into its individual pageblocks. */ +static void split_large_buddy(struct zone *zone, struct page *page, + unsigned long pfn, int order, fpi_t fpi) +{ + unsigned long end = pfn + (1 << order); + + VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order)); + /* Caller removed page from freelist, buddy info cleared! */ + VM_WARN_ON_ONCE(PageBuddy(page)); + + if (order > pageblock_order) + order = pageblock_order; + + while (pfn != end) { + int mt = get_pfnblock_migratetype(page, pfn); + + __free_one_page(page, pfn, zone, order, mt, fpi); + pfn += 1 << order; + page = pfn_to_page(pfn); + } +} + static void free_one_page(struct zone *zone, struct page *page, unsigned long pfn, unsigned int order, fpi_t fpi_flags) { unsigned long flags; - int migratetype; spin_lock_irqsave(&zone->lock, flags); - migratetype = get_pfnblock_migratetype(page, pfn); - __free_one_page(page, pfn, zone, order, migratetype, fpi_flags); + split_large_buddy(zone, page, pfn, order, fpi_flags); spin_unlock_irqrestore(&zone->lock, flags); } @@ -1697,27 +1717,6 @@ static unsigned long find_large_buddy(unsigned long start_pfn) return start_pfn; } -/* Split a multi-block free page into its individual pageblocks */ -static void split_large_buddy(struct zone *zone, struct page *page, - unsigned long pfn, int order) -{ - unsigned long end_pfn = pfn + (1 << order); - - VM_WARN_ON_ONCE(order <= pageblock_order); - VM_WARN_ON_ONCE(pfn & (pageblock_nr_pages - 1)); - - /* Caller removed page from freelist, buddy info cleared! 
*/ - VM_WARN_ON_ONCE(PageBuddy(page)); - - while (pfn != end_pfn) { - int mt = get_pfnblock_migratetype(page, pfn); - - __free_one_page(page, pfn, zone, pageblock_order, mt, FPI_NONE); - pfn += pageblock_nr_pages; - page = pfn_to_page(pfn); - } -} - /** * move_freepages_block_isolate - move free pages in block for page isolation * @zone: the zone @@ -1758,7 +1757,7 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page, del_page_from_free_list(buddy, zone, order, get_pfnblock_migratetype(buddy, pfn)); set_pageblock_migratetype(page, migratetype); - split_large_buddy(zone, buddy, pfn, order); + split_large_buddy(zone, buddy, pfn, order, FPI_NONE); return true; } @@ -1769,7 +1768,7 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page, del_page_from_free_list(page, zone, order, get_pfnblock_migratetype(page, pfn)); set_pageblock_migratetype(page, migratetype); - split_large_buddy(zone, page, pfn, order); + split_large_buddy(zone, page, pfn, order, FPI_NONE); return true; } move: @@ -6437,6 +6436,31 @@ int __alloc_contig_migrate_range(struct compact_control *cc, return (ret < 0) ? ret : 0; } +static void split_free_pages(struct list_head *list) +{ + int order; + + for (order = 0; order < NR_PAGE_ORDERS; order++) { + struct page *page, *next; + int nr_pages = 1 << order; + + list_for_each_entry_safe(page, next, &list[order], lru) { + int i; + + post_alloc_hook(page, order, __GFP_MOVABLE); + if (!order) + continue; + + split_page(page, order); + + /* Add all subpages to the order-0 head, in sequence. */ + list_del(&page->lru); + for (i = 0; i < nr_pages; i++) + list_add_tail(&page[i].lru, &list[0]); + } + } +} + /** * alloc_contig_range() -- tries to allocate given range of pages * @start: start PFN to allocate @@ -6549,12 +6573,25 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, goto done; } - /* Free head and tail (if any) */ - if (start != outer_start) - free_contig_range(outer_start, start - outer_start); - if (end != outer_end) - free_contig_range(end, outer_end - end); + if (!(gfp_mask & __GFP_COMP)) { + split_free_pages(cc.freepages); + /* Free head and tail (if any) */ + if (start != outer_start) + free_contig_range(outer_start, start - outer_start); + if (end != outer_end) + free_contig_range(end, outer_end - end); + } else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) { + struct page *head = pfn_to_page(start); + int order = ilog2(end - start); + + check_new_pages(head, order); + prep_new_page(head, order, gfp_mask, 0); + } else { + ret = -EINVAL; + WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n", + start, end, outer_start, outer_end); + } done: undo_isolate_page_range(start, end, migratetype); return ret; @@ -6663,6 +6700,18 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask, void free_contig_range(unsigned long pfn, unsigned long nr_pages) { unsigned long count = 0; + struct folio *folio = pfn_folio(pfn); + + if (folio_test_large(folio)) { + int expected = folio_nr_pages(folio); + + if (nr_pages == expected) + folio_put(folio); + else + WARN(true, "PFN %lu: nr_pages %lu != expected %d\n", + pfn, nr_pages, expected); + return; + } for (; nr_pages--; pfn++) { struct page *page = pfn_to_page(pfn); From 463586e9ff398f951a9a63d98a3b93f44434d20c Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Tue, 13 Aug 2024 21:54:50 -0600 Subject: [PATCH 200/416] mm/cma: add cma_{alloc,free}_folio() With alloc_contig_range() and free_contig_range() supporting large 
folios, CMA can allocate and free large folios too, by cma_alloc_folio() and cma_free_folio(). [yuzhao@google.com: fix WARN in cma_alloc_folio()] Link: https://lkml.kernel.org/r/Zsd0PgAQmbpR8jS6@google.com Link: https://lkml.kernel.org/r/20240814035451.773331-3-yuzhao@google.com Signed-off-by: Yu Zhao Acked-by: Zi Yan Cc: Frank van der Linden Cc: Matthew Wilcox (Oracle) Cc: Muchun Song Signed-off-by: Andrew Morton --- include/linux/cma.h | 16 +++++++++++++ mm/cma.c | 55 ++++++++++++++++++++++++++++++++------------- 2 files changed, 56 insertions(+), 15 deletions(-) diff --git a/include/linux/cma.h b/include/linux/cma.h index 9db877506ea8..d15b64f51336 100644 --- a/include/linux/cma.h +++ b/include/linux/cma.h @@ -52,4 +52,20 @@ extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data); extern void cma_reserve_pages_on_error(struct cma *cma); + +#ifdef CONFIG_CMA +struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp); +bool cma_free_folio(struct cma *cma, const struct folio *folio); +#else +static inline struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp) +{ + return NULL; +} + +static inline bool cma_free_folio(struct cma *cma, const struct folio *folio) +{ + return false; +} +#endif + #endif diff --git a/mm/cma.c b/mm/cma.c index 95d6950e177b..2d9fae939283 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -403,18 +403,8 @@ static void cma_debug_show_areas(struct cma *cma) spin_unlock_irq(&cma->lock); } -/** - * cma_alloc() - allocate pages from contiguous area - * @cma: Contiguous memory region for which the allocation is performed. - * @count: Requested number of pages. - * @align: Requested alignment of pages (in PAGE_SIZE order). - * @no_warn: Avoid printing message about failed allocation - * - * This function allocates part of contiguous memory on specific - * contiguous memory area. - */ -struct page *cma_alloc(struct cma *cma, unsigned long count, - unsigned int align, bool no_warn) +static struct page *__cma_alloc(struct cma *cma, unsigned long count, + unsigned int align, gfp_t gfp) { unsigned long mask, offset; unsigned long pfn = -1; @@ -463,8 +453,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count, pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit); mutex_lock(&cma_mutex); - ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, - GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0)); + ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp); mutex_unlock(&cma_mutex); if (ret == 0) { page = pfn_to_page(pfn); @@ -494,7 +483,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count, page_kasan_tag_reset(nth_page(page, i)); } - if (ret && !no_warn) { + if (ret && !(gfp & __GFP_NOWARN)) { pr_err_ratelimited("%s: %s: alloc failed, req-size: %lu pages, ret: %d\n", __func__, cma->name, count, ret); cma_debug_show_areas(cma); @@ -513,6 +502,34 @@ struct page *cma_alloc(struct cma *cma, unsigned long count, return page; } +/** + * cma_alloc() - allocate pages from contiguous area + * @cma: Contiguous memory region for which the allocation is performed. + * @count: Requested number of pages. + * @align: Requested alignment of pages (in PAGE_SIZE order). + * @no_warn: Avoid printing message about failed allocation + * + * This function allocates part of contiguous memory on specific + * contiguous memory area. 
+ */ +struct page *cma_alloc(struct cma *cma, unsigned long count, + unsigned int align, bool no_warn) +{ + return __cma_alloc(cma, count, align, GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0)); +} + +struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp) +{ + struct page *page; + + if (WARN_ON(!order || !(gfp & __GFP_COMP))) + return NULL; + + page = __cma_alloc(cma, 1 << order, order, gfp); + + return page ? page_folio(page) : NULL; +} + bool cma_pages_valid(struct cma *cma, const struct page *pages, unsigned long count) { @@ -564,6 +581,14 @@ bool cma_release(struct cma *cma, const struct page *pages, return true; } +bool cma_free_folio(struct cma *cma, const struct folio *folio) +{ + if (WARN_ON(!folio_test_large(folio))) + return false; + + return cma_release(cma, &folio->page, folio_nr_pages(folio)); +} + int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data) { int i; From cf54f310d0d313bce6505d010a555948b0ae0e2a Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Tue, 13 Aug 2024 21:54:51 -0600 Subject: [PATCH 201/416] mm/hugetlb: use __GFP_COMP for gigantic folios Use __GFP_COMP for gigantic folios to greatly reduce not only the amount of code but also the allocation and free time. LOC (approximately): +60, -240 Allocate and free 500 1GB hugeTLB memory without HVO by: time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages Before After Alloc ~13s ~10s Free ~15s <1s The above magnitude generally holds for multiple x86 and arm64 CPU models. Link: https://lkml.kernel.org/r/20240814035451.773331-4-yuzhao@google.com Signed-off-by: Yu Zhao Reported-by: Frank van der Linden Acked-by: Zi Yan Cc: Matthew Wilcox (Oracle) Cc: Muchun Song Signed-off-by: Andrew Morton --- include/linux/hugetlb.h | 9 +- mm/hugetlb.c | 290 ++++++++-------------------------------- 2 files changed, 61 insertions(+), 238 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 3100a52ceb73..98c47c394b89 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -896,10 +896,11 @@ static inline bool hugepage_movable_supported(struct hstate *h) /* Movability of hugepages depends on migration support. */ static inline gfp_t htlb_alloc_mask(struct hstate *h) { - if (hugepage_movable_supported(h)) - return GFP_HIGHUSER_MOVABLE; - else - return GFP_HIGHUSER; + gfp_t gfp = __GFP_COMP | __GFP_NOWARN; + + gfp |= hugepage_movable_supported(h) ? 
GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER; + + return gfp; } static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d2b9555e6c45..4461d27f7453 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -56,16 +56,6 @@ struct hstate hstates[HUGE_MAX_HSTATE]; #ifdef CONFIG_CMA static struct cma *hugetlb_cma[MAX_NUMNODES]; static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata; -static bool hugetlb_cma_folio(struct folio *folio, unsigned int order) -{ - return cma_pages_valid(hugetlb_cma[folio_nid(folio)], &folio->page, - 1 << order); -} -#else -static bool hugetlb_cma_folio(struct folio *folio, unsigned int order) -{ - return false; -} #endif static unsigned long hugetlb_cma_size __initdata; @@ -100,6 +90,17 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma, unsigned long start, unsigned long end); static struct resv_map *vma_resv_map(struct vm_area_struct *vma); +static void hugetlb_free_folio(struct folio *folio) +{ +#ifdef CONFIG_CMA + int nid = folio_nid(folio); + + if (cma_free_folio(hugetlb_cma[nid], folio)) + return; +#endif + folio_put(folio); +} + static inline bool subpool_is_free(struct hugepage_subpool *spool) { if (spool->count) @@ -1512,95 +1513,54 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed) ((node = hstate_next_node_to_free(hs, mask)) || 1); \ nr_nodes--) -/* used to demote non-gigantic_huge pages as well */ -static void __destroy_compound_gigantic_folio(struct folio *folio, - unsigned int order, bool demote) -{ - int i; - int nr_pages = 1 << order; - struct page *p; - - atomic_set(&folio->_entire_mapcount, 0); - atomic_set(&folio->_large_mapcount, 0); - atomic_set(&folio->_pincount, 0); - - for (i = 1; i < nr_pages; i++) { - p = folio_page(folio, i); - p->flags &= ~PAGE_FLAGS_CHECK_AT_FREE; - p->mapping = NULL; - clear_compound_head(p); - if (!demote) - set_page_refcounted(p); - } - - __folio_clear_head(folio); -} - -static void destroy_compound_hugetlb_folio_for_demote(struct folio *folio, - unsigned int order) -{ - __destroy_compound_gigantic_folio(folio, order, true); -} - #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE -static void destroy_compound_gigantic_folio(struct folio *folio, - unsigned int order) -{ - __destroy_compound_gigantic_folio(folio, order, false); -} - -static void free_gigantic_folio(struct folio *folio, unsigned int order) -{ - /* - * If the page isn't allocated using the cma allocator, - * cma_release() returns false. 
- */ -#ifdef CONFIG_CMA - int nid = folio_nid(folio); - - if (cma_release(hugetlb_cma[nid], &folio->page, 1 << order)) - return; -#endif - - free_contig_range(folio_pfn(folio), 1 << order); -} - #ifdef CONFIG_CONTIG_ALLOC static struct folio *alloc_gigantic_folio(struct hstate *h, gfp_t gfp_mask, int nid, nodemask_t *nodemask) { - struct page *page; - unsigned long nr_pages = pages_per_huge_page(h); + struct folio *folio; + int order = huge_page_order(h); + bool retried = false; + if (nid == NUMA_NO_NODE) nid = numa_mem_id(); - +retry: + folio = NULL; #ifdef CONFIG_CMA { int node; - if (hugetlb_cma[nid]) { - page = cma_alloc(hugetlb_cma[nid], nr_pages, - huge_page_order(h), true); - if (page) - return page_folio(page); - } + if (hugetlb_cma[nid]) + folio = cma_alloc_folio(hugetlb_cma[nid], order, gfp_mask); - if (!(gfp_mask & __GFP_THISNODE)) { + if (!folio && !(gfp_mask & __GFP_THISNODE)) { for_each_node_mask(node, *nodemask) { if (node == nid || !hugetlb_cma[node]) continue; - page = cma_alloc(hugetlb_cma[node], nr_pages, - huge_page_order(h), true); - if (page) - return page_folio(page); + folio = cma_alloc_folio(hugetlb_cma[node], order, gfp_mask); + if (folio) + break; } } } #endif + if (!folio) { + folio = folio_alloc_gigantic(order, gfp_mask, nid, nodemask); + if (!folio) + return NULL; + } - page = alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask); - return page ? page_folio(page) : NULL; + if (folio_ref_freeze(folio, 1)) + return folio; + + pr_warn("HugeTLB: unexpected refcount on PFN %lu\n", folio_pfn(folio)); + hugetlb_free_folio(folio); + if (!retried) { + retried = true; + goto retry; + } + return NULL; } #else /* !CONFIG_CONTIG_ALLOC */ @@ -1617,10 +1577,6 @@ static struct folio *alloc_gigantic_folio(struct hstate *h, gfp_t gfp_mask, { return NULL; } -static inline void free_gigantic_folio(struct folio *folio, - unsigned int order) { } -static inline void destroy_compound_gigantic_folio(struct folio *folio, - unsigned int order) { } #endif /* @@ -1748,18 +1704,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, folio_ref_unfreeze(folio, 1); - /* - * Non-gigantic pages demoted from CMA allocated gigantic pages - * need to be given back to CMA in free_gigantic_folio. - */ - if (hstate_is_gigantic(h) || - hugetlb_cma_folio(folio, huge_page_order(h))) { - destroy_compound_gigantic_folio(folio, huge_page_order(h)); - free_gigantic_folio(folio, huge_page_order(h)); - } else { - INIT_LIST_HEAD(&folio->_deferred_list); - folio_put(folio); - } + INIT_LIST_HEAD(&folio->_deferred_list); + hugetlb_free_folio(folio); } /* @@ -2032,95 +1978,6 @@ static void prep_new_hugetlb_folio(struct hstate *h, struct folio *folio, int ni spin_unlock_irq(&hugetlb_lock); } -static bool __prep_compound_gigantic_folio(struct folio *folio, - unsigned int order, bool demote) -{ - int i, j; - int nr_pages = 1 << order; - struct page *p; - - __folio_clear_reserved(folio); - for (i = 0; i < nr_pages; i++) { - p = folio_page(folio, i); - - /* - * For gigantic hugepages allocated through bootmem at - * boot, it's safer to be consistent with the not-gigantic - * hugepages and clear the PG_reserved bit from all tail pages - * too. Otherwise drivers using get_user_pages() to access tail - * pages may get the reference counting wrong if they see - * PG_reserved set on a tail page (despite the head page not - * having PG_reserved set). 
Enforcing this consistency between - * head and tail pages allows drivers to optimize away a check - * on the head page when they need know if put_page() is needed - * after get_user_pages(). - */ - if (i != 0) /* head page cleared above */ - __ClearPageReserved(p); - /* - * Subtle and very unlikely - * - * Gigantic 'page allocators' such as memblock or cma will - * return a set of pages with each page ref counted. We need - * to turn this set of pages into a compound page with tail - * page ref counts set to zero. Code such as speculative page - * cache adding could take a ref on a 'to be' tail page. - * We need to respect any increased ref count, and only set - * the ref count to zero if count is currently 1. If count - * is not 1, we return an error. An error return indicates - * the set of pages can not be converted to a gigantic page. - * The caller who allocated the pages should then discard the - * pages using the appropriate free interface. - * - * In the case of demote, the ref count will be zero. - */ - if (!demote) { - if (!page_ref_freeze(p, 1)) { - pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n"); - goto out_error; - } - } else { - VM_BUG_ON_PAGE(page_count(p), p); - } - if (i != 0) - set_compound_head(p, &folio->page); - } - __folio_set_head(folio); - /* we rely on prep_new_hugetlb_folio to set the hugetlb flag */ - folio_set_order(folio, order); - atomic_set(&folio->_entire_mapcount, -1); - atomic_set(&folio->_large_mapcount, -1); - atomic_set(&folio->_pincount, 0); - return true; - -out_error: - /* undo page modifications made above */ - for (j = 0; j < i; j++) { - p = folio_page(folio, j); - if (j != 0) - clear_compound_head(p); - set_page_refcounted(p); - } - /* need to clear PG_reserved on remaining tail pages */ - for (; j < nr_pages; j++) { - p = folio_page(folio, j); - __ClearPageReserved(p); - } - return false; -} - -static bool prep_compound_gigantic_folio(struct folio *folio, - unsigned int order) -{ - return __prep_compound_gigantic_folio(folio, order, false); -} - -static bool prep_compound_gigantic_folio_for_demote(struct folio *folio, - unsigned int order) -{ - return __prep_compound_gigantic_folio(folio, order, true); -} - /* * Find and lock address space (mapping) in write mode. * @@ -2159,7 +2016,6 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h, */ if (node_alloc_noretry && node_isset(nid, *node_alloc_noretry)) alloc_try_hard = false; - gfp_mask |= __GFP_COMP|__GFP_NOWARN; if (alloc_try_hard) gfp_mask |= __GFP_RETRY_MAYFAIL; if (nid == NUMA_NO_NODE) @@ -2206,48 +2062,16 @@ retry: return folio; } -static struct folio *__alloc_fresh_hugetlb_folio(struct hstate *h, - gfp_t gfp_mask, int nid, nodemask_t *nmask, - nodemask_t *node_alloc_noretry) -{ - struct folio *folio; - bool retry = false; - -retry: - if (hstate_is_gigantic(h)) - folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask); - else - folio = alloc_buddy_hugetlb_folio(h, gfp_mask, - nid, nmask, node_alloc_noretry); - if (!folio) - return NULL; - - if (hstate_is_gigantic(h)) { - if (!prep_compound_gigantic_folio(folio, huge_page_order(h))) { - /* - * Rare failure to convert pages to compound page. - * Free pages and try again - ONCE! 
- */ - free_gigantic_folio(folio, huge_page_order(h)); - if (!retry) { - retry = true; - goto retry; - } - return NULL; - } - } - - return folio; -} - static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h, gfp_t gfp_mask, int nid, nodemask_t *nmask, nodemask_t *node_alloc_noretry) { struct folio *folio; - folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, - node_alloc_noretry); + if (hstate_is_gigantic(h)) + folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask); + else + folio = alloc_buddy_hugetlb_folio(h, gfp_mask, nid, nmask, node_alloc_noretry); if (folio) init_new_hugetlb_folio(h, folio); return folio; @@ -2265,7 +2089,10 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h, { struct folio *folio; - folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL); + if (hstate_is_gigantic(h)) + folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask); + else + folio = alloc_buddy_hugetlb_folio(h, gfp_mask, nid, nmask, NULL); if (!folio) return NULL; @@ -2549,9 +2376,8 @@ struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h, nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask); if (mpol_is_preferred_many(mpol)) { - gfp_t gfp = gfp_mask | __GFP_NOWARN; + gfp_t gfp = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL); - gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL); folio = alloc_surplus_hugetlb_folio(h, gfp, nid, nodemask); /* Fallback to all nodes if page==NULL */ @@ -3333,6 +3159,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) { struct page *page = pfn_to_page(pfn); + __ClearPageReserved(folio_page(folio, pfn - head_pfn)); __init_single_page(page, pfn, zone, nid); prep_compound_tail((struct page *)folio, pfn - head_pfn); ret = page_ref_freeze(page, 1); @@ -3949,21 +3776,16 @@ static long demote_free_hugetlb_folios(struct hstate *src, struct hstate *dst, continue; list_del(&folio->lru); - /* - * Use destroy_compound_hugetlb_folio_for_demote for all huge page - * sizes as it will not ref count folios. - */ - destroy_compound_hugetlb_folio_for_demote(folio, huge_page_order(src)); + + split_page_owner(&folio->page, huge_page_order(src), huge_page_order(dst)); + pgalloc_tag_split(&folio->page, 1 << huge_page_order(src)); for (i = 0; i < pages_per_huge_page(src); i += pages_per_huge_page(dst)) { struct page *page = folio_page(folio, i); - if (hstate_is_gigantic(dst)) - prep_compound_gigantic_folio_for_demote(page_folio(page), - dst->order); - else - prep_compound_page(page, dst->order); - set_page_private(page, 0); + page->mapping = NULL; + clear_compound_head(page); + prep_compound_page(page, dst->order); init_new_hugetlb_folio(dst, page_folio(page)); list_add(&page->lru, &dst_list); From d0b003ce97ad6518dea4b0ed70081df95de04179 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 16 Aug 2024 12:32:46 +0200 Subject: [PATCH 202/416] mm/rmap: use folio->_mapcount for small folios We have some cases left whereby we operate on small folios and still refer to page->_mapcount. Let's just use folio->_mapcount instead, which currently still overlays page->_mapcount, so no change. This change will make it easier to later spot any remaining users of page->_mapcount that target tail pages. 
Link: https://lkml.kernel.org/r/20240816103246.719209-1-david@redhat.com Signed-off-by: David Hildenbrand Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- include/linux/rmap.h | 4 ++-- mm/rmap.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 0978c64f49d8..91b5935e8485 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -331,7 +331,7 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio, switch (level) { case RMAP_LEVEL_PTE: if (!folio_test_large(folio)) { - atomic_inc(&page->_mapcount); + atomic_inc(&folio->_mapcount); break; } @@ -425,7 +425,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio, if (!folio_test_large(folio)) { if (PageAnonExclusive(page)) ClearPageAnonExclusive(page); - atomic_inc(&page->_mapcount); + atomic_inc(&folio->_mapcount); break; } diff --git a/mm/rmap.c b/mm/rmap.c index a6b9cd0b2b18..1103a536e474 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1165,7 +1165,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio, switch (level) { case RMAP_LEVEL_PTE: if (!folio_test_large(folio)) { - nr = atomic_inc_and_test(&page->_mapcount); + nr = atomic_inc_and_test(&folio->_mapcount); break; } @@ -1535,7 +1535,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, switch (level) { case RMAP_LEVEL_PTE: if (!folio_test_large(folio)) { - nr = atomic_add_negative(-1, &page->_mapcount); + nr = atomic_add_negative(-1, &folio->_mapcount); break; } From 1a83a716ec233990e1fd5b6fbb1200ade63bf450 Mon Sep 17 00:00:00 2001 From: Danilo Krummrich Date: Tue, 13 Aug 2024 00:34:34 +0200 Subject: [PATCH 203/416] mm: krealloc: consider spare memory for __GFP_ZERO As long as krealloc() is called with __GFP_ZERO consistently, starting with the initial memory allocation, __GFP_ZERO should be fully honored. However, if for an existing allocation krealloc() is called with a decreased size, it is not ensured that the spare portion the allocation is zeroed. Thus, if krealloc() is subsequently called with a larger size again, __GFP_ZERO can't be fully honored, since we don't know the previous size, but only the bucket size. Example: buf = kzalloc(64, GFP_KERNEL); memset(buf, 0xff, 64); buf = krealloc(buf, 48, GFP_KERNEL | __GFP_ZERO); /* After this call the last 16 bytes are still 0xff. */ buf = krealloc(buf, 64, GFP_KERNEL | __GFP_ZERO); Fix this, by explicitly setting spare memory to zero, when shrinking an allocation with __GFP_ZERO flag set or init_on_alloc enabled. Link: https://lkml.kernel.org/r/20240812223707.32049-1-dakr@kernel.org Signed-off-by: Danilo Krummrich Acked-by: Vlastimil Babka Acked-by: David Rientjes Cc: Christoph Lameter Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim Cc: Pekka Enberg Cc: Roman Gushchin Cc: Signed-off-by: Andrew Morton --- mm/slab_common.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/mm/slab_common.c b/mm/slab_common.c index 40b582a014b8..cff602cedf8e 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -1273,6 +1273,13 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags) /* If the object still fits, repoison it precisely. */ if (ks >= new_size) { + /* Zero out spare memory. 
*/ + if (want_init_on_alloc(flags)) { + kasan_disable_current(); + memset((void *)p + new_size, 0, ks - new_size); + kasan_enable_current(); + } + p = kasan_krealloc((void *)p, new_size, flags); return (void *)p; } From 489a744e5fb16503b495ad1fa3dd1abf25b224e7 Mon Sep 17 00:00:00 2001 From: Danilo Krummrich Date: Tue, 13 Aug 2024 00:34:35 +0200 Subject: [PATCH 204/416] mm: krealloc: clarify valid usage of __GFP_ZERO Properly document that if __GFP_ZERO logic is requested, callers must ensure that, starting with the initial memory allocation, every subsequent call to this API for the same memory allocation is flagged with __GFP_ZERO. Otherwise, it is possible that __GFP_ZERO is not fully honored by this API. Link: https://lkml.kernel.org/r/20240812223707.32049-2-dakr@kernel.org Signed-off-by: Danilo Krummrich Acked-by: David Rientjes Cc: Christoph Lameter Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim Cc: Pekka Enberg Cc: Roman Gushchin Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/slab.h | 10 ++++++++++ mm/slab_common.c | 20 ++++++++++++++++++-- 2 files changed, 28 insertions(+), 2 deletions(-) diff --git a/include/linux/slab.h b/include/linux/slab.h index c9cb42203183..2282e67a01c7 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -733,6 +733,16 @@ static inline __alloc_size(1, 2) void *kmalloc_array_noprof(size_t n, size_t siz * @new_n: new number of elements to alloc * @new_size: new size of a single member of the array * @flags: the type of memory to allocate (see kmalloc) + * + * If __GFP_ZERO logic is requested, callers must ensure that, starting with the + * initial memory allocation, every subsequent call to this API for the same + * memory allocation is flagged with __GFP_ZERO. Otherwise, it is possible that + * __GFP_ZERO is not fully honored by this API. + * + * See krealloc_noprof() for further details. + * + * In any case, the contents of the object pointed to are preserved up to the + * lesser of the new and old sizes. */ static inline __realloc_size(2, 3) void * __must_check krealloc_array_noprof(void *p, size_t new_n, diff --git a/mm/slab_common.c b/mm/slab_common.c index cff602cedf8e..1b380eb3b4f2 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -1301,11 +1301,27 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags) * @new_size: how many bytes of memory are required. * @flags: the type of memory to allocate. * - * The contents of the object pointed to are preserved up to the - * lesser of the new and old sizes (__GFP_ZERO flag is effectively ignored). * If @p is %NULL, krealloc() behaves exactly like kmalloc(). If @new_size * is 0 and @p is not a %NULL pointer, the object pointed to is freed. * + * If __GFP_ZERO logic is requested, callers must ensure that, starting with the + * initial memory allocation, every subsequent call to this API for the same + * memory allocation is flagged with __GFP_ZERO. Otherwise, it is possible that + * __GFP_ZERO is not fully honored by this API. + * + * This is the case, since krealloc() only knows about the bucket size of an + * allocation (but not the exact size it was allocated with) and hence + * implements the following semantics for shrinking and growing buffers with + * __GFP_ZERO. + * + * new bucket + * 0 size size + * |--------|----------------| + * | keep | zero | + * + * In any case, the contents of the object pointed to are preserved up to the + * lesser of the new and old sizes. 
+ * * Return: pointer to the allocated memory or %NULL in case of error */ void *krealloc_noprof(const void *p, size_t new_size, gfp_t flags) From 2f4db2861013fcfb4be11d0dc176271ae566241c Mon Sep 17 00:00:00 2001 From: Jinjiang Tu Date: Mon, 19 Aug 2024 21:06:09 +0800 Subject: [PATCH 205/416] selftests/mm: remove unnecessary ia64 code and comment IA64 has gone with commit cf8e8658100d ("arch: Remove Itanium (IA-64) architecture"), so remove unnecessary ia64 special mm code and comment in selftests too. Link: https://lkml.kernel.org/r/20240819130609.3386195-1-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu Cc: Kefeng Wang Cc: Mike Rapoport Cc: Nanyong Sun Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/Makefile | 2 +- tools/testing/selftests/mm/hugepage-mmap.c | 18 +----------------- tools/testing/selftests/mm/hugepage-shm.c | 18 +----------------- tools/testing/selftests/mm/hugepage-vmemmap.c | 17 ++--------------- tools/testing/selftests/mm/map_hugetlb.c | 18 ++---------------- tools/testing/selftests/mm/run_vmtests.sh | 2 +- 6 files changed, 8 insertions(+), 67 deletions(-) diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index cfad627e8d94..0533c5ee0b3e 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -112,7 +112,7 @@ endif endif -ifneq (,$(filter $(ARCH),arm64 ia64 mips64 parisc64 powerpc riscv64 s390x sparc64 x86_64 s390)) +ifneq (,$(filter $(ARCH),arm64 mips64 parisc64 powerpc riscv64 s390x sparc64 x86_64 s390)) TEST_GEN_FILES += va_high_addr_switch TEST_GEN_FILES += virtual_address_range TEST_GEN_FILES += write_to_hugetlbfs diff --git a/tools/testing/selftests/mm/hugepage-mmap.c b/tools/testing/selftests/mm/hugepage-mmap.c index 267eea2e0e0b..3b1b532f1cbb 100644 --- a/tools/testing/selftests/mm/hugepage-mmap.c +++ b/tools/testing/selftests/mm/hugepage-mmap.c @@ -8,13 +8,6 @@ * like /mnt) using the command mount -t hugetlbfs nodev /mnt. In this * example, the app is requesting memory of size 256MB that is backed by * huge pages. - * - * For the ia64 architecture, the Linux kernel reserves Region number 4 for - * huge pages. That means that if one requires a fixed address, a huge page - * aligned address starting with 0x800000... will be required. If a fixed - * address is not required, the kernel will select an address in the proper - * range. - * Other architectures, such as ppc64, i386 or x86_64 are not so constrained. 
*/ #define _GNU_SOURCE #include @@ -27,15 +20,6 @@ #define LENGTH (256UL*1024*1024) #define PROTECTION (PROT_READ | PROT_WRITE) -/* Only ia64 requires this */ -#ifdef __ia64__ -#define ADDR (void *)(0x8000000000000000UL) -#define FLAGS (MAP_SHARED | MAP_FIXED) -#else -#define ADDR (void *)(0x0UL) -#define FLAGS (MAP_SHARED) -#endif - static void check_bytes(char *addr) { ksft_print_msg("First hex is %x\n", *((unsigned int *)addr)); @@ -74,7 +58,7 @@ int main(void) if (fd < 0) ksft_exit_fail_msg("memfd_create() failed: %s\n", strerror(errno)); - addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, fd, 0); + addr = mmap(NULL, LENGTH, PROTECTION, MAP_SHARED, fd, 0); if (addr == MAP_FAILED) { close(fd); ksft_exit_fail_msg("mmap(): %s\n", strerror(errno)); diff --git a/tools/testing/selftests/mm/hugepage-shm.c b/tools/testing/selftests/mm/hugepage-shm.c index 478bb1e989e9..ef06260802b5 100644 --- a/tools/testing/selftests/mm/hugepage-shm.c +++ b/tools/testing/selftests/mm/hugepage-shm.c @@ -8,13 +8,6 @@ * SHM_HUGETLB in the shmget system call to inform the kernel that it is * requesting huge pages. * - * For the ia64 architecture, the Linux kernel reserves Region number 4 for - * huge pages. That means that if one requires a fixed address, a huge page - * aligned address starting with 0x800000... will be required. If a fixed - * address is not required, the kernel will select an address in the proper - * range. - * Other architectures, such as ppc64, i386 or x86_64 are not so constrained. - * * Note: The default shared memory limit is quite low on many kernels, * you may need to increase it via: * @@ -39,15 +32,6 @@ #define dprintf(x) printf(x) -/* Only ia64 requires this */ -#ifdef __ia64__ -#define ADDR (void *)(0x8000000000000000UL) -#define SHMAT_FLAGS (SHM_RND) -#else -#define ADDR (void *)(0x0UL) -#define SHMAT_FLAGS (0) -#endif - int main(void) { int shmid; @@ -61,7 +45,7 @@ int main(void) } printf("shmid: 0x%x\n", shmid); - shmaddr = shmat(shmid, ADDR, SHMAT_FLAGS); + shmaddr = shmat(shmid, NULL, 0); if (shmaddr == (char *)-1) { perror("Shared memory attach failure"); shmctl(shmid, IPC_RMID, NULL); diff --git a/tools/testing/selftests/mm/hugepage-vmemmap.c b/tools/testing/selftests/mm/hugepage-vmemmap.c index 894d28c3dd47..df366a4d1b92 100644 --- a/tools/testing/selftests/mm/hugepage-vmemmap.c +++ b/tools/testing/selftests/mm/hugepage-vmemmap.c @@ -22,20 +22,6 @@ #define PM_PFRAME_BITS 55 #define PM_PFRAME_MASK ~((1UL << PM_PFRAME_BITS) - 1) -/* - * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages. - * That means the addresses starting with 0x800000... will need to be - * specified. Specifying a fixed address is not required on ppc64, i386 - * or x86_64. 
- */ -#ifdef __ia64__ -#define MAP_ADDR (void *)(0x8000000000000000UL) -#define MAP_FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED) -#else -#define MAP_ADDR NULL -#define MAP_FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB) -#endif - static size_t pagesize; static size_t maplength; @@ -113,7 +99,8 @@ int main(int argc, char **argv) exit(1); } - addr = mmap(MAP_ADDR, maplength, PROT_READ | PROT_WRITE, MAP_FLAGS, -1, 0); + addr = mmap(NULL, maplength, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0); if (addr == MAP_FAILED) { perror("mmap"); exit(1); diff --git a/tools/testing/selftests/mm/map_hugetlb.c b/tools/testing/selftests/mm/map_hugetlb.c index a1f005a90a4f..b47399feab53 100644 --- a/tools/testing/selftests/mm/map_hugetlb.c +++ b/tools/testing/selftests/mm/map_hugetlb.c @@ -4,11 +4,6 @@ * system call with MAP_HUGETLB flag. Before running this program make * sure the administrator has allocated enough default sized huge pages * to cover the 256 MB allocation. - * - * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages. - * That means the addresses starting with 0x800000... will need to be - * specified. Specifying a fixed address is not required on ppc64, i386 - * or x86_64. */ #include #include @@ -21,15 +16,6 @@ #define LENGTH (256UL*1024*1024) #define PROTECTION (PROT_READ | PROT_WRITE) -/* Only ia64 requires this */ -#ifdef __ia64__ -#define ADDR (void *)(0x8000000000000000UL) -#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED) -#else -#define ADDR (void *)(0x0UL) -#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB) -#endif - static void check_bytes(char *addr) { ksft_print_msg("First hex is %x\n", *((unsigned int *)addr)); @@ -60,7 +46,7 @@ int main(int argc, char **argv) void *addr; size_t hugepage_size; size_t length = LENGTH; - int flags = FLAGS; + int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB; int shift = 0; hugepage_size = default_huge_page_size(); @@ -85,7 +71,7 @@ int main(int argc, char **argv) ksft_print_msg("Default size hugepages\n"); ksft_print_msg("Mapping %lu Mbytes\n", (unsigned long)length >> 20); - addr = mmap(ADDR, length, PROTECTION, flags, -1, 0); + addr = mmap(NULL, length, PROTECTION, flags, -1, 0); if (addr == MAP_FAILED) ksft_exit_fail_msg("mmap: %s\n", strerror(errno)); diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index 36045edb10de..c5797ad1d37b 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -189,7 +189,7 @@ else fi # filter 64bit architectures -ARCH64STR="arm64 ia64 mips64 parisc64 ppc64 ppc64le riscv64 s390x sparc64 x86_64" +ARCH64STR="arm64 mips64 parisc64 ppc64 ppc64le riscv64 s390x sparc64 x86_64" if [ -z "$ARCH" ]; then ARCH=$(uname -m 2>/dev/null | sed -e 's/aarch64.*/arm64/') fi From a759e37fb46708029c9c3c56c3b62e6f24d85cf5 Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Sun, 18 Aug 2024 23:01:51 +0200 Subject: [PATCH 206/416] err.h: add ERR_PTR_PCPU(), PTR_ERR_PCPU() and IS_ERR_PCPU() macros Add ERR_PTR_PCPU(), PTR_ERR_PCPU() and IS_ERR_PCPU() macros that operate on pointers in the percpu address space. These macros remove the need for (__force void *) function argument casts (to avoid sparse -Wcast-from-as warnings). The patch will also avoid future build errors due to pointer address space mismatch with enabled strict percpu address space checks. 
Link: https://lkml.kernel.org/r/20240818210235.33481-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak Acked-by: Catalin Marinas Signed-off-by: Andrew Morton --- include/linux/err.h | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/include/linux/err.h b/include/linux/err.h index b5d9bb2a2349..a4dacd745fcf 100644 --- a/include/linux/err.h +++ b/include/linux/err.h @@ -41,6 +41,9 @@ static inline void * __must_check ERR_PTR(long error) return (void *) error; } +/* Return the pointer in the percpu address space. */ +#define ERR_PTR_PCPU(error) ((void __percpu *)(unsigned long)ERR_PTR(error)) + /** * PTR_ERR - Extract the error code from an error pointer. * @ptr: An error pointer. @@ -51,6 +54,9 @@ static inline long __must_check PTR_ERR(__force const void *ptr) return (long) ptr; } +/* Read an error pointer from the percpu address space. */ +#define PTR_ERR_PCPU(ptr) (PTR_ERR((const void *)(__force const unsigned long)(ptr))) + /** * IS_ERR - Detect an error pointer. * @ptr: The pointer to check. @@ -61,6 +67,9 @@ static inline bool __must_check IS_ERR(__force const void *ptr) return IS_ERR_VALUE((unsigned long)ptr); } +/* Read an error pointer from the percpu address space. */ +#define IS_ERR_PCPU(ptr) (IS_ERR((const void *)(__force const unsigned long)(ptr))) + /** * IS_ERR_OR_NULL - Detect an error pointer or a null pointer. * @ptr: The pointer to check. From 8c8685928910d0aae477dc7683037c3fcb3a904a Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Sun, 18 Aug 2024 23:01:52 +0200 Subject: [PATCH 207/416] mm/kmemleak: use IS_ERR_PCPU() for pointer in the percpu address space Use IS_ERR_PCPU() instead of IS_ERR() for pointers in the percpu address space. The patch also fixes following sparse warnings: kmemleak.c:1063:39: warning: cast removes address space '__percpu' of expression kmemleak.c:1138:37: warning: cast removes address space '__percpu' of expression Link: https://lkml.kernel.org/r/20240818210235.33481-2-ubizjak@gmail.com Signed-off-by: Uros Bizjak Acked-by: Catalin Marinas Signed-off-by: Andrew Morton --- mm/kmemleak.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 6b498c6d9c34..0400f5e8ac60 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -1070,8 +1070,8 @@ void __ref kmemleak_alloc_percpu(const void __percpu *ptr, size_t size, { pr_debug("%s(0x%px, %zu)\n", __func__, ptr, size); - if (kmemleak_enabled && ptr && !IS_ERR(ptr)) - create_object_percpu((unsigned long)ptr, size, 1, gfp); + if (kmemleak_enabled && ptr && !IS_ERR_PCPU(ptr)) + create_object_percpu((__force unsigned long)ptr, size, 0, gfp); } EXPORT_SYMBOL_GPL(kmemleak_alloc_percpu); @@ -1145,8 +1145,8 @@ void __ref kmemleak_free_percpu(const void __percpu *ptr) { pr_debug("%s(0x%px)\n", __func__, ptr); - if (kmemleak_free_enabled && ptr && !IS_ERR(ptr)) - delete_object_full((unsigned long)ptr, OBJECT_PERCPU); + if (kmemleak_free_enabled && ptr && !IS_ERR_PCPU(ptr)) + delete_object_full((__force unsigned long)ptr, OBJECT_PERCPU); } EXPORT_SYMBOL_GPL(kmemleak_free_percpu); From ef5f379de302884b9b7ad9b62587a942a9f0bb55 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 20 Aug 2024 14:22:10 +0200 Subject: [PATCH 208/416] mm: always inline _compound_head() with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y We already force-inline page_fixed_fake_head(), page_is_fake_head() and PageTail(), however the compiler might decide that _compound_head() is not worthy to be inlined, because of page_fixed_fake_head(). 
The result is that, for example, PageAnonExclusive() now might involve a function call when checking PageHuge(), which performs a page_folio()->_compound_head() call. This can lead to a slight regression of the stress-ng.clone benchmark. This is not super-urgent to fix, but always inlining _compound_head() seems like the obvious thing to do for this primitive, similar to the other ones. This change restores the slight regression and a compilation with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y shows no relevant bloat [2]: add/remove: 15/14 grow/shrink: 79/87 up/down: 12836/-13917 (-1081) ... Total: Before=32786363, After=32785282, chg -0.00% [1] https://lkml.kernel.org/r/817150f2-abf7-430f-9973-540bd6cdd26f@intel.com [2] https://lore.kernel.org/all/116e117c-2821-401d-8e62-b85cdec37f4a@redhat.com/ Link: https://lkml.kernel.org/r/20240820122210.660140-1-david@redhat.com Fixes: c0bff412e67b ("mm: allow anon exclusive check over hugetlb tail pages") Signed-off-by: David Hildenbrand Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202407301049.5051dc19-oliver.sang@intel.com Tested-by: Yin Fengwei Cc: Peter Xu Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 17175a328e6b..2515ae85f191 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -232,7 +232,7 @@ static __always_inline int page_is_fake_head(const struct page *page) return page_fixed_fake_head(page) != page; } -static inline unsigned long _compound_head(const struct page *page) +static __always_inline unsigned long _compound_head(const struct page *page) { unsigned long head = READ_ONCE(page->compound_head); From fda6d4de064a9d37414df36d45836898fff5e165 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Tue, 20 Aug 2024 17:49:13 +0800 Subject: [PATCH 209/416] mm: khugepaged: expand the is_refcount_suitable() to support file folios Patch series "support shmem mTHP collapse", v2. Shmem already supports mTHP allocation[1], and this patchset adds support for shmem mTHP collapse, as well as adding relevant test cases. This patch (of 5): Expand the is_refcount_suitable() to support reference checks for file folios, as preparation for supporting shmem mTHP collapse. 
Link: https://lkml.kernel.org/r/cover.1724140601.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/eae4cb3195ebbb654bfb7967cb7261d4e4e7c7fa.1724140601.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Acked-by: David Hildenbrand Cc: Barry Song <21cnbao@gmail.com> Cc: Hugh Dickins Cc: Matthew Wilcox Cc: Ryan Roberts Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/khugepaged.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index cdd1d8655a76..76f05215ee87 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -546,12 +546,14 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte, static bool is_refcount_suitable(struct folio *folio) { - int expected_refcount; + int expected_refcount = folio_mapcount(folio); - expected_refcount = folio_mapcount(folio); - if (folio_test_swapcache(folio)) + if (!folio_test_anon(folio) || folio_test_swapcache(folio)) expected_refcount += folio_nr_pages(folio); + if (folio_test_private(folio)) + expected_refcount++; + return folio_ref_count(folio) == expected_refcount; } @@ -2285,8 +2287,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, break; } - if (folio_ref_count(folio) != - 1 + folio_mapcount(folio) + folio_test_private(folio)) { + if (!is_refcount_suitable(folio)) { result = SCAN_PAGE_COUNT; break; } From d6b8f296e8d74d685627fd746558745c13b8bd32 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Tue, 20 Aug 2024 17:49:14 +0800 Subject: [PATCH 210/416] mm: khugepaged: use the number of pages in the folio to check the reference count Use the number of pages in the folio to check the reference count as preparation for supporting shmem mTHP collapse. Link: https://lkml.kernel.org/r/9ea49262308de28957596cc6e8edc2d3a4f54659.1724140601.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Acked-by: David Hildenbrand Cc: Barry Song <21cnbao@gmail.com> Cc: Hugh Dickins Cc: Matthew Wilcox Cc: Ryan Roberts Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/khugepaged.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 76f05215ee87..e015c94eba09 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1988,9 +1988,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, VM_BUG_ON_FOLIO(folio != xa_load(xas.xa, index), folio); /* - * We control three references to the folio: + * We control 2 + nr_pages references to the folio: * - we hold a pin on it; - * - one reference from page cache; + * - nr_pages reference from page cache; * - one from lru_isolate_folio; * If those are the only references, then any new usage * of the folio will have to fetch it from the page @@ -1998,7 +1998,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, * truncate, so any new usage will be blocked until we * unlock folio after collapse/during rollback. 
*/ - if (folio_ref_count(folio) != 3) { + if (folio_ref_count(folio) != 2 + folio_nr_pages(folio)) { result = SCAN_PAGE_COUNT; xas_unlock_irq(&xas); folio_putback_lru(folio); @@ -2181,7 +2181,7 @@ immap_locked: folio_clear_active(folio); folio_clear_unevictable(folio); folio_unlock(folio); - folio_put_refs(folio, 3); + folio_put_refs(folio, 2 + folio_nr_pages(folio)); } goto out; From dfa98f56d932fca3eaadaed8c17393fdfa00574d Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Tue, 20 Aug 2024 17:49:15 +0800 Subject: [PATCH 211/416] mm: khugepaged: support shmem mTHP copy Iterate each subpage in the large folio to copy, as preparation for supporting shmem mTHP collapse. Link: https://lkml.kernel.org/r/222d615b7c837eabb47a238126c5fdeff8aa5283.1724140601.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand Cc: Hugh Dickins Cc: Matthew Wilcox Cc: Ryan Roberts Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/khugepaged.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index e015c94eba09..4996f7487c13 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -2056,17 +2056,22 @@ xa_unlocked: index = start; dst = folio_page(new_folio, 0); list_for_each_entry(folio, &pagelist, lru) { + int i, nr_pages = folio_nr_pages(folio); + while (index < folio->index) { clear_highpage(dst); index++; dst++; } - if (copy_mc_highpage(dst, folio_page(folio, 0)) > 0) { - result = SCAN_COPY_MC; - goto rollback; + + for (i = 0; i < nr_pages; i++) { + if (copy_mc_highpage(dst, folio_page(folio, i)) > 0) { + result = SCAN_COPY_MC; + goto rollback; + } + index++; + dst++; } - index++; - dst++; } while (index < end) { clear_highpage(dst); From 7de856ffd007f132fd4a0474b289a622a3f88cd7 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Tue, 20 Aug 2024 17:49:16 +0800 Subject: [PATCH 212/416] mm: khugepaged: support shmem mTHP collapse Shmem already supports the allocation of mTHP, but khugepaged does not yet support collapsing mTHP folios. Now khugepaged is ready to support mTHP, and this patch enables the collapse of shmem mTHP. Link: https://lkml.kernel.org/r/b9da76aab4276eb6e5d12c479af2b5eea5b4575d.1724140601.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand Cc: Hugh Dickins Cc: Matthew Wilcox Cc: Ryan Roberts Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/khugepaged.c | 28 +++++++++++----------------- 1 file changed, 11 insertions(+), 17 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 4996f7487c13..4a83c40d9053 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1843,7 +1843,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, } } while (1); - for (index = start; index < end; index++) { + for (index = start; index < end;) { xas_set(&xas, index); folio = xas_load(&xas); @@ -1862,6 +1862,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, } } nr_none++; + index++; continue; } @@ -1943,12 +1944,10 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, * we locked the first folio, then a THP might be there already. * This will be discovered on the first iteration. */ - if (folio_test_large(folio)) { - result = folio_order(folio) == HPAGE_PMD_ORDER && - folio->index == start - /* Maybe PMD-mapped */ - ? 
SCAN_PTE_MAPPED_HUGEPAGE - : SCAN_PAGE_COMPOUND; + if (folio_order(folio) == HPAGE_PMD_ORDER && + folio->index == start) { + /* Maybe PMD-mapped */ + result = SCAN_PTE_MAPPED_HUGEPAGE; goto out_unlock; } @@ -2009,6 +2008,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, * Accumulate the folios that are being collapsed. */ list_add_tail(&folio->lru, &pagelist); + index += folio_nr_pages(folio); continue; out_unlock: folio_unlock(folio); @@ -2261,16 +2261,10 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr, continue; } - /* - * TODO: khugepaged should compact smaller compound pages - * into a PMD sized page - */ - if (folio_test_large(folio)) { - result = folio_order(folio) == HPAGE_PMD_ORDER && - folio->index == start - /* Maybe PMD-mapped */ - ? SCAN_PTE_MAPPED_HUGEPAGE - : SCAN_PAGE_COMPOUND; + if (folio_order(folio) == HPAGE_PMD_ORDER && + folio->index == start) { + /* Maybe PMD-mapped */ + result = SCAN_PTE_MAPPED_HUGEPAGE; /* * For SCAN_PTE_MAPPED_HUGEPAGE, further processing * by the caller won't touch the page cache, and so From 2e6d88e9d455fae9c4ac95c893362258f9540dfa Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Tue, 20 Aug 2024 17:49:17 +0800 Subject: [PATCH 213/416] selftests: mm: support shmem mTHP collapse testing Add shmem mTHP collpase testing. Similar to the anonymous page, users can use the '-s' parameter to specify the shmem mTHP size for testing. Link: https://lkml.kernel.org/r/fa44bfa20ca5b9fd6f9163a048f3d3c1e53cd0a8.1724140601.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand Cc: Hugh Dickins Cc: Matthew Wilcox Cc: Ryan Roberts Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/khugepaged.c | 4 +- tools/testing/selftests/mm/thp_settings.c | 46 ++++++++++++++++++++--- tools/testing/selftests/mm/thp_settings.h | 9 ++++- 3 files changed, 51 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c index 829320a519e7..56d4480e8d3c 100644 --- a/tools/testing/selftests/mm/khugepaged.c +++ b/tools/testing/selftests/mm/khugepaged.c @@ -1095,7 +1095,7 @@ static void usage(void) fprintf(stderr, "\n\tSupported Options:\n"); fprintf(stderr, "\t\t-h: This help message.\n"); fprintf(stderr, "\t\t-s: mTHP size, expressed as page order.\n"); - fprintf(stderr, "\t\t Defaults to 0. Use this size for anon allocations.\n"); + fprintf(stderr, "\t\t Defaults to 0. 
Use this size for anon or shmem allocations.\n"); exit(1); } @@ -1209,6 +1209,8 @@ int main(int argc, char **argv) default_settings.khugepaged.pages_to_scan = hpage_pmd_nr * 8; default_settings.hugepages[hpage_pmd_order].enabled = THP_INHERIT; default_settings.hugepages[anon_order].enabled = THP_ALWAYS; + default_settings.shmem_hugepages[hpage_pmd_order].enabled = SHMEM_INHERIT; + default_settings.shmem_hugepages[anon_order].enabled = SHMEM_ALWAYS; save_settings(); thp_push_settings(&default_settings); diff --git a/tools/testing/selftests/mm/thp_settings.c b/tools/testing/selftests/mm/thp_settings.c index a4163438108e..577eaab6266f 100644 --- a/tools/testing/selftests/mm/thp_settings.c +++ b/tools/testing/selftests/mm/thp_settings.c @@ -33,10 +33,11 @@ static const char * const thp_defrag_strings[] = { }; static const char * const shmem_enabled_strings[] = { + "never", "always", "within_size", "advise", - "never", + "inherit", "deny", "force", NULL @@ -200,6 +201,7 @@ void thp_write_num(const char *name, unsigned long num) void thp_read_settings(struct thp_settings *settings) { unsigned long orders = thp_supported_orders(); + unsigned long shmem_orders = thp_shmem_supported_orders(); char path[PATH_MAX]; int i; @@ -234,12 +236,24 @@ void thp_read_settings(struct thp_settings *settings) settings->hugepages[i].enabled = thp_read_string(path, thp_enabled_strings); } + + for (i = 0; i < NR_ORDERS; i++) { + if (!((1 << i) & shmem_orders)) { + settings->shmem_hugepages[i].enabled = SHMEM_NEVER; + continue; + } + snprintf(path, PATH_MAX, "hugepages-%ukB/shmem_enabled", + (getpagesize() >> 10) << i); + settings->shmem_hugepages[i].enabled = + thp_read_string(path, shmem_enabled_strings); + } } void thp_write_settings(struct thp_settings *settings) { struct khugepaged_settings *khugepaged = &settings->khugepaged; unsigned long orders = thp_supported_orders(); + unsigned long shmem_orders = thp_shmem_supported_orders(); char path[PATH_MAX]; int enabled; int i; @@ -271,6 +285,15 @@ void thp_write_settings(struct thp_settings *settings) enabled = settings->hugepages[i].enabled; thp_write_string(path, thp_enabled_strings[enabled]); } + + for (i = 0; i < NR_ORDERS; i++) { + if (!((1 << i) & shmem_orders)) + continue; + snprintf(path, PATH_MAX, "hugepages-%ukB/shmem_enabled", + (getpagesize() >> 10) << i); + enabled = settings->shmem_hugepages[i].enabled; + thp_write_string(path, shmem_enabled_strings[enabled]); + } } struct thp_settings *thp_current_settings(void) @@ -324,17 +347,18 @@ void thp_set_read_ahead_path(char *path) dev_queue_read_ahead_path[sizeof(dev_queue_read_ahead_path) - 1] = '\0'; } -unsigned long thp_supported_orders(void) +static unsigned long __thp_supported_orders(bool is_shmem) { unsigned long orders = 0; char path[PATH_MAX]; char buf[256]; - int ret; - int i; + int ret, i; + char anon_dir[] = "enabled"; + char shmem_dir[] = "shmem_enabled"; for (i = 0; i < NR_ORDERS; i++) { - ret = snprintf(path, PATH_MAX, THP_SYSFS "hugepages-%ukB/enabled", - (getpagesize() >> 10) << i); + ret = snprintf(path, PATH_MAX, THP_SYSFS "hugepages-%ukB/%s", + (getpagesize() >> 10) << i, is_shmem ? 
shmem_dir : anon_dir); if (ret >= PATH_MAX) { printf("%s: Pathname is too long\n", __func__); exit(EXIT_FAILURE); @@ -347,3 +371,13 @@ unsigned long thp_supported_orders(void) return orders; } + +unsigned long thp_supported_orders(void) +{ + return __thp_supported_orders(false); +} + +unsigned long thp_shmem_supported_orders(void) +{ + return __thp_supported_orders(true); +} diff --git a/tools/testing/selftests/mm/thp_settings.h b/tools/testing/selftests/mm/thp_settings.h index 71cbff05f4c7..876235a23460 100644 --- a/tools/testing/selftests/mm/thp_settings.h +++ b/tools/testing/selftests/mm/thp_settings.h @@ -22,10 +22,11 @@ enum thp_defrag { }; enum shmem_enabled { + SHMEM_NEVER, SHMEM_ALWAYS, SHMEM_WITHIN_SIZE, SHMEM_ADVISE, - SHMEM_NEVER, + SHMEM_INHERIT, SHMEM_DENY, SHMEM_FORCE, }; @@ -46,6 +47,10 @@ struct khugepaged_settings { unsigned long pages_to_scan; }; +struct shmem_hugepages_settings { + enum shmem_enabled enabled; +}; + struct thp_settings { enum thp_enabled thp_enabled; enum thp_defrag thp_defrag; @@ -54,6 +59,7 @@ struct thp_settings { struct khugepaged_settings khugepaged; unsigned long read_ahead_kb; struct hugepages_settings hugepages[NR_ORDERS]; + struct shmem_hugepages_settings shmem_hugepages[NR_ORDERS]; }; int read_file(const char *path, char *buf, size_t buflen); @@ -76,5 +82,6 @@ void thp_save_settings(void); void thp_set_read_ahead_path(char *path); unsigned long thp_supported_orders(void); +unsigned long thp_shmem_supported_orders(void); #endif /* __THP_SETTINGS_H__ */ From 49029c4db368518edac0409a40df9d852aac92f5 Mon Sep 17 00:00:00 2001 From: Thorsten Blum Date: Tue, 20 Aug 2024 06:22:55 +0200 Subject: [PATCH 214/416] mm: shrinker: use min() to improve shrinker_debugfs_scan_write() Use the min() macro to simplify the shrinker_debugfs_scan_write() function and improve its readability. Link: https://lkml.kernel.org/r/20240820042254.99115-2-thorsten.blum@toblux.com Signed-off-by: Thorsten Blum Reviewed-by: Muchun Song Cc: Dave Chinner Cc: Qi Zheng Cc: Roman Gushchin Signed-off-by: Andrew Morton --- mm/shrinker_debug.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c index 12ea5486a3e9..4a85b94d12ce 100644 --- a/mm/shrinker_debug.c +++ b/mm/shrinker_debug.c @@ -114,7 +114,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file, int nid; char kbuf[72]; - read_len = size < (sizeof(kbuf) - 1) ? 
size : (sizeof(kbuf) - 1); + read_len = min(size, sizeof(kbuf) - 1); if (copy_from_user(kbuf, buf, read_len)) return -EFAULT; kbuf[read_len] = '\0'; From cd5f3193b432cd70cc1c19aba790300dd11ae934 Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Tue, 20 Aug 2024 11:26:30 +0800 Subject: [PATCH 215/416] mm: remove migration for HugePage in isolate_single_pageblock() The gigantic page size may larger than memory block size, so memory offline always fails in this case after commit b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity"), offline_pages start_isolate_page_range start_isolate_page_range(isolate_before=true) isolate [isolate_start, isolate_start + pageblock_nr_pages) start_isolate_page_range(isolate_before=false) isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock __alloc_contig_migrate_range isolate_migratepages_range isolate_migratepages_block isolate_or_dissolve_huge_page if (hstate_is_gigantic(h)) return -ENOMEM; [ 15.815756] memory offlining [mem 0x3c0000000-0x3c7ffffff] failed due to failure to isolate range Gigantic PageHuge is bigger than a pageblock, but since it is freed as order-0 pages, its pageblocks after being freed will get to the right free list. There is no need to have special handling code for them in start_isolate_page_range(). For both alloc_contig_range() and memory offline cases, the migration code after start_isolate_page_range() will be able to migrate gigantic PageHuge when possible. Let's clean up start_isolate_page_range() and fix the aforementioned memory offline failure issue all together. Let's clean up start_isolate_page_range() and fix the aforementioned memory offline failure issue all together. Link: https://lkml.kernel.org/r/20240820032630.1894770-1-wangkefeng.wang@huawei.com Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity") Signed-off-by: Kefeng Wang Acked-by: David Hildenbrand Acked-by: Zi Yan Cc: Matthew Wilcox Cc: Oscar Salvador Signed-off-by: Andrew Morton --- mm/page_isolation.c | 28 +++------------------------- 1 file changed, 3 insertions(+), 25 deletions(-) diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 39fb8c07aeb7..7e04047977cf 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -403,30 +403,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, unsigned long head_pfn = page_to_pfn(head); unsigned long nr_pages = compound_nr(head); - if (head_pfn + nr_pages <= boundary_pfn) { - pfn = head_pfn + nr_pages; - continue; - } - -#if defined CONFIG_COMPACTION || defined CONFIG_CMA - if (PageHuge(page)) { - int page_mt = get_pageblock_migratetype(page); - struct compact_control cc = { - .nr_migratepages = 0, - .order = -1, - .zone = page_zone(pfn_to_page(head_pfn)), - .mode = MIGRATE_SYNC, - .ignore_skip_hint = true, - .no_set_skip_hint = true, - .gfp_mask = gfp_flags, - .alloc_contig = true, - }; - INIT_LIST_HEAD(&cc.migratepages); - - ret = __alloc_contig_migrate_range(&cc, head_pfn, - head_pfn + nr_pages, page_mt); - if (ret) - goto failed; + if (head_pfn + nr_pages <= boundary_pfn || + PageHuge(page)) { pfn = head_pfn + nr_pages; continue; } @@ -440,7 +418,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, */ VM_WARN_ON_ONCE_PAGE(PageLRU(page), page); VM_WARN_ON_ONCE_PAGE(__PageMovable(page), page); -#endif + goto failed; } From 0a2d82946be67e02fdf85a4010606bdc0546ba44 Mon Sep 17 00:00:00 2001 From: Yafang Shao Date: Tue, 20 Aug 2024 10:26:39 +0800 Subject: [PATCH 216/416] mm: allow read-ahead with 
IOCB_NOWAIT set Readahead support for IOCB_NOWAIT was introduced in commit 2e85abf053b9 ("mm: allow read-ahead with IOCB_NOWAIT set"). However, this implementation broke the semantics of IOCB_NOWAIT by potentially causing it to wait on I/O during memory reclamation. This behavior was later modified in commit efa8480a8316 ("fs: RWF_NOWAIT should imply IOCB_NOIO"). To resolve the blocking issue during memory reclamation, we can use memalloc_noio_{save,restore} to ensure non-blocking behavior. This change restores the original functionality, allowing preadv2(IOCB_NOWAIT) to trigger readahead if the file content is not present in the page cache. While this process may trigger direct memory reclamation, the __GFP_NORETRY flag is set in the readahead GFP flags, ensuring it won't block. A use case for this change is when we want to trigger readahead in the preadv2(2) syscall if the file cache is absent, but without waiting for certain filesystem locks, like xfs_ilock. A simple example is as follows: retry: if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) { do_other_work(); goto retry; } Link: https://lore.gnuweeb.org/io-uring/20200624164127.GP21350@casper.infradead.org/ Link: https://lkml.kernel.org/r/20240820022639.89562-1-laoar.shao@gmail.com Signed-off-by: Yafang Shao Cc: Jens Axboe Cc: Matthew Wilcox Cc: Dave Chinner Cc: Jan Kara Cc: Christian Brauner Signed-off-by: Andrew Morton --- include/linux/fs.h | 1 - mm/filemap.c | 6 ++++++ 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 6ca11e241a24..8610242ec2ab 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3461,7 +3461,6 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags, if (flags & RWF_NOWAIT) { if (!(ki->ki_filp->f_mode & FMODE_NOWAIT)) return -EOPNOTSUPP; - kiocb_flags |= IOCB_NOIO; } if (flags & RWF_ATOMIC) { if (rw_type != WRITE) diff --git a/mm/filemap.c b/mm/filemap.c index fdaa1e5e3078..070dee9791a9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -46,6 +46,7 @@ #include #include #include +#include #include #include #include "internal.h" @@ -2519,6 +2520,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count, pgoff_t index = iocb->ki_pos >> PAGE_SHIFT; pgoff_t last_index; struct folio *folio; + unsigned int flags; int err = 0; /* "last_index" is the index of the page beyond the end of the read */ @@ -2531,8 +2533,12 @@ retry: if (!folio_batch_count(fbatch)) { if (iocb->ki_flags & IOCB_NOIO) return -EAGAIN; + if (iocb->ki_flags & IOCB_NOWAIT) + flags = memalloc_noio_save(); page_cache_sync_readahead(mapping, ra, filp, index, last_index - index); + if (iocb->ki_flags & IOCB_NOWAIT) + memalloc_noio_restore(flags); filemap_get_read_batch(mapping, index, last_index - 1, fbatch); } if (!folio_batch_count(fbatch)) { From 4d1b3416659be70a2251b494e85e25978de06519 Mon Sep 17 00:00:00 2001 From: Pedro Falcato Date: Sat, 17 Aug 2024 01:18:28 +0100 Subject: [PATCH 217/416] mm: move can_modify_vma to mm/vma.h Patch series "mm: Optimize mseal checks", v3. Optimize mseal checks by removing the separate can_modify_mm() step, and just doing checks on the individual vmas, when various operations are themselves iterating through the tree. This provides a nice speedup and restores performance parity with pre-mseal[3]. 
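As a rough sketch of the shape of this optimization (illustrative only, not the actual hunks, which follow in the individual patches; do_operation() here is a stand-in for the real munmap/mprotect/madvise work):

	/* Before: a dedicated maple tree walk just for the seal check. */
	if (!can_modify_mm(mm, start, end))
		return -EPERM;
	for_each_vma_range(vmi, vma, end)
		do_operation(vma);

	/* After: the per-VMA check rides along with the walk the operation
	 * already performs, using the inlined can_modify_vma() from mm/vma.h. */
	for_each_vma_range(vmi, vma, end) {
		if (!can_modify_vma(vma))
			return -EPERM;
		do_operation(vma);
	}
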
will-it-scale mmap1_process[1] -t 1 results: commit 3450fe2b574b4345e4296ccae395149e1a357fee: min:277605 max:277605 total:277605 min:281784 max:281784 total:281784 min:277238 max:277238 total:277238 min:281761 max:281761 total:281761 min:274279 max:274279 total:274279 min:254854 max:254854 total:254854 measurement min:269143 max:269143 total:269143 min:270454 max:270454 total:270454 min:243523 max:243523 total:243523 min:251148 max:251148 total:251148 min:209669 max:209669 total:209669 min:190426 max:190426 total:190426 min:231219 max:231219 total:231219 min:275364 max:275364 total:275364 min:266540 max:266540 total:266540 min:242572 max:242572 total:242572 min:284469 max:284469 total:284469 min:278882 max:278882 total:278882 min:283269 max:283269 total:283269 min:281204 max:281204 total:281204 After this patch set: min:280580 max:280580 total:280580 min:290514 max:290514 total:290514 min:291006 max:291006 total:291006 min:290352 max:290352 total:290352 min:294582 max:294582 total:294582 min:293075 max:293075 total:293075 measurement min:295613 max:295613 total:295613 min:294070 max:294070 total:294070 min:293193 max:293193 total:293193 min:291631 max:291631 total:291631 min:295278 max:295278 total:295278 min:293782 max:293782 total:293782 min:290361 max:290361 total:290361 min:294517 max:294517 total:294517 min:293750 max:293750 total:293750 min:293572 max:293572 total:293572 min:295239 max:295239 total:295239 min:292932 max:292932 total:292932 min:293319 max:293319 total:293319 min:294954 max:294954 total:294954 This was a Completely Unscientific test but seems to show there were around 5-10% gains on ops per second. Oliver performed his own tests and showed[3] a similar ~5% gain in them. [1]: mmap1_process does mmap and munmap in a loop. I didn't bother testing multithreading cases. [2]: https://lore.kernel.org/all/20240807124103.85644-1-mpe@ellerman.id.au/ [3]: https://lore.kernel.org/all/ZrMMJfe9aXSWxJz6@xsang-OptiPlex-9020/ Link: https://lore.kernel.org/all/202408041602.caa0372-oliver.sang@intel.com/ This patch (of 7): Move can_modify_vma to vma.h so it can be inlined properly (with the intent to remove can_modify_mm callsites). Link: https://lkml.kernel.org/r/20240817-mseal-depessimize-v3-1-d8d2e037df30@gmail.com Signed-off-by: Pedro Falcato Reviewed-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Jeff Xu Cc: Kees Cook Cc: Linus Torvalds Cc: Michael Ellerman Cc: Pedro Falcato Cc: Shuah Khan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mseal.c | 17 ----------------- mm/vma.h | 28 ++++++++++++++++++++++++++++ 2 files changed, 28 insertions(+), 17 deletions(-) diff --git a/mm/mseal.c b/mm/mseal.c index 15bba28acc00..2170e2139ca0 100644 --- a/mm/mseal.c +++ b/mm/mseal.c @@ -16,28 +16,11 @@ #include #include "internal.h" -static inline bool vma_is_sealed(struct vm_area_struct *vma) -{ - return (vma->vm_flags & VM_SEALED); -} - static inline void set_vma_sealed(struct vm_area_struct *vma) { vm_flags_set(vma, VM_SEALED); } -/* - * check if a vma is sealed for modification. - * return true, if modification is allowed. 
- */ -static bool can_modify_vma(struct vm_area_struct *vma) -{ - if (unlikely(vma_is_sealed(vma))) - return false; - - return true; -} - static bool is_madv_discard(int behavior) { switch (behavior) { diff --git a/mm/vma.h b/mm/vma.h index 6efdf1768a0a..e979015cc7fc 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -361,4 +361,32 @@ struct vm_area_struct *vma_iter_prev_range(struct vma_iterator *vmi) return mas_prev_range(&vmi->mas, 0); } +#ifdef CONFIG_64BIT + +static inline bool vma_is_sealed(struct vm_area_struct *vma) +{ + return (vma->vm_flags & VM_SEALED); +} + +/* + * check if a vma is sealed for modification. + * return true, if modification is allowed. + */ +static inline bool can_modify_vma(struct vm_area_struct *vma) +{ + if (unlikely(vma_is_sealed(vma))) + return false; + + return true; +} + +#else + +static inline bool can_modify_vma(struct vm_area_struct *vma) +{ + return true; +} + +#endif + #endif /* __MM_VMA_H */ From df2a7df9a9aa32c3df227de346693e6e802c8591 Mon Sep 17 00:00:00 2001 From: Pedro Falcato Date: Sat, 17 Aug 2024 01:18:29 +0100 Subject: [PATCH 218/416] mm/munmap: replace can_modify_mm with can_modify_vma We were doing an extra mmap tree traversal just to check if the entire range is modifiable. This can be done when we iterate through the VMAs instead. Link: https://lkml.kernel.org/r/20240817-mseal-depessimize-v3-2-d8d2e037df30@gmail.com Signed-off-by: Pedro Falcato Reviewed-by: Liam R. Howlett LGTM, Reviewed-by: Lorenzo Stoakes Cc: Jeff Xu Cc: Kees Cook Cc: Linus Torvalds Cc: Michael Ellerman Cc: Shuah Khan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mmap.c | 11 +---------- mm/vma.c | 19 ++++++++++++------- 2 files changed, 13 insertions(+), 17 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 7cba84f8e3a5..548bc45a27bf 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1740,16 +1740,7 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long start, unsigned long end, struct list_head *uf, bool unlock) { - struct mm_struct *mm = vma->vm_mm; - - /* - * Check if memory is sealed, prevent unmapping a sealed VMA. - * can_modify_mm assumes we have acquired the lock on MM. - */ - if (unlikely(!can_modify_mm(mm, start, end))) - return -EPERM; - - return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock); + return do_vmi_align_munmap(vmi, vma, vma->vm_mm, start, end, uf, unlock); } /* diff --git a/mm/vma.c b/mm/vma.c index 84965f2cd580..5850f7c0949b 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -712,6 +712,12 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count) goto map_count_exceeded; + /* Don't bother splitting the VMA if we can't unmap it anyway */ + if (!can_modify_vma(vma)) { + error = -EPERM; + goto start_split_failed; + } + error = __split_vma(vmi, vma, start, 1); if (error) goto start_split_failed; @@ -723,6 +729,11 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, */ next = vma; do { + if (!can_modify_vma(next)) { + error = -EPERM; + goto modify_vma_failed; + } + /* Does it split the end? 
*/ if (next->vm_end > end) { error = __split_vma(vmi, next, end, 0); @@ -815,6 +826,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, __mt_destroy(&mt_detach); return 0; +modify_vma_failed: clear_tree_failed: userfaultfd_error: munmap_gather_failed: @@ -860,13 +872,6 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, if (end == start) return -EINVAL; - /* - * Check if memory is sealed, prevent unmapping a sealed VMA. - * can_modify_mm assumes we have acquired the lock on MM. - */ - if (unlikely(!can_modify_mm(mm, start, end))) - return -EPERM; - /* Find the first overlapping VMA */ vma = vma_find(vmi, end); if (!vma) { From 4a2dd02b09160ee43f96c759fafa7b56dfc33816 Mon Sep 17 00:00:00 2001 From: Pedro Falcato Date: Sat, 17 Aug 2024 01:18:30 +0100 Subject: [PATCH 219/416] mm/mprotect: replace can_modify_mm with can_modify_vma Avoid taking an extra trip down the mmap tree by checking the vmas directly. mprotect (per POSIX) tolerates partial failure. Link: https://lkml.kernel.org/r/20240817-mseal-depessimize-v3-3-d8d2e037df30@gmail.com Signed-off-by: Pedro Falcato Reviewed-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Jeff Xu Cc: Kees Cook Cc: Linus Torvalds Cc: Michael Ellerman Cc: Shuah Khan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mprotect.c | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 446f8e5f10d9..0c5d6d06107d 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -611,6 +611,9 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb, unsigned long charged = 0; int error; + if (!can_modify_vma(vma)) + return -EPERM; + if (newflags == oldflags) { *pprev = vma; return 0; @@ -769,15 +772,6 @@ static int do_mprotect_pkey(unsigned long start, size_t len, } } - /* - * checking if memory is sealed. - * can_modify_mm assumes we have acquired the lock on MM. - */ - if (unlikely(!can_modify_mm(current->mm, start, end))) { - error = -EPERM; - goto out; - } - prev = vma_prev(&vmi); if (start > vma->vm_start) prev = vma; From 38075679b5f157eeacd46c900e9cfc684bdbc167 Mon Sep 17 00:00:00 2001 From: Pedro Falcato Date: Sat, 17 Aug 2024 01:18:31 +0100 Subject: [PATCH 220/416] mm/mremap: replace can_modify_mm with can_modify_vma Delegate all can_modify checks to the proper places. Unmap checks are done in do_unmap (et al). The source VMA check is done purposefully before unmapping, to keep the original mseal semantics. Link: https://lkml.kernel.org/r/20240817-mseal-depessimize-v3-4-d8d2e037df30@gmail.com Signed-off-by: Pedro Falcato Reviewed-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Jeff Xu Cc: Kees Cook Cc: Linus Torvalds Cc: Michael Ellerman Cc: Shuah Khan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mremap.c | 32 ++++++-------------------------- 1 file changed, 6 insertions(+), 26 deletions(-) diff --git a/mm/mremap.c b/mm/mremap.c index e7ae140fc640..24712f8dbb6b 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -902,19 +902,6 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len, if ((mm->map_count + 2) >= sysctl_max_map_count - 3) return -ENOMEM; - /* - * In mremap_to(). - * Move a VMA to another location, check if src addr is sealed. - * - * Place can_modify_mm here because mremap_to() - * does its own checking for address range, and we only - * check the sealing after passing those checks. - * - * can_modify_mm assumes we have acquired the lock on MM. 
- */ - if (unlikely(!can_modify_mm(mm, addr, addr + old_len))) - return -EPERM; - if (flags & MREMAP_FIXED) { /* * In mremap_to(). @@ -1052,6 +1039,12 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, goto out; } + /* Don't allow remapping vmas when they have already been sealed */ + if (!can_modify_vma(vma)) { + ret = -EPERM; + goto out; + } + if (is_vm_hugetlb_page(vma)) { struct hstate *h __maybe_unused = hstate_vma(vma); @@ -1079,19 +1072,6 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, goto out; } - /* - * Below is shrink/expand case (not mremap_to()) - * Check if src address is sealed, if so, reject. - * In other words, prevent shrinking or expanding a sealed VMA. - * - * Place can_modify_mm here so we can keep the logic related to - * shrink/expand together. - */ - if (unlikely(!can_modify_mm(mm, addr, addr + old_len))) { - ret = -EPERM; - goto out; - } - /* * Always allow a shrinking remap: that just unmaps * the unnecessary pages.. From 23c57d1fa2b9530e38f7964b4e457fed5a7a0ae8 Mon Sep 17 00:00:00 2001 From: Pedro Falcato Date: Sat, 17 Aug 2024 01:18:32 +0100 Subject: [PATCH 221/416] mseal: replace can_modify_mm_madv with a vma variant Replace can_modify_mm_madv() with a single vma variant, and associated checks in madvise. While we're at it, also invert the order of checks in: if (unlikely(is_ro_anon(vma) && !can_modify_vma(vma)) Checking if we can modify the vma itself (through vm_flags) is certainly cheaper than is_ro_anon() due to arch_vma_access_permitted() looking at e.g pkeys registers (with extra branches) in some architectures. This patch allows for partial madvise success when finding a sealed VMA, which historically has been allowed in Linux. Link: https://lkml.kernel.org/r/20240817-mseal-depessimize-v3-5-d8d2e037df30@gmail.com Signed-off-by: Pedro Falcato Reviewed-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Jeff Xu Cc: Kees Cook Cc: Linus Torvalds Cc: Michael Ellerman Cc: Shuah Khan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/internal.h | 2 -- mm/madvise.c | 13 +++---------- mm/mseal.c | 17 ++++------------- mm/vma.h | 7 +++++++ 4 files changed, 14 insertions(+), 25 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index cbe4849f6e73..241cf46a19f5 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1361,8 +1361,6 @@ static inline int can_do_mseal(unsigned long flags) bool can_modify_mm(struct mm_struct *mm, unsigned long start, unsigned long end); -bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start, - unsigned long end, int behavior); #else static inline int can_do_mseal(unsigned long flags) { diff --git a/mm/madvise.c b/mm/madvise.c index 89089d84f8df..4e64770be16c 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1031,6 +1031,9 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, struct anon_vma_name *anon_name; unsigned long new_flags = vma->vm_flags; + if (unlikely(!can_modify_vma_madv(vma, behavior))) + return -EPERM; + switch (behavior) { case MADV_REMOVE: return madvise_remove(vma, prev, start, end); @@ -1448,15 +1451,6 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh start = untagged_addr_remote(mm, start); end = start + len; - /* - * Check if the address range is sealed for do_madvise(). - * can_modify_mm_madv assumes we have acquired the lock on MM. 
- */ - if (unlikely(!can_modify_mm_madv(mm, start, end, behavior))) { - error = -EPERM; - goto out; - } - blk_start_plug(&plug); switch (behavior) { case MADV_POPULATE_READ: @@ -1470,7 +1464,6 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh } blk_finish_plug(&plug); -out: if (write) mmap_write_unlock(mm); else diff --git a/mm/mseal.c b/mm/mseal.c index 2170e2139ca0..fdd1666344fa 100644 --- a/mm/mseal.c +++ b/mm/mseal.c @@ -75,24 +75,15 @@ bool can_modify_mm(struct mm_struct *mm, unsigned long start, unsigned long end) } /* - * Check if the vmas of a memory range are allowed to be modified by madvise. - * the memory ranger can have a gap (unallocated memory). - * return true, if it is allowed. + * Check if a vma is allowed to be modified by madvise. */ -bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start, unsigned long end, - int behavior) +bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior) { - struct vm_area_struct *vma; - - VMA_ITERATOR(vmi, mm, start); - if (!is_madv_discard(behavior)) return true; - /* going through each vma to check. */ - for_each_vma_range(vmi, vma, end) - if (unlikely(is_ro_anon(vma) && !can_modify_vma(vma))) - return false; + if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma))) + return false; /* Allow by default. */ return true; diff --git a/mm/vma.h b/mm/vma.h index e979015cc7fc..da31d0f62157 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -380,6 +380,8 @@ static inline bool can_modify_vma(struct vm_area_struct *vma) return true; } +bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior); + #else static inline bool can_modify_vma(struct vm_area_struct *vma) @@ -387,6 +389,11 @@ static inline bool can_modify_vma(struct vm_area_struct *vma) return true; } +static inline bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior) +{ + return true; +} + #endif #endif /* __MM_VMA_H */ From 5b3db2b812a1f86dfab587324d198a5d10c7a5cf Mon Sep 17 00:00:00 2001 From: Pedro Falcato Date: Sat, 17 Aug 2024 01:18:33 +0100 Subject: [PATCH 222/416] mm: remove can_modify_mm() With no more users in the tree, we can finally remove can_modify_mm(). Link: https://lkml.kernel.org/r/20240817-mseal-depessimize-v3-6-d8d2e037df30@gmail.com Signed-off-by: Pedro Falcato Reviewed-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Jeff Xu Cc: Kees Cook Cc: Linus Torvalds Cc: Michael Ellerman Cc: Shuah Khan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/internal.h | 14 -------------- mm/mseal.c | 21 --------------------- 2 files changed, 35 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 241cf46a19f5..82fb9f1b0cd4 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1359,25 +1359,11 @@ static inline int can_do_mseal(unsigned long flags) return 0; } -bool can_modify_mm(struct mm_struct *mm, unsigned long start, - unsigned long end); #else static inline int can_do_mseal(unsigned long flags) { return -EPERM; } - -static inline bool can_modify_mm(struct mm_struct *mm, unsigned long start, - unsigned long end) -{ - return true; -} - -static inline bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start, - unsigned long end, int behavior) -{ - return true; -} #endif #ifdef CONFIG_SHRINKER_DEBUG diff --git a/mm/mseal.c b/mm/mseal.c index fdd1666344fa..28cd17d7aaf2 100644 --- a/mm/mseal.c +++ b/mm/mseal.c @@ -53,27 +53,6 @@ static bool is_ro_anon(struct vm_area_struct *vma) return false; } -/* - * Check if the vmas of a memory range are allowed to be modified. 
- * the memory ranger can have a gap (unallocated memory). - * return true, if it is allowed. - */ -bool can_modify_mm(struct mm_struct *mm, unsigned long start, unsigned long end) -{ - struct vm_area_struct *vma; - - VMA_ITERATOR(vmi, mm, start); - - /* going through each vma to check. */ - for_each_vma_range(vmi, vma, end) { - if (unlikely(!can_modify_vma(vma))) - return false; - } - - /* Allow by default. */ - return true; -} - /* * Check if a vma is allowed to be modified by madvise. */ From f28bdd1b17ec187eaa34845814afaaff99832762 Mon Sep 17 00:00:00 2001 From: Pedro Falcato Date: Sat, 17 Aug 2024 01:18:34 +0100 Subject: [PATCH 223/416] selftests/mm: add more mseal traversal tests Add more mseal traversal tests across VMAs, where we could possibly screw up sealing checks. These test more across-vma traversal for mprotect, munmap and madvise. Particularly, we test for the case where a regular VMA is followed by a sealed VMA. [akpm@linux-foundation.org: remove incorrect comment, per review] [akpm@linux-foundation.org: remove the correct comment, per Pedro] [pedro.falcato@gmail.com: fix mseal's length] Link: https://lkml.kernel.org/r/vc4czyuemmu3kylqb4ctaga6y5yvondlyabimx6jvljlw2fkea@djawlllf45xa Link: https://lkml.kernel.org/r/20240817-mseal-depessimize-v3-7-d8d2e037df30@gmail.com Signed-off-by: Pedro Falcato Reviewed-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Jeff Xu Cc: Kees Cook Cc: Linus Torvalds Cc: Michael Ellerman Cc: Shuah Khan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/mseal_test.c | 106 +++++++++++++++++++++++- 1 file changed, 105 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c index bd0bfda7aae7..e7ed720e9677 100644 --- a/tools/testing/selftests/mm/mseal_test.c +++ b/tools/testing/selftests/mm/mseal_test.c @@ -777,6 +777,42 @@ static void test_seal_mprotect_partial_mprotect(bool seal) REPORT_TEST_PASS(); } +static void test_seal_mprotect_partial_mprotect_tail(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 2 * page_size; + int ret; + int prot; + + /* + * Check if a partial mseal (that results in two vmas) works correctly. + * It might mprotect the first, but it'll never touch the second (msealed) vma. 
+ */ + + setup_single_address(size, &ptr); + FAIL_TEST_IF_FALSE(ptr != (void *)-1); + + if (seal) { + ret = sys_mseal(ptr + page_size, page_size); + FAIL_TEST_IF_FALSE(!ret); + } + + ret = sys_mprotect(ptr, size, PROT_EXEC); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + if (seal) { + FAIL_TEST_IF_FALSE(get_vma_size(ptr + page_size, &prot) > 0); + FAIL_TEST_IF_FALSE(prot == 0x4); + } + + REPORT_TEST_PASS(); +} + + static void test_seal_mprotect_two_vma_with_gap(bool seal) { void *ptr; @@ -994,6 +1030,36 @@ static void test_seal_munmap_vma_with_gap(bool seal) REPORT_TEST_PASS(); } +static void test_seal_munmap_partial_across_vmas(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 2 * page_size; + int ret; + int prot; + + setup_single_address(size, &ptr); + FAIL_TEST_IF_FALSE(ptr != (void *)-1); + + if (seal) { + ret = sys_mseal(ptr + page_size, page_size); + FAIL_TEST_IF_FALSE(!ret); + } + + ret = sys_munmap(ptr, size); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + if (seal) { + FAIL_TEST_IF_FALSE(get_vma_size(ptr + page_size, &prot) > 0); + FAIL_TEST_IF_FALSE(prot == 0x4); + } + + REPORT_TEST_PASS(); +} + static void test_munmap_start_freed(bool seal) { void *ptr; @@ -1746,6 +1812,37 @@ static void test_seal_discard_ro_anon(bool seal) REPORT_TEST_PASS(); } +static void test_seal_discard_across_vmas(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 2 * page_size; + int ret; + + setup_single_address(size, &ptr); + FAIL_TEST_IF_FALSE(ptr != (void *)-1); + + if (seal) { + ret = seal_single_address(ptr + page_size, page_size); + FAIL_TEST_IF_FALSE(!ret); + } + + ret = sys_madvise(ptr, size, MADV_DONTNEED); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + ret = sys_munmap(ptr, size); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + REPORT_TEST_PASS(); +} + + static void test_seal_madvise_nodiscard(bool seal) { void *ptr; @@ -1790,7 +1887,7 @@ int main(int argc, char **argv) if (!pkey_supported()) ksft_print_msg("PKEY not supported\n"); - ksft_set_plan(82); + ksft_set_plan(88); test_seal_addseal(); test_seal_unmapped_start(); @@ -1836,12 +1933,17 @@ int main(int argc, char **argv) test_seal_mprotect_split(false); test_seal_mprotect_split(true); + test_seal_mprotect_partial_mprotect_tail(false); + test_seal_mprotect_partial_mprotect_tail(true); + test_seal_munmap(false); test_seal_munmap(true); test_seal_munmap_two_vma(false); test_seal_munmap_two_vma(true); test_seal_munmap_vma_with_gap(false); test_seal_munmap_vma_with_gap(true); + test_seal_munmap_partial_across_vmas(false); + test_seal_munmap_partial_across_vmas(true); test_munmap_start_freed(false); test_munmap_start_freed(true); @@ -1873,6 +1975,8 @@ int main(int argc, char **argv) test_seal_madvise_nodiscard(true); test_seal_discard_ro_anon(false); test_seal_discard_ro_anon(true); + test_seal_discard_across_vmas(false); + test_seal_discard_across_vmas(true); test_seal_discard_ro_anon_on_rw(false); test_seal_discard_ro_anon_on_rw(true); test_seal_discard_ro_anon_on_shared(false); From e27ad6560e4b5993315b56d6884ca5a4652468f4 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 21 Aug 2024 18:39:09 +0100 Subject: [PATCH 224/416] printf: remove %pGt support Patch series "Increase the number of bits available in page_type". 
Kent wants more than 16 bits in page_type, so I resurrected this old patch and expanded it a bit. It's a bit more efficient than our current scheme (1 4-byte insn vs 3 insns of 13 bytes total) to test a single page type. This patch (of 4): An upcoming patch will convert page type from being a bitfield to a single byte, so we will not be able to use %pG to print the page type any more. The printing of the symbolic name will be restored in that patch. Link: https://lkml.kernel.org/r/20240821173914.2270383-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240821173914.2270383-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Acked-by: David Hildenbrand Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Kent Overstreet Signed-off-by: Andrew Morton --- Documentation/core-api/printk-formats.rst | 4 +--- include/trace/events/mmflags.h | 10 --------- lib/test_printf.c | 26 ----------------------- lib/vsprintf.c | 21 ------------------ mm/debug.c | 2 +- mm/internal.h | 1 - 6 files changed, 2 insertions(+), 62 deletions(-) diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst index 4451ef501936..14e093da3ccd 100644 --- a/Documentation/core-api/printk-formats.rst +++ b/Documentation/core-api/printk-formats.rst @@ -576,13 +576,12 @@ The field width is passed by value, the bitmap is passed by reference. Helper macros cpumask_pr_args() and nodemask_pr_args() are available to ease printing cpumask and nodemask. -Flags bitfields such as page flags, page_type, gfp_flags +Flags bitfields such as page flags and gfp_flags -------------------------------------------------------- :: %pGp 0x17ffffc0002036(referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff) - %pGt 0xffffff7f(buddy) %pGg GFP_USER|GFP_DMA32|GFP_NOWARN %pGv read|exec|mayread|maywrite|mayexec|denywrite @@ -591,7 +590,6 @@ would construct the value. The type of flags is given by the third character. 
Currently supported are: - p - [p]age flags, expects value of type (``unsigned long *``) - - t - page [t]ype, expects value of type (``unsigned int *``) - v - [v]ma_flags, expects value of type (``unsigned long *``) - g - [g]fp_flags, expects value of type (``gfp_t *``) diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index b5c4da370a50..c151cc21d367 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -130,16 +130,6 @@ IF_HAVE_PG_ARCH_X(arch_3) __def_pageflag_names \ ) : "none" -#define DEF_PAGETYPE_NAME(_name) { PG_##_name, __stringify(_name) } - -#define __def_pagetype_names \ - DEF_PAGETYPE_NAME(slab), \ - DEF_PAGETYPE_NAME(hugetlb), \ - DEF_PAGETYPE_NAME(offline), \ - DEF_PAGETYPE_NAME(guard), \ - DEF_PAGETYPE_NAME(table), \ - DEF_PAGETYPE_NAME(buddy) - #if defined(CONFIG_X86) #define __VM_ARCH_SPECIFIC_1 {VM_PAT, "pat" } #elif defined(CONFIG_PPC) diff --git a/lib/test_printf.c b/lib/test_printf.c index 965cb6f28527..8448b6d02bd9 100644 --- a/lib/test_printf.c +++ b/lib/test_printf.c @@ -641,26 +641,12 @@ page_flags_test(int section, int node, int zone, int last_cpupid, test(cmp_buf, "%pGp", &flags); } -static void __init page_type_test(unsigned int page_type, const char *name, - char *cmp_buf) -{ - unsigned long size; - - size = scnprintf(cmp_buf, BUF_SIZE, "%#x(", page_type); - if (page_type_has_type(page_type)) - size += scnprintf(cmp_buf + size, BUF_SIZE - size, "%s", name); - - snprintf(cmp_buf + size, BUF_SIZE - size, ")"); - test(cmp_buf, "%pGt", &page_type); -} - static void __init flags(void) { unsigned long flags; char *cmp_buffer; gfp_t gfp; - unsigned int page_type; cmp_buffer = kmalloc(BUF_SIZE, GFP_KERNEL); if (!cmp_buffer) @@ -700,18 +686,6 @@ flags(void) gfp |= __GFP_HIGH; test(cmp_buffer, "%pGg", &gfp); - page_type = ~0; - page_type_test(page_type, "", cmp_buffer); - - page_type = 10; - page_type_test(page_type, "", cmp_buffer); - - page_type = ~PG_buddy; - page_type_test(page_type, "buddy", cmp_buffer); - - page_type = ~(PG_table | PG_buddy); - page_type_test(page_type, "table|buddy", cmp_buffer); - kfree(cmp_buffer); } diff --git a/lib/vsprintf.c b/lib/vsprintf.c index 2d71b1115916..09f022ba1c05 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -2054,25 +2054,6 @@ char *format_page_flags(char *buf, char *end, unsigned long flags) return buf; } -static -char *format_page_type(char *buf, char *end, unsigned int page_type) -{ - buf = number(buf, end, page_type, default_flag_spec); - - if (buf < end) - *buf = '('; - buf++; - - if (page_type_has_type(page_type)) - buf = format_flags(buf, end, ~page_type, pagetype_names); - - if (buf < end) - *buf = ')'; - buf++; - - return buf; -} - static noinline_for_stack char *flags_string(char *buf, char *end, void *flags_ptr, struct printf_spec spec, const char *fmt) @@ -2086,8 +2067,6 @@ char *flags_string(char *buf, char *end, void *flags_ptr, switch (fmt[1]) { case 'p': return format_page_flags(buf, end, *(unsigned long *)flags_ptr); - case 't': - return format_page_type(buf, end, *(unsigned int *)flags_ptr); case 'v': flags = *(unsigned long *)flags_ptr; names = vmaflag_names; diff --git a/mm/debug.c b/mm/debug.c index 69e524c3e601..9f8e34537957 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -92,7 +92,7 @@ static void __dump_folio(struct folio *folio, struct page *page, pr_warn("%sflags: %pGp%s\n", type, &folio->flags, is_migrate_cma_folio(folio, pfn) ? 
" CMA" : ""); if (page_has_type(&folio->page)) - pr_warn("page_type: %pGt\n", &folio->page.page_type); + pr_warn("page_type: %x\n", folio->page.page_type); print_hex_dump(KERN_WARNING, "raw: ", DUMP_PREFIX_NONE, 32, sizeof(unsigned long), page, diff --git a/mm/internal.h b/mm/internal.h index 82fb9f1b0cd4..5e6f2abcea28 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1144,7 +1144,6 @@ static inline void flush_tlb_batched_pending(struct mm_struct *mm) #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */ extern const struct trace_print_flags pageflag_names[]; -extern const struct trace_print_flags pagetype_names[]; extern const struct trace_print_flags vmaflag_names[]; extern const struct trace_print_flags gfpflag_names[]; From e880034cf7182aa35962bd30dfc9fa60476d56a4 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 21 Aug 2024 18:39:10 +0100 Subject: [PATCH 225/416] mm: introduce page_mapcount_is_type() Resolve the awkward "and add one to this opaque constant" test into a self-documenting inline function. Link: https://lkml.kernel.org/r/20240821173914.2270383-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Acked-by: David Hildenbrand Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Kent Overstreet Signed-off-by: Andrew Morton --- fs/proc/internal.h | 3 +-- include/linux/mm.h | 3 +-- include/linux/page-flags.h | 12 +++++++++--- 3 files changed, 11 insertions(+), 7 deletions(-) diff --git a/fs/proc/internal.h b/fs/proc/internal.h index a8a8576d8592..cc520168f8b6 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -166,8 +166,7 @@ static inline int folio_precise_page_mapcount(struct folio *folio, { int mapcount = atomic_read(&page->_mapcount) + 1; - /* Handle page_has_type() pages */ - if (mapcount < PAGE_MAPCOUNT_RESERVE + 1) + if (page_mapcount_is_type(mapcount)) mapcount = 0; if (folio_test_large(folio)) mapcount += folio_entire_mapcount(folio); diff --git a/include/linux/mm.h b/include/linux/mm.h index 0fde51c0532f..6f6053d7ea6f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1228,8 +1228,7 @@ static inline int folio_mapcount(const struct folio *folio) if (likely(!folio_test_large(folio))) { mapcount = atomic_read(&folio->_mapcount) + 1; - /* Handle page_has_type() pages */ - if (mapcount < PAGE_MAPCOUNT_RESERVE + 1) + if (page_mapcount_is_type(mapcount)) mapcount = 0; return mapcount; } diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 2515ae85f191..0b5484fdbca1 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -956,12 +956,18 @@ enum pagetype { #define folio_test_type(folio, flag) \ ((READ_ONCE(folio->page.page_type) & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) -static inline int page_type_has_type(unsigned int page_type) +static inline bool page_type_has_type(int page_type) { - return (int)page_type < PAGE_MAPCOUNT_RESERVE; + return page_type < PAGE_MAPCOUNT_RESERVE; } -static inline int page_has_type(const struct page *page) +/* This takes a mapcount which is one more than page->_mapcount */ +static inline bool page_mapcount_is_type(unsigned int mapcount) +{ + return page_type_has_type(mapcount - 1); +} + +static inline bool page_has_type(const struct page *page) { return page_type_has_type(READ_ONCE(page->page_type)); } From 4ffca5a96678c9fe1a39ec1e9c8fd326b57ea8a7 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 21 Aug 2024 18:39:11 +0100 Subject: [PATCH 226/416] mm: support only one page_type per page By using a few values in the top byte, users of page_type can 
store up to 24 bits of additional data in page_type. It also reduces the code size as (with replacement of READ_ONCE() with data_race()), the kernel can check just a single byte. eg: ffffffff811e3a79: 8b 47 30 mov 0x30(%rdi),%eax ffffffff811e3a7c: 55 push %rbp ffffffff811e3a7d: 48 89 e5 mov %rsp,%rbp ffffffff811e3a80: 25 00 00 00 82 and $0x82000000,%eax ffffffff811e3a85: 3d 00 00 00 80 cmp $0x80000000,%eax ffffffff811e3a8a: 74 4d je ffffffff811e3ad9 becomes: ffffffff811e3a69: 80 7f 33 f5 cmpb $0xf5,0x33(%rdi) ffffffff811e3a6d: 55 push %rbp ffffffff811e3a6e: 48 89 e5 mov %rsp,%rbp ffffffff811e3a71: 74 4d je ffffffff811e3ac0 replacing three instructions with one. [wangkefeng.wang@huawei.com: fix ubsan warnings] Link: https://lkml.kernel.org/r/2d19c48a-c550-4345-bf36-d05cd303c5de@huawei.com Link: https://lkml.kernel.org/r/20240821173914.2270383-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Acked-by: David Hildenbrand Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Kent Overstreet Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 66 ++++++++++++++++---------------------- kernel/vmcore_info.c | 8 ++--- mm/debug.c | 31 ++++++++++++++---- 3 files changed, 55 insertions(+), 50 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 0b5484fdbca1..7a1b2948b075 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -923,42 +923,29 @@ PAGEFLAG_FALSE(HasHWPoisoned, has_hwpoisoned) #endif /* - * For pages that are never mapped to userspace, - * page_type may be used. Because it is initialised to -1, we invert the - * sense of the bit, so __SetPageFoo *clears* the bit used for PageFoo, and - * __ClearPageFoo *sets* the bit used for PageFoo. We reserve a few high and - * low bits so that an underflow or overflow of _mapcount won't be - * mistaken for a page type value. + * For pages that do not use mapcount, page_type may be used. + * The low 24 bits of pagetype may be used for your own purposes, as long + * as you are careful to not affect the top 8 bits. The low bits of + * pagetype will be overwritten when you clear the page_type from the page. */ - enum pagetype { - PG_buddy = 0x40000000, - PG_offline = 0x20000000, - PG_table = 0x10000000, - PG_guard = 0x08000000, - PG_hugetlb = 0x04000000, - PG_slab = 0x02000000, - PG_zsmalloc = 0x01000000, - PG_unaccepted = 0x00800000, + /* 0x00-0x7f are positive numbers, ie mapcount */ + /* Reserve 0x80-0xef for mapcount overflow. */ + PGTY_buddy = 0xf0, + PGTY_offline = 0xf1, + PGTY_table = 0xf2, + PGTY_guard = 0xf3, + PGTY_hugetlb = 0xf4, + PGTY_slab = 0xf5, + PGTY_zsmalloc = 0xf6, + PGTY_unaccepted = 0xf7, - PAGE_TYPE_BASE = 0x80000000, - - /* - * Reserve 0xffff0000 - 0xfffffffe to catch _mapcount underflows and - * allow owners that set a type to reuse the lower 16 bit for their own - * purposes. 
- */ - PAGE_MAPCOUNT_RESERVE = ~0x0000ffff, + PGTY_mapcount_underflow = 0xff }; -#define PageType(page, flag) \ - ((READ_ONCE(page->page_type) & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) -#define folio_test_type(folio, flag) \ - ((READ_ONCE(folio->page.page_type) & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) - static inline bool page_type_has_type(int page_type) { - return page_type < PAGE_MAPCOUNT_RESERVE; + return page_type < (PGTY_mapcount_underflow << 24); } /* This takes a mapcount which is one more than page->_mapcount */ @@ -969,40 +956,41 @@ static inline bool page_mapcount_is_type(unsigned int mapcount) static inline bool page_has_type(const struct page *page) { - return page_type_has_type(READ_ONCE(page->page_type)); + return page_mapcount_is_type(data_race(page->page_type)); } #define FOLIO_TYPE_OPS(lname, fname) \ -static __always_inline bool folio_test_##fname(const struct folio *folio)\ +static __always_inline bool folio_test_##fname(const struct folio *folio) \ { \ - return folio_test_type(folio, PG_##lname); \ + return data_race(folio->page.page_type >> 24) == PGTY_##lname; \ } \ static __always_inline void __folio_set_##fname(struct folio *folio) \ { \ - VM_BUG_ON_FOLIO(!folio_test_type(folio, 0), folio); \ - folio->page.page_type &= ~PG_##lname; \ + VM_BUG_ON_FOLIO(data_race(folio->page.page_type) != UINT_MAX, \ + folio); \ + folio->page.page_type = (unsigned int)PGTY_##lname << 24; \ } \ static __always_inline void __folio_clear_##fname(struct folio *folio) \ { \ VM_BUG_ON_FOLIO(!folio_test_##fname(folio), folio); \ - folio->page.page_type |= PG_##lname; \ + folio->page.page_type = UINT_MAX; \ } #define PAGE_TYPE_OPS(uname, lname, fname) \ FOLIO_TYPE_OPS(lname, fname) \ static __always_inline int Page##uname(const struct page *page) \ { \ - return PageType(page, PG_##lname); \ + return data_race(page->page_type >> 24) == PGTY_##lname; \ } \ static __always_inline void __SetPage##uname(struct page *page) \ { \ - VM_BUG_ON_PAGE(!PageType(page, 0), page); \ - page->page_type &= ~PG_##lname; \ + VM_BUG_ON_PAGE(data_race(page->page_type) != UINT_MAX, page); \ + page->page_type = (unsigned int)PGTY_##lname << 24; \ } \ static __always_inline void __ClearPage##uname(struct page *page) \ { \ VM_BUG_ON_PAGE(!Page##uname(page), page); \ - page->page_type |= PG_##lname; \ + page->page_type = UINT_MAX; \ } /* diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c index 8b4f8cc2e0ec..1fec61603ef3 100644 --- a/kernel/vmcore_info.c +++ b/kernel/vmcore_info.c @@ -198,17 +198,17 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_NUMBER(PG_private); VMCOREINFO_NUMBER(PG_swapcache); VMCOREINFO_NUMBER(PG_swapbacked); -#define PAGE_SLAB_MAPCOUNT_VALUE (~PG_slab) +#define PAGE_SLAB_MAPCOUNT_VALUE (PGTY_slab << 24) VMCOREINFO_NUMBER(PAGE_SLAB_MAPCOUNT_VALUE); #ifdef CONFIG_MEMORY_FAILURE VMCOREINFO_NUMBER(PG_hwpoison); #endif VMCOREINFO_NUMBER(PG_head_mask); -#define PAGE_BUDDY_MAPCOUNT_VALUE (~PG_buddy) +#define PAGE_BUDDY_MAPCOUNT_VALUE (PGTY_buddy << 24) VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE); -#define PAGE_HUGETLB_MAPCOUNT_VALUE (~PG_hugetlb) +#define PAGE_HUGETLB_MAPCOUNT_VALUE (PGTY_hugetlb << 24) VMCOREINFO_NUMBER(PAGE_HUGETLB_MAPCOUNT_VALUE); -#define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline) +#define PAGE_OFFLINE_MAPCOUNT_VALUE (PGTY_offline << 24) VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE); #ifdef CONFIG_KALLSYMS diff --git a/mm/debug.c b/mm/debug.c index 9f8e34537957..aa57d3ffd4ed 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -36,11 +36,6 @@ const struct 
trace_print_flags pageflag_names[] = { {0, NULL} }; -const struct trace_print_flags pagetype_names[] = { - __def_pagetype_names, - {0, NULL} -}; - const struct trace_print_flags gfpflag_names[] = { __def_gfpflag_names, {0, NULL} @@ -51,6 +46,27 @@ const struct trace_print_flags vmaflag_names[] = { {0, NULL} }; +#define DEF_PAGETYPE_NAME(_name) [PGTY_##_name - 0xf0] = __stringify(_name) + +static const char *page_type_names[] = { + DEF_PAGETYPE_NAME(slab), + DEF_PAGETYPE_NAME(hugetlb), + DEF_PAGETYPE_NAME(offline), + DEF_PAGETYPE_NAME(guard), + DEF_PAGETYPE_NAME(table), + DEF_PAGETYPE_NAME(buddy), + DEF_PAGETYPE_NAME(unaccepted), +}; + +static const char *page_type_name(unsigned int page_type) +{ + unsigned i = (page_type >> 24) - 0xf0; + + if (i >= ARRAY_SIZE(page_type_names)) + return "unknown"; + return page_type_names[i]; +} + static void __dump_folio(struct folio *folio, struct page *page, unsigned long pfn, unsigned long idx) { @@ -58,7 +74,7 @@ static void __dump_folio(struct folio *folio, struct page *page, int mapcount = atomic_read(&page->_mapcount); char *type = ""; - mapcount = page_type_has_type(mapcount) ? 0 : mapcount + 1; + mapcount = page_mapcount_is_type(mapcount) ? 0 : mapcount + 1; pr_warn("page: refcount:%d mapcount:%d mapping:%p index:%#lx pfn:%#lx\n", folio_ref_count(folio), mapcount, mapping, folio->index + idx, pfn); @@ -92,7 +108,8 @@ static void __dump_folio(struct folio *folio, struct page *page, pr_warn("%sflags: %pGp%s\n", type, &folio->flags, is_migrate_cma_folio(folio, pfn) ? " CMA" : ""); if (page_has_type(&folio->page)) - pr_warn("page_type: %x\n", folio->page.page_type); + pr_warn("page_type: %x(%s)\n", folio->page.page_type >> 24, + page_type_name(folio->page.page_type)); print_hex_dump(KERN_WARNING, "raw: ", DUMP_PREFIX_NONE, 32, sizeof(unsigned long), page, From 04cb7502a5d70a6ca230e8c24835dc7dccd39fa7 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 21 Aug 2024 18:39:12 +0100 Subject: [PATCH 227/416] zsmalloc: use all available 24 bits of page_type Now that we have an extra 8 bits, we don't need to limit ourselves to supporting a 64KiB page size. I'm sure both Hexagon users are grateful, but it does reduce complexity a little. We can also remove reset_first_obj_offset() as calling __ClearPageZsmalloc() will now reset all 32 bits of page_type. 
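A simplified sketch of the resulting page_type layout (illustrative only; the authoritative helpers are the PAGE_TYPE_OPS macros and zsmalloc's set_first_obj_offset()/get_first_obj_offset() in the diffs):

	/*  31        24 23                         0
	 *  +-----------+----------------------------+
	 *  |  PGTY_*   |  owner data (e.g. zsmalloc |
	 *  | type byte |  first-object offset)      |
	 *  +-----------+----------------------------+
	 */
	#define FIRST_OBJ_PAGE_TYPE_MASK	0xffffff	/* low 24 bits */

	/* Setting the type stores the PGTY_ value in the top byte with the low
	 * bits cleared; the owner may then keep any value below 2^24 there: */
	page->page_type = (unsigned int)PGTY_zsmalloc << 24;
	page->page_type = (page->page_type & ~FIRST_OBJ_PAGE_TYPE_MASK) | offset;
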
Link: https://lkml.kernel.org/r/20240821173914.2270383-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Acked-by: David Hildenbrand Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Kent Overstreet Signed-off-by: Andrew Morton --- drivers/block/zram/Kconfig | 1 - mm/Kconfig | 10 ++-------- mm/zsmalloc.c | 15 ++++----------- 3 files changed, 6 insertions(+), 20 deletions(-) diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig index eacf1cba7bf4..7b29cce60ab2 100644 --- a/drivers/block/zram/Kconfig +++ b/drivers/block/zram/Kconfig @@ -2,7 +2,6 @@ config ZRAM tristate "Compressed RAM block device support" depends on BLOCK && SYSFS && MMU - depends on HAVE_ZSMALLOC depends on CRYPTO_LZO || CRYPTO_ZSTD || CRYPTO_LZ4 || CRYPTO_LZ4HC || CRYPTO_842 select ZSMALLOC help diff --git a/mm/Kconfig b/mm/Kconfig index 3936fe4d26d9..5946dcdcaeda 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -128,7 +128,7 @@ config ZSWAP_COMPRESSOR_DEFAULT choice prompt "Default allocator" depends on ZSWAP - default ZSWAP_ZPOOL_DEFAULT_ZSMALLOC if HAVE_ZSMALLOC + default ZSWAP_ZPOOL_DEFAULT_ZSMALLOC if MMU default ZSWAP_ZPOOL_DEFAULT_ZBUD help Selects the default allocator for the compressed cache for @@ -154,7 +154,6 @@ config ZSWAP_ZPOOL_DEFAULT_Z3FOLD config ZSWAP_ZPOOL_DEFAULT_ZSMALLOC bool "zsmalloc" - depends on HAVE_ZSMALLOC select ZSMALLOC help Use the zsmalloc allocator as the default allocator. @@ -187,15 +186,10 @@ config Z3FOLD page. It is a ZBUD derivative so the simplicity and determinism are still there. -config HAVE_ZSMALLOC - def_bool y - depends on MMU - depends on PAGE_SIZE_LESS_THAN_256KB # we want <= 64 KiB - config ZSMALLOC tristate prompt "N:1 compression allocator (zsmalloc)" if ZSWAP - depends on HAVE_ZSMALLOC + depends on MMU help zsmalloc is a slab-based memory allocator designed to store pages of various compression levels efficiently. It achieves diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 2d3163e4da96..73a3ec5b21ad 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -20,7 +20,7 @@ * page->index: links together all component pages of a zspage * For the huge page, this is always 0, so we use this field * to store handle. - * page->page_type: PG_zsmalloc, lower 16 bit locate the first object + * page->page_type: PGTY_zsmalloc, lower 24 bits locate the first object * offset in a subpage of a zspage * * Usage of struct page flags: @@ -452,13 +452,7 @@ static inline struct page *get_first_page(struct zspage *zspage) return first_page; } -#define FIRST_OBJ_PAGE_TYPE_MASK 0xffff - -static inline void reset_first_obj_offset(struct page *page) -{ - VM_WARN_ON_ONCE(!PageZsmalloc(page)); - page->page_type |= FIRST_OBJ_PAGE_TYPE_MASK; -} +#define FIRST_OBJ_PAGE_TYPE_MASK 0xffffff static inline unsigned int get_first_obj_offset(struct page *page) { @@ -468,8 +462,8 @@ static inline unsigned int get_first_obj_offset(struct page *page) static inline void set_first_obj_offset(struct page *page, unsigned int offset) { - /* With 16 bit available, we can support offsets into 64 KiB pages. */ - BUILD_BUG_ON(PAGE_SIZE > SZ_64K); + /* With 24 bits available, we can support offsets into 16 MiB pages. 
*/ + BUILD_BUG_ON(PAGE_SIZE > SZ_16M); VM_WARN_ON_ONCE(!PageZsmalloc(page)); VM_WARN_ON_ONCE(offset & ~FIRST_OBJ_PAGE_TYPE_MASK); page->page_type &= ~FIRST_OBJ_PAGE_TYPE_MASK; @@ -808,7 +802,6 @@ static void reset_page(struct page *page) ClearPagePrivate(page); set_page_private(page, 0); page->index = 0; - reset_first_obj_offset(page); __ClearPageZsmalloc(page); } From bf03c806993091d8fb2fc0e98f07a8ef251c78f3 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 21 Aug 2024 20:34:34 +0100 Subject: [PATCH 228/416] mm: remove PageActive Patch series "Simplify the page flags a little". In the course of our folio conversions, we have made many page flags only used on folios, so we can now remove the page-based accessors. This should cut down compile time a little, and prevent new users from cropping up. There is more that could be done in this area, but it would produce merge conflicts, so I'll sit on those patches until next merge window. We now have line of sight to removing PG_private_2 and PG_private. This patch (of 10): This flag is now only used on folios, so we can remove all the page accessors. [akpm@linux-foundation.org: fix arch/powerpc/mm/pgtable-frag.c] Link: https://lkml.kernel.org/r/20240821193445.2294269-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240821193445.2294269-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- arch/powerpc/mm/pgtable-frag.c | 6 +++--- include/linux/page-flags.h | 5 +++-- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c index 8c31802f97e8..e89f64a0f24a 100644 --- a/arch/powerpc/mm/pgtable-frag.c +++ b/arch/powerpc/mm/pgtable-frag.c @@ -136,10 +136,10 @@ void pte_fragment_free(unsigned long *table, int kernel) #ifdef CONFIG_TRANSPARENT_HUGEPAGE void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable) { - struct page *page; + struct folio *folio; - page = virt_to_page(pgtable); - SetPageActive(page); + folio = virt_to_folio(pgtable); + folio_set_active(folio); pte_fragment_free((unsigned long *)pgtable, 0); } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 7a1b2948b075..34347bee1217 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -510,8 +510,9 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD) __CLEARPAGEFLAG(Dirty, dirty, PF_HEAD) PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD) TESTCLEARFLAG(LRU, lru, PF_HEAD) -PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD) - TESTCLEARFLAG(Active, active, PF_HEAD) +FOLIO_FLAG(active, FOLIO_HEAD_PAGE) + __FOLIO_CLEAR_FLAG(active, FOLIO_HEAD_PAGE) + FOLIO_TEST_CLEAR_FLAG(active, FOLIO_HEAD_PAGE) PAGEFLAG(Workingset, workingset, PF_HEAD) TESTCLEARFLAG(Workingset, workingset, PF_HEAD) PAGEFLAG(Checked, checked, PF_NO_COMPOUND) /* Used by some filesystems */ From 0b7582803649e0e58e5f8c9a4ede362777bc7eec Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 21 Aug 2024 20:34:35 +0100 Subject: [PATCH 229/416] mm: remove PageSwapBacked This flag is now only used on folios, so we can remove all the page accessors. 
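The conversion pattern is the same across this series; as an illustration (a sketch, with names following the visible FOLIO_FLAG()/__FOLIO_SET_FLAG() macro pattern rather than an exhaustive list):

	folio_test_swapbacked(folio);		/* replaces PageSwapBacked(page) */
	__folio_set_swapbacked(folio);		/* replaces __SetPageSwapBacked(page) */
	__folio_clear_swapbacked(folio);	/* replaces __ClearPageSwapBacked(page) */

Callers that still hold a struct page can go through page_folio(page) first.
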
Link: https://lkml.kernel.org/r/20240821193445.2294269-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 34347bee1217..cebb3dc0fd0c 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -528,9 +528,9 @@ PAGEFLAG(XenRemapped, xen_remapped, PF_NO_COMPOUND) PAGEFLAG(Reserved, reserved, PF_NO_COMPOUND) __CLEARPAGEFLAG(Reserved, reserved, PF_NO_COMPOUND) __SETPAGEFLAG(Reserved, reserved, PF_NO_COMPOUND) -PAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL) - __CLEARPAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL) - __SETPAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL) +FOLIO_FLAG(swapbacked, FOLIO_HEAD_PAGE) + __FOLIO_CLEAR_FLAG(swapbacked, FOLIO_HEAD_PAGE) + __FOLIO_SET_FLAG(swapbacked, FOLIO_HEAD_PAGE) /* * Private page markings that may be used by the filesystem that owns the page From 6f394ee9ddb4bd1879e42c9c8648a37f650bea99 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 21 Aug 2024 20:34:36 +0100 Subject: [PATCH 230/416] mm: remove PageReadahead This flag is now only used on folios, so we can remove all the page accessors. Link: https://lkml.kernel.org/r/20240821193445.2294269-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index cebb3dc0fd0c..71444223d630 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -553,8 +553,8 @@ PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL) /* PG_readahead is only used for reads; PG_reclaim is only for writes */ PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL) TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL) -PAGEFLAG(Readahead, readahead, PF_NO_COMPOUND) - TESTCLEARFLAG(Readahead, readahead, PF_NO_COMPOUND) +FOLIO_FLAG(readahead, FOLIO_HEAD_PAGE) + FOLIO_TEST_CLEAR_FLAG(readahead, FOLIO_HEAD_PAGE) #ifdef CONFIG_HIGHMEM /* From 32f51ead3d7771cdec29f75e08d50a76d2c6253d Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 21 Aug 2024 20:34:37 +0100 Subject: [PATCH 231/416] mm: remove PageSwapCache This flag is now only used on folios, so we can remove all the page accessors and reword the comments that refer to them. Link: https://lkml.kernel.org/r/20240821193445.2294269-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- include/linux/mm_types.h | 2 +- include/linux/page-flags.h | 11 +++-------- mm/ksm.c | 19 ++++++++++--------- mm/migrate.c | 3 ++- mm/shmem.c | 11 ++++++----- 5 files changed, 22 insertions(+), 24 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 2419e60c9a7f..6e3bdf8e38bc 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -109,7 +109,7 @@ struct page { /** * @private: Mapping-private opaque data. * Usually used for buffer_heads if PagePrivate. - * Used for swp_entry_t if PageSwapCache. + * Used for swp_entry_t if swapcache flag set. * Indicates order in the buddy system if PageBuddy. 
*/ unsigned long private; diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 71444223d630..b31d2c50b6f0 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -574,15 +574,10 @@ static __always_inline bool folio_test_swapcache(const struct folio *folio) test_bit(PG_swapcache, const_folio_flags(folio, 0)); } -static __always_inline bool PageSwapCache(const struct page *page) -{ - return folio_test_swapcache(page_folio(page)); -} - -SETPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) -CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) +FOLIO_SET_FLAG(swapcache, FOLIO_HEAD_PAGE) +FOLIO_CLEAR_FLAG(swapcache, FOLIO_HEAD_PAGE) #else -PAGEFLAG_FALSE(SwapCache, swapcache) +FOLIO_FLAG_FALSE(swapcache) #endif PAGEFLAG(Unevictable, unevictable, PF_HEAD) diff --git a/mm/ksm.c b/mm/ksm.c index 8e53666bc7b0..a2e2a521df0a 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -909,12 +909,13 @@ again: */ while (!folio_try_get(folio)) { /* - * Another check for page->mapping != expected_mapping would - * work here too. We have chosen the !PageSwapCache test to - * optimize the common case, when the page is or is about to - * be freed: PageSwapCache is cleared (under spin_lock_irq) - * in the ref_freeze section of __remove_mapping(); but Anon - * folio->mapping reset to NULL later, in free_pages_prepare(). + * Another check for folio->mapping != expected_mapping + * would work here too. We have chosen to test the + * swapcache flag to optimize the common case, when the + * folio is or is about to be freed: the swapcache flag + * is cleared (under spin_lock_irq) in the ref_freeze + * section of __remove_mapping(); but anon folio->mapping + * is reset to NULL later, in free_pages_prepare(). */ if (!folio_test_swapcache(folio)) goto stale; @@ -945,7 +946,7 @@ again: stale: /* - * We come here from above when page->mapping or !PageSwapCache + * We come here from above when folio->mapping or the swapcache flag * suggests that the node is stale; but it might be under migration. * We need smp_rmb(), matching the smp_wmb() in folio_migrate_ksm(), * before checking whether node->kpfn has been changed. @@ -1452,7 +1453,7 @@ static int try_to_merge_one_page(struct vm_area_struct *vma, goto out; /* - * We need the page lock to read a stable PageSwapCache in + * We need the folio lock to read a stable swapcache flag in * write_protect_page(). We use trylock_page() instead of * lock_page() because we don't want to wait here - we * prefer to continue scanning and merging different pages, @@ -3123,7 +3124,7 @@ void folio_migrate_ksm(struct folio *newfolio, struct folio *folio) * newfolio->mapping was set in advance; now we need smp_wmb() * to make sure that the new stable_node->kpfn is visible * to ksm_get_folio() before it can see that folio->mapping - * has gone stale (or that folio_test_swapcache has been cleared). + * has gone stale (or that the swapcache flag has been cleared). */ smp_wmb(); folio_set_stable_node(folio, NULL); diff --git a/mm/migrate.c b/mm/migrate.c index 76cfc6c42eb3..2250baa85ac2 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -639,7 +639,8 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio) folio_migrate_ksm(newfolio, folio); /* * Please do not reorder this without considering how mm/ksm.c's - * ksm_get_folio() depends upon ksm_migrate_page() and PageSwapCache(). + * ksm_get_folio() depends upon ksm_migrate_page() and the + * swapcache flag. 
*/ if (folio_test_swapcache(folio)) folio_clear_swapcache(folio); diff --git a/mm/shmem.c b/mm/shmem.c index 1448a7a9a282..866d46d0c43d 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -502,8 +502,8 @@ static int shmem_replace_entry(struct address_space *mapping, * Sometimes, before we decide whether to proceed or to fail, we must check * that an entry was not already brought back from swap by a racing thread. * - * Checking page is not enough: by the time a SwapCache page is locked, it - * might be reused, and again be SwapCache, using the same swap as before. + * Checking folio is not enough: by the time a swapcache folio is locked, it + * might be reused, and again be swapcache, using the same swap as before. */ static bool shmem_confirm_swap(struct address_space *mapping, pgoff_t index, swp_entry_t swap) @@ -1965,9 +1965,10 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp, if (unlikely(error)) { /* - * Is this possible? I think not, now that our callers check - * both PageSwapCache and page_private after getting page lock; - * but be defensive. Reverse old to newpage for clear and free. + * Is this possible? I think not, now that our callers + * check both the swapcache flag and folio->private + * after getting the folio lock; but be defensive. + * Reverse old to newpage for clear and free. */ old = new; } else { From cb29e7941d5d57576392a52c08077ef327ec8e17 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 21 Aug 2024 20:34:38 +0100 Subject: [PATCH 232/416] mm: remove PageUnevictable There is only one caller of PageUnevictable() left; convert it to call folio_test_unevictable() and remove all the page accessors. Link: https://lkml.kernel.org/r/20240821193445.2294269-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 6 +++--- mm/huge_memory.c | 16 ++++++++-------- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index b31d2c50b6f0..2150502195d5 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -580,9 +580,9 @@ FOLIO_CLEAR_FLAG(swapcache, FOLIO_HEAD_PAGE) FOLIO_FLAG_FALSE(swapcache) #endif -PAGEFLAG(Unevictable, unevictable, PF_HEAD) - __CLEARPAGEFLAG(Unevictable, unevictable, PF_HEAD) - TESTCLEARFLAG(Unevictable, unevictable, PF_HEAD) +FOLIO_FLAG(unevictable, FOLIO_HEAD_PAGE) + __FOLIO_CLEAR_FLAG(unevictable, FOLIO_HEAD_PAGE) + FOLIO_TEST_CLEAR_FLAG(unevictable, FOLIO_HEAD_PAGE) #ifdef CONFIG_MMU PAGEFLAG(Mlocked, mlocked, PF_NO_TAIL) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ef6d4f617c4a..8dfb912ebdaa 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2932,25 +2932,25 @@ static void remap_page(struct folio *folio, unsigned long nr) } } -static void lru_add_page_tail(struct page *head, struct page *tail, +static void lru_add_page_tail(struct folio *folio, struct page *tail, struct lruvec *lruvec, struct list_head *list) { - VM_BUG_ON_PAGE(!PageHead(head), head); - VM_BUG_ON_PAGE(PageLRU(tail), head); + VM_BUG_ON_FOLIO(!folio_test_large(folio), folio); + VM_BUG_ON_FOLIO(PageLRU(tail), folio); lockdep_assert_held(&lruvec->lru_lock); if (list) { /* page reclaim is reclaiming a huge page */ - VM_WARN_ON(PageLRU(head)); + VM_WARN_ON(folio_test_lru(folio)); get_page(tail); list_add_tail(&tail->lru, list); } else { /* head is still on lru (and we have it frozen) */ - VM_WARN_ON(!PageLRU(head)); - if (PageUnevictable(tail)) + VM_WARN_ON(!folio_test_lru(folio)); + if 
(folio_test_unevictable(folio)) tail->mlock_count = 0; else - list_add_tail(&tail->lru, &head->lru); + list_add_tail(&tail->lru, &folio->lru); SetPageLRU(tail); } } @@ -3049,7 +3049,7 @@ static void __split_huge_page_tail(struct folio *folio, int tail, * pages to show after the currently processed elements - e.g. * migrate_pages */ - lru_add_page_tail(head, page_tail, lruvec, list); + lru_add_page_tail(folio, page_tail, lruvec, list); } static void __split_huge_page(struct page *page, struct list_head *list, From 99f86bbda31790f2ca0e365f02650585536929ab Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 21 Aug 2024 20:34:39 +0100 Subject: [PATCH 233/416] mm: remove PageMlocked This flag is now only used on folios, so we can remove all the page accessors. Link: https://lkml.kernel.org/r/20240821193445.2294269-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- Documentation/mm/unevictable-lru.rst | 4 ++-- include/linux/page-flags.h | 13 ++++++++----- 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst index 2feb2ed51ae2..255ef12a432b 100644 --- a/Documentation/mm/unevictable-lru.rst +++ b/Documentation/mm/unevictable-lru.rst @@ -253,8 +253,8 @@ Basic Management mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable pages. When such a page has been "noticed" by the memory management subsystem, -the page is marked with the PG_mlocked flag. This can be manipulated using the -PageMlocked() functions. +the folio is marked with the PG_mlocked flag. This can be manipulated using +folio_set_mlocked() and folio_clear_mlocked() functions. A PG_mlocked page will be placed on the unevictable list when it is added to the LRU. Such pages can be "noticed" by memory management in several places: diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 2150502195d5..08cce16c82aa 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -585,12 +585,15 @@ FOLIO_FLAG(unevictable, FOLIO_HEAD_PAGE) FOLIO_TEST_CLEAR_FLAG(unevictable, FOLIO_HEAD_PAGE) #ifdef CONFIG_MMU -PAGEFLAG(Mlocked, mlocked, PF_NO_TAIL) - __CLEARPAGEFLAG(Mlocked, mlocked, PF_NO_TAIL) - TESTSCFLAG(Mlocked, mlocked, PF_NO_TAIL) +FOLIO_FLAG(mlocked, FOLIO_HEAD_PAGE) + __FOLIO_CLEAR_FLAG(mlocked, FOLIO_HEAD_PAGE) + FOLIO_TEST_CLEAR_FLAG(mlocked, FOLIO_HEAD_PAGE) + FOLIO_TEST_SET_FLAG(mlocked, FOLIO_HEAD_PAGE) #else -PAGEFLAG_FALSE(Mlocked, mlocked) __CLEARPAGEFLAG_NOOP(Mlocked, mlocked) - TESTSCFLAG_FALSE(Mlocked, mlocked) +FOLIO_FLAG_FALSE(mlocked) + __FOLIO_CLEAR_FLAG_NOOP(mlocked) + FOLIO_TEST_CLEAR_FLAG_FALSE(mlocked) + FOLIO_TEST_SET_FLAG_FALSE(mlocked) #endif #ifdef CONFIG_ARCH_USES_PG_UNCACHED From 3026bc1e82b695d69cde21712512922abe74aab9 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 21 Aug 2024 20:34:40 +0100 Subject: [PATCH 234/416] mm: remove PageOwnerPriv1 While there are many aliases for this flag, nobody actually uses the *PageOwnerPriv1() nor folio_*_owner_priv_1() accessors. Remove them. 
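For the run of conversions in these patches, it may help to recall what a
FOLIO_FLAG() declaration provides. Roughly (simplified from the real
page-flags.h macros), FOLIO_FLAG(unevictable, FOLIO_HEAD_PAGE) generates only
folio-based accessors operating on the head page's flags word:

static __always_inline bool folio_test_unevictable(const struct folio *folio)
{
	return test_bit(PG_unevictable, const_folio_flags(folio, 0));
}

static __always_inline void folio_set_unevictable(struct folio *folio)
{
	set_bit(PG_unevictable, folio_flags(folio, 0));
}

static __always_inline void folio_clear_unevictable(struct folio *folio)
{
	clear_bit(PG_unevictable, folio_flags(folio, 0));
}

With the PageUnevictable()/SetPageUnevictable()/ClearPageUnevictable()
wrappers gone, callers must already have a folio in hand.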
Link: https://lkml.kernel.org/r/20240821193445.2294269-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 08cce16c82aa..458d2dc5124d 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -539,8 +539,6 @@ FOLIO_FLAG(swapbacked, FOLIO_HEAD_PAGE) */ PAGEFLAG(Private, private, PF_ANY) PAGEFLAG(Private2, private_2, PF_ANY) TESTSCFLAG(Private2, private_2, PF_ANY) -PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY) - TESTCLEARFLAG(OwnerPriv1, owner_priv_1, PF_ANY) /* * Only test-and-set exist for PG_writeback. The unconditional operators are From 6dc151388e44957d471633ce70d72e3d27ae836d Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 21 Aug 2024 20:34:41 +0100 Subject: [PATCH 235/416] mm: remove page_has_private() This function has no more callers, except folio_has_private(). Combine the two functions. Link: https://lkml.kernel.org/r/20240821193445.2294269-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 15 +++++---------- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 458d2dc5124d..017205048cab 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -1175,20 +1175,15 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page) #define PAGE_FLAGS_PRIVATE \ (1UL << PG_private | 1UL << PG_private_2) /** - * page_has_private - Determine if page has private stuff - * @page: The page to be checked + * folio_has_private - Determine if folio has private stuff + * @folio: The folio to be checked * - * Determine if a page has private stuff, indicating that release routines + * Determine if a folio has private stuff, indicating that release routines * should be invoked upon it. */ -static inline int page_has_private(const struct page *page) +static inline int folio_has_private(const struct folio *folio) { - return !!(page->flags & PAGE_FLAGS_PRIVATE); -} - -static inline bool folio_has_private(const struct folio *folio) -{ - return page_has_private(&folio->page); + return !!(folio->flags & PAGE_FLAGS_PRIVATE); } #undef PF_ANY From 02e1960aafac33721401dcd92e915325fdb524b2 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 21 Aug 2024 20:34:42 +0100 Subject: [PATCH 236/416] mm: rename PG_mappedtodisk to PG_owner_2 This flag has similar constraints to PG_owner_priv_1 -- it is ignored by core code, and is entirely for the use of the code which allocated the folio. Since the pagecache does not use it, individual filesystems can use it. The bufferhead code does use it, so filesystems which use the buffer cache must not use it for another purpose. 
Link: https://lkml.kernel.org/r/20240821193445.2294269-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- fs/proc/page.c | 2 +- include/linux/kernel-page-flags.h | 2 +- include/linux/page-flags.h | 24 ++++++++++++++++-------- include/trace/events/mmflags.h | 2 +- tools/mm/page-types.c | 10 +++++----- 5 files changed, 24 insertions(+), 16 deletions(-) diff --git a/fs/proc/page.c b/fs/proc/page.c index 73a0f872d97f..e74e639893be 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -211,7 +211,7 @@ u64 stable_page_flags(const struct page *page) #endif u |= kpf_copy_bit(k, KPF_RESERVED, PG_reserved); - u |= kpf_copy_bit(k, KPF_MAPPEDTODISK, PG_mappedtodisk); + u |= kpf_copy_bit(k, KPF_OWNER_2, PG_owner_2); u |= kpf_copy_bit(k, KPF_PRIVATE, PG_private); u |= kpf_copy_bit(k, KPF_PRIVATE_2, PG_private_2); u |= kpf_copy_bit(k, KPF_OWNER_PRIVATE, PG_owner_priv_1); diff --git a/include/linux/kernel-page-flags.h b/include/linux/kernel-page-flags.h index 859f4b0c1b2b..7c587a711be1 100644 --- a/include/linux/kernel-page-flags.h +++ b/include/linux/kernel-page-flags.h @@ -10,7 +10,7 @@ */ #define KPF_RESERVED 32 #define KPF_MLOCKED 33 -#define KPF_MAPPEDTODISK 34 +#define KPF_OWNER_2 34 #define KPF_PRIVATE 35 #define KPF_PRIVATE_2 36 #define KPF_OWNER_PRIVATE 37 diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 017205048cab..1ff3d172c22c 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -101,12 +101,12 @@ enum pageflags { PG_waiters, /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */ PG_active, PG_workingset, - PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ + PG_owner_priv_1, /* Owner use. If pagecache, fs may use */ + PG_owner_2, /* Owner use. If pagecache, fs may use */ PG_arch_1, PG_reserved, PG_private, /* If pagecache, has fs-private data */ PG_private_2, /* If pagecache, has fs aux data */ - PG_mappedtodisk, /* Has blocks allocated on-disk */ PG_reclaim, /* To be reclaimed asap */ PG_swapbacked, /* Page is backed by RAM/swap */ PG_unevictable, /* Page is "unevictable" */ @@ -131,6 +131,11 @@ enum pageflags { PG_readahead = PG_reclaim, + /* Anonymous memory (and shmem) */ + PG_swapcache = PG_owner_priv_1, /* Swap page: swp_entry_t in private */ + /* Some filesystems */ + PG_checked = PG_owner_priv_1, + /* * Depending on the way an anonymous folio can be mapped into a page * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped @@ -138,13 +143,13 @@ enum pageflags { * tail pages of an anonymous folio. For now, we only expect it to be * set on tail pages for PTE-mapped THP. */ - PG_anon_exclusive = PG_mappedtodisk, + PG_anon_exclusive = PG_owner_2, - /* Filesystems */ - PG_checked = PG_owner_priv_1, - - /* SwapBacked */ - PG_swapcache = PG_owner_priv_1, /* Swap page: swp_entry_t in private */ + /* + * Set if all buffer heads in the folio are mapped. + * Filesystems which do not use BHs can use it for their own purpose. + */ + PG_mappedtodisk = PG_owner_2, /* Two page bits are conscripted by FS-Cache to maintain local caching * state. These bits are set on pages belonging to the netfs's inodes @@ -540,6 +545,9 @@ FOLIO_FLAG(swapbacked, FOLIO_HEAD_PAGE) PAGEFLAG(Private, private, PF_ANY) PAGEFLAG(Private2, private_2, PF_ANY) TESTSCFLAG(Private2, private_2, PF_ANY) +/* owner_2 can be set on tail pages for anon memory */ +FOLIO_FLAG(owner_2, FOLIO_HEAD_PAGE) + /* * Only test-and-set exist for PG_writeback. 
The unconditional operators are * risky: they bypass page accounting. diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index c151cc21d367..3b51558cdc9b 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -107,13 +107,13 @@ DEF_PAGEFLAG_NAME(active), \ DEF_PAGEFLAG_NAME(workingset), \ DEF_PAGEFLAG_NAME(owner_priv_1), \ + DEF_PAGEFLAG_NAME(owner_2), \ DEF_PAGEFLAG_NAME(arch_1), \ DEF_PAGEFLAG_NAME(reserved), \ DEF_PAGEFLAG_NAME(private), \ DEF_PAGEFLAG_NAME(private_2), \ DEF_PAGEFLAG_NAME(writeback), \ DEF_PAGEFLAG_NAME(head), \ - DEF_PAGEFLAG_NAME(mappedtodisk), \ DEF_PAGEFLAG_NAME(reclaim), \ DEF_PAGEFLAG_NAME(swapbacked), \ DEF_PAGEFLAG_NAME(unevictable) \ diff --git a/tools/mm/page-types.c b/tools/mm/page-types.c index 8d5595b6c59f..8ca41c41105e 100644 --- a/tools/mm/page-types.c +++ b/tools/mm/page-types.c @@ -71,7 +71,7 @@ /* [32-] kernel hacking assistances */ #define KPF_RESERVED 32 #define KPF_MLOCKED 33 -#define KPF_MAPPEDTODISK 34 +#define KPF_OWNER_2 34 #define KPF_PRIVATE 35 #define KPF_PRIVATE_2 36 #define KPF_OWNER_PRIVATE 37 @@ -129,7 +129,7 @@ static const char * const page_flag_names[] = { [KPF_RESERVED] = "r:reserved", [KPF_MLOCKED] = "m:mlocked", - [KPF_MAPPEDTODISK] = "d:mappedtodisk", + [KPF_OWNER_2] = "d:owner_2", [KPF_PRIVATE] = "P:private", [KPF_PRIVATE_2] = "p:private_2", [KPF_OWNER_PRIVATE] = "O:owner_private", @@ -472,9 +472,9 @@ static int bit_mask_ok(uint64_t flags) static uint64_t expand_overloaded_flags(uint64_t flags, uint64_t pme) { - /* Anonymous pages overload PG_mappedtodisk */ - if ((flags & BIT(ANON)) && (flags & BIT(MAPPEDTODISK))) - flags ^= BIT(MAPPEDTODISK) | BIT(ANON_EXCLUSIVE); + /* Anonymous pages use PG_owner_2 for anon_exclusive */ + if ((flags & BIT(ANON)) && (flags & BIT(OWNER_2))) + flags ^= BIT(OWNER_2) | BIT(ANON_EXCLUSIVE); /* SLUB overloads several page flags */ if (flags & BIT(SLAB)) { From 7a87225ae2c6c317c7b80cf599e5cf0eee699196 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 21 Aug 2024 20:34:43 +0100 Subject: [PATCH 237/416] x86: remove PG_uncached Convert x86 to use PG_arch_2 instead of PG_uncached and remove PG_uncached. 
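The functional change is in the x86 PAT code below, which packs one of four
memory types into two page flag bits; after this patch those bits are arch_1
and arch_2 rather than arch_1 and uncached. A simplified restatement of the
encoding (mirroring the _PGMT_* defines in the hunk; the decode helper is a
sketch of the existing get_page_memtype() logic, not new code from the patch):

/*
 *   arch_2  arch_1   PAT memory type
 *     0       0      WB  (write-back)
 *     0       1      WC  (write-combining)
 *     1       0      UC- (uncached minus)
 *     1       1      WT  (write-through)
 */
static inline enum page_cache_mode pgmt_decode(struct page *pg)
{
	unsigned long memtype_flags = pg->flags & _PGMT_MASK;

	if (memtype_flags == _PGMT_WC)
		return _PAGE_CACHE_MODE_WC;
	if (memtype_flags == _PGMT_UC_MINUS)
		return _PAGE_CACHE_MODE_UC_MINUS;
	if (memtype_flags == _PGMT_WT)
		return _PAGE_CACHE_MODE_WT;
	return _PAGE_CACHE_MODE_WB;
}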
Link: https://lkml.kernel.org/r/20240821193445.2294269-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- .../features/vm/PG_uncached/arch-support.txt | 30 ------------------- arch/arm64/Kconfig | 3 +- arch/x86/Kconfig | 5 +--- arch/x86/mm/pat/memtype.c | 8 ++--- fs/proc/page.c | 8 ++--- include/linux/kernel-page-flags.h | 1 - include/linux/page-flags.h | 13 ++------ include/trace/events/mmflags.h | 23 +++++++------- mm/Kconfig | 9 ++---- mm/huge_memory.c | 4 ++- tools/mm/page-types.c | 3 +- 11 files changed, 31 insertions(+), 76 deletions(-) delete mode 100644 Documentation/features/vm/PG_uncached/arch-support.txt diff --git a/Documentation/features/vm/PG_uncached/arch-support.txt b/Documentation/features/vm/PG_uncached/arch-support.txt deleted file mode 100644 index 5a7508b8c967..000000000000 --- a/Documentation/features/vm/PG_uncached/arch-support.txt +++ /dev/null @@ -1,30 +0,0 @@ -# -# Feature name: PG_uncached -# Kconfig: ARCH_USES_PG_UNCACHED -# description: arch supports the PG_uncached page flag -# - ----------------------- - | arch |status| - ----------------------- - | alpha: | TODO | - | arc: | TODO | - | arm: | TODO | - | arm64: | TODO | - | csky: | TODO | - | hexagon: | TODO | - | loongarch: | TODO | - | m68k: | TODO | - | microblaze: | TODO | - | mips: | TODO | - | nios2: | TODO | - | openrisc: | TODO | - | parisc: | TODO | - | powerpc: | TODO | - | riscv: | TODO | - | s390: | TODO | - | sh: | TODO | - | sparc: | TODO | - | um: | TODO | - | x86: | ok | - | xtensa: | TODO | - ----------------------- diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index a2f8ff354ca6..6494848019a0 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2100,7 +2100,8 @@ config ARM64_MTE depends on ARM64_PAN select ARCH_HAS_SUBPAGE_FAULTS select ARCH_USES_HIGH_VMA_FLAGS - select ARCH_USES_PG_ARCH_X + select ARCH_USES_PG_ARCH_2 + select ARCH_USES_PG_ARCH_3 help Memory Tagging (part of the ARMv8.5 Extensions) provides architectural support for run-time, always-on detection of diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index acd9745bf2ae..b74b9ee484da 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1799,6 +1799,7 @@ config X86_PAT def_bool y prompt "x86 PAT support" if EXPERT depends on MTRR + select ARCH_USES_PG_ARCH_2 help Use PAT attributes to setup page level cache control. @@ -1810,10 +1811,6 @@ config X86_PAT If unsure, say Y. -config ARCH_USES_PG_UNCACHED - def_bool y - depends on X86_PAT - config X86_UMIP def_bool y prompt "User Mode Instruction Prevention" if EXPERT diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c index bdc2a240c2aa..1fa0bf6ed295 100644 --- a/arch/x86/mm/pat/memtype.c +++ b/arch/x86/mm/pat/memtype.c @@ -104,7 +104,7 @@ __setup("debugpat", pat_debug_setup); #ifdef CONFIG_X86_PAT /* - * X86 PAT uses page flags arch_1 and uncached together to keep track of + * X86 PAT uses page flags arch_1 and arch_2 together to keep track of * memory type of pages that have backing page struct. 
* * X86 PAT supports 4 different memory types: @@ -118,9 +118,9 @@ __setup("debugpat", pat_debug_setup); #define _PGMT_WB 0 #define _PGMT_WC (1UL << PG_arch_1) -#define _PGMT_UC_MINUS (1UL << PG_uncached) -#define _PGMT_WT (1UL << PG_uncached | 1UL << PG_arch_1) -#define _PGMT_MASK (1UL << PG_uncached | 1UL << PG_arch_1) +#define _PGMT_UC_MINUS (1UL << PG_arch_2) +#define _PGMT_WT (1UL << PG_arch_2 | 1UL << PG_arch_1) +#define _PGMT_MASK (1UL << PG_arch_2 | 1UL << PG_arch_1) #define _PGMT_CLEAR_MASK (~_PGMT_MASK) static inline enum page_cache_mode get_page_memtype(struct page *pg) diff --git a/fs/proc/page.c b/fs/proc/page.c index e74e639893be..a55f5acefa97 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -206,18 +206,16 @@ u64 stable_page_flags(const struct page *page) u |= kpf_copy_bit(page->flags, KPF_HWPOISON, PG_hwpoison); #endif -#ifdef CONFIG_ARCH_USES_PG_UNCACHED - u |= kpf_copy_bit(k, KPF_UNCACHED, PG_uncached); -#endif - u |= kpf_copy_bit(k, KPF_RESERVED, PG_reserved); u |= kpf_copy_bit(k, KPF_OWNER_2, PG_owner_2); u |= kpf_copy_bit(k, KPF_PRIVATE, PG_private); u |= kpf_copy_bit(k, KPF_PRIVATE_2, PG_private_2); u |= kpf_copy_bit(k, KPF_OWNER_PRIVATE, PG_owner_priv_1); u |= kpf_copy_bit(k, KPF_ARCH, PG_arch_1); -#ifdef CONFIG_ARCH_USES_PG_ARCH_X +#ifdef CONFIG_ARCH_USES_PG_ARCH_2 u |= kpf_copy_bit(k, KPF_ARCH_2, PG_arch_2); +#endif +#ifdef CONFIG_ARCH_USES_PG_ARCH_3 u |= kpf_copy_bit(k, KPF_ARCH_3, PG_arch_3); #endif diff --git a/include/linux/kernel-page-flags.h b/include/linux/kernel-page-flags.h index 7c587a711be1..196778a087c4 100644 --- a/include/linux/kernel-page-flags.h +++ b/include/linux/kernel-page-flags.h @@ -15,7 +15,6 @@ #define KPF_PRIVATE_2 36 #define KPF_OWNER_PRIVATE 37 #define KPF_ARCH 38 -#define KPF_UNCACHED 39 #define KPF_SOFTDIRTY 40 #define KPF_ARCH_2 41 #define KPF_ARCH_3 42 diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 1ff3d172c22c..2175ebceb41c 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -113,9 +113,6 @@ enum pageflags { #ifdef CONFIG_MMU PG_mlocked, /* Page is vma mlocked */ #endif -#ifdef CONFIG_ARCH_USES_PG_UNCACHED - PG_uncached, /* Page has been mapped as uncached */ -#endif #ifdef CONFIG_MEMORY_FAILURE PG_hwpoison, /* hardware poisoned page. 
Don't touch */ #endif @@ -123,8 +120,10 @@ enum pageflags { PG_young, PG_idle, #endif -#ifdef CONFIG_ARCH_USES_PG_ARCH_X +#ifdef CONFIG_ARCH_USES_PG_ARCH_2 PG_arch_2, +#endif +#ifdef CONFIG_ARCH_USES_PG_ARCH_3 PG_arch_3, #endif __NR_PAGEFLAGS, @@ -602,12 +601,6 @@ FOLIO_FLAG_FALSE(mlocked) FOLIO_TEST_SET_FLAG_FALSE(mlocked) #endif -#ifdef CONFIG_ARCH_USES_PG_UNCACHED -PAGEFLAG(Uncached, uncached, PF_NO_COMPOUND) -#else -PAGEFLAG_FALSE(Uncached, uncached) -#endif - #ifdef CONFIG_MEMORY_FAILURE PAGEFLAG(HWPoison, hwpoison, PF_ANY) TESTSCFLAG(HWPoison, hwpoison, PF_ANY) diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 3b51558cdc9b..58f2699331b6 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -71,12 +71,6 @@ #define IF_HAVE_PG_MLOCK(_name) #endif -#ifdef CONFIG_ARCH_USES_PG_UNCACHED -#define IF_HAVE_PG_UNCACHED(_name) ,{1UL << PG_##_name, __stringify(_name)} -#else -#define IF_HAVE_PG_UNCACHED(_name) -#endif - #ifdef CONFIG_MEMORY_FAILURE #define IF_HAVE_PG_HWPOISON(_name) ,{1UL << PG_##_name, __stringify(_name)} #else @@ -89,10 +83,16 @@ #define IF_HAVE_PG_IDLE(_name) #endif -#ifdef CONFIG_ARCH_USES_PG_ARCH_X -#define IF_HAVE_PG_ARCH_X(_name) ,{1UL << PG_##_name, __stringify(_name)} +#ifdef CONFIG_ARCH_USES_PG_ARCH_2 +#define IF_HAVE_PG_ARCH_2(_name) ,{1UL << PG_##_name, __stringify(_name)} #else -#define IF_HAVE_PG_ARCH_X(_name) +#define IF_HAVE_PG_ARCH_2(_name) +#endif + +#ifdef CONFIG_ARCH_USES_PG_ARCH_3 +#define IF_HAVE_PG_ARCH_3(_name) ,{1UL << PG_##_name, __stringify(_name)} +#else +#define IF_HAVE_PG_ARCH_3(_name) #endif #define DEF_PAGEFLAG_NAME(_name) { 1UL << PG_##_name, __stringify(_name) } @@ -118,12 +118,11 @@ DEF_PAGEFLAG_NAME(swapbacked), \ DEF_PAGEFLAG_NAME(unevictable) \ IF_HAVE_PG_MLOCK(mlocked) \ -IF_HAVE_PG_UNCACHED(uncached) \ IF_HAVE_PG_HWPOISON(hwpoison) \ IF_HAVE_PG_IDLE(idle) \ IF_HAVE_PG_IDLE(young) \ -IF_HAVE_PG_ARCH_X(arch_2) \ -IF_HAVE_PG_ARCH_X(arch_3) +IF_HAVE_PG_ARCH_2(arch_2) \ +IF_HAVE_PG_ARCH_3(arch_3) #define show_page_flags(flags) \ (flags) ? __print_flags(flags, "|", \ diff --git a/mm/Kconfig b/mm/Kconfig index 5946dcdcaeda..8078a4b3c509 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1079,13 +1079,10 @@ config ARCH_USES_HIGH_VMA_FLAGS config ARCH_HAS_PKEYS bool -config ARCH_USES_PG_ARCH_X +config ARCH_USES_PG_ARCH_2 + bool +config ARCH_USES_PG_ARCH_3 bool - help - Enable the definition of PG_arch_x page flags with x > 1. Only - suitable for 64-bit architectures with CONFIG_FLATMEM or - CONFIG_SPARSEMEM_VMEMMAP enabled, otherwise there may not be - enough room for additional bits in page->flags. 
config VM_EVENT_COUNTERS default y diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8dfb912ebdaa..c4b45ad576f6 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2993,8 +2993,10 @@ static void __split_huge_page_tail(struct folio *folio, int tail, (1L << PG_workingset) | (1L << PG_locked) | (1L << PG_unevictable) | -#ifdef CONFIG_ARCH_USES_PG_ARCH_X +#ifdef CONFIG_ARCH_USES_PG_ARCH_2 (1L << PG_arch_2) | +#endif +#ifdef CONFIG_ARCH_USES_PG_ARCH_3 (1L << PG_arch_3) | #endif (1L << PG_dirty) | diff --git a/tools/mm/page-types.c b/tools/mm/page-types.c index 8ca41c41105e..fa050d5a48cd 100644 --- a/tools/mm/page-types.c +++ b/tools/mm/page-types.c @@ -76,7 +76,7 @@ #define KPF_PRIVATE_2 36 #define KPF_OWNER_PRIVATE 37 #define KPF_ARCH 38 -#define KPF_UNCACHED 39 +#define KPF_UNCACHED 39 /* unused */ #define KPF_SOFTDIRTY 40 #define KPF_ARCH_2 41 @@ -134,7 +134,6 @@ static const char * const page_flag_names[] = { [KPF_PRIVATE_2] = "p:private_2", [KPF_OWNER_PRIVATE] = "O:owner_private", [KPF_ARCH] = "h:arch", - [KPF_UNCACHED] = "c:uncached", [KPF_SOFTDIRTY] = "f:softdirty", [KPF_ARCH_2] = "H:arch_2", From c41a701d18efe6b8aa402efab16edbaba50c9548 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 21 Aug 2024 14:31:15 +0200 Subject: [PATCH 238/416] selftests/mm: fix charge_reserved_hugetlb.sh test Currently, running the charge_reserved_hugetlb.sh selftest we can sometimes observe something like: $ ./charge_reserved_hugetlb.sh -cgroup-v2 ... write_result is 0 After write: hugetlb_usage=0 reserved_usage=10485760 killing write_to_hugetlbfs Received 2. Deleting the memory Detach failure: Invalid argument umount: /mnt/huge: target is busy. Both cases are issues in the test. While the unmount error seems to be racy, it will make the test fail: $ ./run_vmtests.sh -t hugetlb ... # [FAIL] not ok 10 charge_reserved_hugetlb.sh -cgroup-v2 # exit=32 The issue is that we are not waiting for the write_to_hugetlbfs process to quit. So it might still have a hugetlbfs file open, about which umount is not happy. Fix that by making "killall" wait for the process to quit. The other error ("Detach failure: Invalid argument") does not seem to result in a test error, but is misleading. Turns out write_to_hugetlbfs.c unconditionally tries to cleanup using shmdt(), even when we only mmap()'ed a hugetlb file. Even worse, shmaddr is never even set for the SHM case. Fix that as well. With this change it seems to work as expected. 
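On the "Detach failure: Invalid argument" message: shmdt() fails with EINVAL
when the address passed was never attached via shmat(), and in the mmap()
path shmaddr stayed NULL, so the unconditional cleanup always tripped over
this. A minimal standalone illustration (not part of the patch):

#include <stdio.h>
#include <sys/shm.h>

int main(void)
{
	/* Nothing was ever attached with shmat(), so there is nothing
	 * to detach; shmdt() reports EINVAL. */
	if (shmdt(NULL) != 0)
		perror("Detach failure");  /* "Detach failure: Invalid argument" */
	return 0;
}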
Link: https://lkml.kernel.org/r/20240821123115.2068812-1-david@redhat.com Fixes: 29750f71a9b4 ("hugetlb_cgroup: add hugetlb_cgroup reservation tests") Signed-off-by: David Hildenbrand Reported-by: Mario Casquero Reviewed-by: Mina Almasry Tested-by: Mario Casquero Cc: Shuah Khan Cc: Muchun Song Signed-off-by: Andrew Morton --- .../selftests/mm/charge_reserved_hugetlb.sh | 2 +- .../testing/selftests/mm/write_to_hugetlbfs.c | 23 +++++++++++-------- 2 files changed, 14 insertions(+), 11 deletions(-) diff --git a/tools/testing/selftests/mm/charge_reserved_hugetlb.sh b/tools/testing/selftests/mm/charge_reserved_hugetlb.sh index d680c00d2853..67df7b47087f 100755 --- a/tools/testing/selftests/mm/charge_reserved_hugetlb.sh +++ b/tools/testing/selftests/mm/charge_reserved_hugetlb.sh @@ -254,7 +254,7 @@ function cleanup_hugetlb_memory() { local cgroup="$1" if [[ "$(pgrep -f write_to_hugetlbfs)" != "" ]]; then echo killing write_to_hugetlbfs - killall -2 write_to_hugetlbfs + killall -2 --wait write_to_hugetlbfs wait_for_hugetlb_memory_to_get_depleted $cgroup fi set -e diff --git a/tools/testing/selftests/mm/write_to_hugetlbfs.c b/tools/testing/selftests/mm/write_to_hugetlbfs.c index 6a2caba19ee1..1289d311efd7 100644 --- a/tools/testing/selftests/mm/write_to_hugetlbfs.c +++ b/tools/testing/selftests/mm/write_to_hugetlbfs.c @@ -28,7 +28,7 @@ enum method { /* Global variables. */ static const char *self; -static char *shmaddr; +static int *shmaddr; static int shmid; /* @@ -47,15 +47,17 @@ void sig_handler(int signo) { printf("Received %d.\n", signo); if (signo == SIGINT) { - printf("Deleting the memory\n"); - if (shmdt((const void *)shmaddr) != 0) { - perror("Detach failure"); - shmctl(shmid, IPC_RMID, NULL); - exit(4); - } + if (shmaddr) { + printf("Deleting the memory\n"); + if (shmdt((const void *)shmaddr) != 0) { + perror("Detach failure"); + shmctl(shmid, IPC_RMID, NULL); + exit(4); + } - shmctl(shmid, IPC_RMID, NULL); - printf("Done deleting the memory\n"); + shmctl(shmid, IPC_RMID, NULL); + printf("Done deleting the memory\n"); + } } exit(2); } @@ -211,7 +213,8 @@ int main(int argc, char **argv) shmctl(shmid, IPC_RMID, NULL); exit(2); } - printf("shmaddr: %p\n", ptr); + shmaddr = ptr; + printf("shmaddr: %p\n", shmaddr); break; default: From 0692fad55d3c1ed3766a1e8627af15b951097894 Mon Sep 17 00:00:00 2001 From: Yuesong Li Date: Wed, 21 Aug 2024 14:31:12 +0800 Subject: [PATCH 239/416] mm:page-writeback: use folio_next_index() helper in writeback_iter() Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using the existing helper folio_next_index(). 
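For reference, folio_next_index() is essentially a named version of that
expression (sketch of the helper in pagemap.h):

static inline pgoff_t folio_next_index(struct folio *folio)
{
	/* Index of the first page just past this folio. */
	return folio->index + folio_nr_pages(folio);
}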
Link: https://lkml.kernel.org/r/20240821063112.4053157-1-liyuesong@vivo.com Signed-off-by: Yuesong Li Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- mm/page-writeback.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 4430ac68e4c4..f5448311c89e 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2612,7 +2612,7 @@ struct folio *writeback_iter(struct address_space *mapping, done: if (wbc->range_cyclic) - mapping->writeback_index = folio->index + folio_nr_pages(folio); + mapping->writeback_index = folio_next_index(folio); folio_batch_release(&wbc->fbatch); return NULL; } From b843786b0bd01ced7fcdbf3b033d68db2f7c61b2 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Thu, 22 Aug 2024 13:24:58 +0200 Subject: [PATCH 240/416] mm: swapfile: fix SSD detection with swapfile on btrfs We've been noticing a trend of significant lock contention in the swap subsystem as core counts have been increasing in our fleet. It turns out that our swapfiles on btrfs on flash were in fact using the old swap code for rotational storage. This turns out to be a detection issue in the swapon sequence: btrfs sets si->bdev during swap activation, which currently happens *after* swapon's SSD detection and cluster setup. Thus, none of the SSD optimizations and cluster lock splitting are enabled for btrfs swap. Rearrange the swapon sequence so that filesystem activation happens *before* determining swap behavior based on the backing device. Afterwards, the nonrotational drive is detected correctly: - Adding 2097148k swap on /mnt/swapfile. Priority:-3 extents:1 across:2097148k + Adding 2097148k swap on /mnt/swapfile. Priority:-3 extents:1 across:2097148k SS Link: https://lkml.kernel.org/r/20240822112707.351844-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Cc: "Huang, Ying" Cc: Hugh Dickins Signed-off-by: Andrew Morton --- mm/swapfile.c | 167 ++++++++++++++++++++++++++------------------------ 1 file changed, 87 insertions(+), 80 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 8ff58be40544..03dc77457510 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3198,29 +3198,15 @@ static unsigned long read_swap_header(struct swap_info_struct *si, static int setup_swap_map_and_extents(struct swap_info_struct *si, union swap_header *swap_header, unsigned char *swap_map, - struct swap_cluster_info *cluster_info, unsigned long maxpages, sector_t *span) { - unsigned int j, k; unsigned int nr_good_pages; + unsigned long i; int nr_extents; - unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER); - unsigned long col = si->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS; - unsigned long i, idx; nr_good_pages = maxpages - 1; /* omit header page */ - INIT_LIST_HEAD(&si->free_clusters); - INIT_LIST_HEAD(&si->full_clusters); - INIT_LIST_HEAD(&si->discard_clusters); - - for (i = 0; i < SWAP_NR_ORDERS; i++) { - INIT_LIST_HEAD(&si->nonfull_clusters[i]); - INIT_LIST_HEAD(&si->frag_clusters[i]); - si->frag_cluster_nr[i] = 0; - } - for (i = 0; i < swap_header->info.nr_badpages; i++) { unsigned int page_nr = swap_header->info.badpages[i]; if (page_nr == 0 || page_nr > swap_header->info.last_page) @@ -3228,25 +3214,11 @@ static int setup_swap_map_and_extents(struct swap_info_struct *si, if (page_nr < maxpages) { swap_map[page_nr] = SWAP_MAP_BAD; nr_good_pages--; - /* - * Haven't marked the cluster free yet, no list - * operation involved - */ - inc_cluster_info_page(si, cluster_info, page_nr); } } - /* Haven't marked the cluster free 
yet, no list operation involved */ - for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) - inc_cluster_info_page(si, cluster_info, i); - if (nr_good_pages) { swap_map[0] = SWAP_MAP_BAD; - /* - * Not mark the cluster free yet, no list - * operation involved - */ - inc_cluster_info_page(si, cluster_info, 0); si->max = maxpages; si->pages = nr_good_pages; nr_extents = setup_swap_extents(si, span); @@ -3259,8 +3231,70 @@ static int setup_swap_map_and_extents(struct swap_info_struct *si, return -EINVAL; } + return nr_extents; +} + +static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si, + union swap_header *swap_header, + unsigned long maxpages) +{ + unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER); + unsigned long col = si->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS; + struct swap_cluster_info *cluster_info; + unsigned long i, j, k, idx; + int cpu, err = -ENOMEM; + + cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL); if (!cluster_info) - return nr_extents; + goto err; + + for (i = 0; i < nr_clusters; i++) + spin_lock_init(&cluster_info[i].lock); + + si->cluster_next_cpu = alloc_percpu(unsigned int); + if (!si->cluster_next_cpu) + goto err_free; + + /* Random start position to help with wear leveling */ + for_each_possible_cpu(cpu) + per_cpu(*si->cluster_next_cpu, cpu) = + get_random_u32_inclusive(1, si->highest_bit); + + si->percpu_cluster = alloc_percpu(struct percpu_cluster); + if (!si->percpu_cluster) + goto err_free; + + for_each_possible_cpu(cpu) { + struct percpu_cluster *cluster; + + cluster = per_cpu_ptr(si->percpu_cluster, cpu); + for (i = 0; i < SWAP_NR_ORDERS; i++) + cluster->next[i] = SWAP_NEXT_INVALID; + } + + /* + * Mark unusable pages as unavailable. The clusters aren't + * marked free yet, so no list operations are involved yet. + * + * See setup_swap_map_and_extents(): header page, bad pages, + * and the EOF part of the last cluster. 
+ */ + inc_cluster_info_page(si, cluster_info, 0); + for (i = 0; i < swap_header->info.nr_badpages; i++) + inc_cluster_info_page(si, cluster_info, + swap_header->info.badpages[i]); + for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) + inc_cluster_info_page(si, cluster_info, i); + + INIT_LIST_HEAD(&si->free_clusters); + INIT_LIST_HEAD(&si->full_clusters); + INIT_LIST_HEAD(&si->discard_clusters); + + for (i = 0; i < SWAP_NR_ORDERS; i++) { + INIT_LIST_HEAD(&si->nonfull_clusters[i]); + INIT_LIST_HEAD(&si->frag_clusters[i]); + si->frag_cluster_nr[i] = 0; + } /* * Reduce false cache line sharing between cluster_info and @@ -3283,7 +3317,13 @@ static int setup_swap_map_and_extents(struct swap_info_struct *si, list_add_tail(&ci->list, &si->free_clusters); } } - return nr_extents; + + return cluster_info; + +err_free: + kvfree(cluster_info); +err: + return ERR_PTR(err); } SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) @@ -3379,6 +3419,17 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) goto bad_swap_unlock_inode; } + error = swap_cgroup_swapon(si->type, maxpages); + if (error) + goto bad_swap_unlock_inode; + + nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map, + maxpages, &span); + if (unlikely(nr_extents < 0)) { + error = nr_extents; + goto bad_swap_unlock_inode; + } + if (si->bdev && bdev_stable_writes(si->bdev)) si->flags |= SWP_STABLE_WRITES; @@ -3386,63 +3437,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) si->flags |= SWP_SYNCHRONOUS_IO; if (si->bdev && bdev_nonrot(si->bdev)) { - int cpu, i; - unsigned long ci, nr_cluster; - si->flags |= SWP_SOLIDSTATE; - si->cluster_next_cpu = alloc_percpu(unsigned int); - if (!si->cluster_next_cpu) { - error = -ENOMEM; + + cluster_info = setup_clusters(si, swap_header, maxpages); + if (IS_ERR(cluster_info)) { + error = PTR_ERR(cluster_info); + cluster_info = NULL; goto bad_swap_unlock_inode; } - /* - * select a random position to start with to help wear leveling - * SSD - */ - for_each_possible_cpu(cpu) { - per_cpu(*si->cluster_next_cpu, cpu) = - get_random_u32_inclusive(1, si->highest_bit); - } - nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER); - - cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info), - GFP_KERNEL); - if (!cluster_info) { - error = -ENOMEM; - goto bad_swap_unlock_inode; - } - - for (ci = 0; ci < nr_cluster; ci++) - spin_lock_init(&((cluster_info + ci)->lock)); - - si->percpu_cluster = alloc_percpu(struct percpu_cluster); - if (!si->percpu_cluster) { - error = -ENOMEM; - goto bad_swap_unlock_inode; - } - for_each_possible_cpu(cpu) { - struct percpu_cluster *cluster; - - cluster = per_cpu_ptr(si->percpu_cluster, cpu); - for (i = 0; i < SWAP_NR_ORDERS; i++) - cluster->next[i] = SWAP_NEXT_INVALID; - } } else { atomic_inc(&nr_rotate_swap); inced_nr_rotate_swap = true; } - error = swap_cgroup_swapon(si->type, maxpages); - if (error) - goto bad_swap_unlock_inode; - - nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map, - cluster_info, maxpages, &span); - if (unlikely(nr_extents < 0)) { - error = nr_extents; - goto bad_swap_unlock_inode; - } - if ((swap_flags & SWAP_FLAG_DISCARD) && si->bdev && bdev_max_discard_sectors(si->bdev)) { /* From 435b3894e74256cc93772b7e586d848a5c3433ee Mon Sep 17 00:00:00 2001 From: Zhongkun He Date: Thu, 22 Aug 2024 17:26:12 +0800 Subject: [PATCH 241/416] mm:page_alloc: fix the NULL ac->nodemask in __alloc_pages_slowpath() should_reclaim_retry() is not ALLOC_CPUSET aware and 
that means that it considers reclaimability of NUMA nodes which are outside of the cpuset. If other nodes have a lot of reclaimable memory then should_reclaim_retry would instruct page allocator to retry even though there is no memory reclaimable on the cpuset nodemask. This is not really a huge problem because the number of retries without any reclaim progress is bound but it could be certainly improved. This is a cold path so this shouldn't really have a measurable impact on performance on most workloads. 1.Test step and the machines. ------------ root@vm:/sys/fs/cgroup/test# numactl -H | grep size node 0 size: 9477 MB node 1 size: 10079 MB node 2 size: 10079 MB node 3 size: 10078 MB root@vm:/sys/fs/cgroup/test# cat cpuset.mems 2 root@vm:/sys/fs/cgroup/test# stress --vm 1 --vm-bytes 12g --vm-keep stress: info: [33430] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd stress: FAIL: [33430] (425) <-- worker 33431 got signal 9 stress: WARN: [33430] (427) now reaping child worker processes stress: FAIL: [33430] (461) failed run completed in 2s 2. reclaim_retry_zone info: We can only alloc pages from node=2, but the reclaim_retry_zone is node=0 and return true. root@vm:/sys/kernel/debug/tracing# cat trace stress-33431 [001] ..... 13223.617311: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=1 wmark_check=1 stress-33431 [001] ..... 13223.617682: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=2 wmark_check=1 stress-33431 [001] ..... 13223.618103: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=3 wmark_check=1 stress-33431 [001] ..... 13223.618454: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=4 wmark_check=1 stress-33431 [001] ..... 13223.618770: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=5 wmark_check=1 stress-33431 [001] ..... 13223.619150: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=6 wmark_check=1 stress-33431 [001] ..... 13223.619510: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=7 wmark_check=1 stress-33431 [001] ..... 13223.619850: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=8 wmark_check=1 stress-33431 [001] ..... 13223.620171: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=9 wmark_check=1 stress-33431 [001] ..... 13223.620533: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=10 wmark_check=1 stress-33431 [001] ..... 13223.620894: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=11 wmark_check=1 stress-33431 [001] ..... 13223.621224: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=12 wmark_check=1 stress-33431 [001] ..... 13223.621551: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=13 wmark_check=1 stress-33431 [001] ..... 
13223.621847: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=14 wmark_check=1 stress-33431 [001] ..... 13223.622200: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=15 wmark_check=1 stress-33431 [001] ..... 13223.622580: reclaim_retry_zone: node=0 zone=Normal order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=16 wmark_check=1 With this patch, we can check the right node and get less retry in __alloc_pages_slowpath() because there is nothing to do. Link: https://lkml.kernel.org/r/20240822092612.3209286-1-hezhongkun.hzk@bytedance.com Signed-off-by: Zhongkun He Suggested-by: Michal Hocko Acked-by: Michal Hocko Cc: Johannes Weiner Cc: Mel Gorman Cc: Zefan Li Signed-off-by: Andrew Morton --- mm/compaction.c | 6 ++++++ mm/page_alloc.c | 5 +++++ 2 files changed, 11 insertions(+) diff --git a/mm/compaction.c b/mm/compaction.c index d1041fbce679..a2b16b08cbbf 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -23,6 +23,7 @@ #include #include #include +#include #include "internal.h" #ifdef CONFIG_COMPACTION @@ -2822,6 +2823,11 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, ac->highest_zoneidx, ac->nodemask) { enum compact_result status; + if (cpusets_enabled() && + (alloc_flags & ALLOC_CPUSET) && + !__cpuset_zone_allowed(zone, gfp_mask)) + continue; + if (prio > MIN_COMPACT_PRIORITY && compaction_deferred(zone, order)) { rc = max_t(enum compact_result, COMPACT_DEFERRED, rc); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index da603080214c..6dd0b409e3c2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4092,6 +4092,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, unsigned long min_wmark = min_wmark_pages(zone); bool wmark; + if (cpusets_enabled() && + (alloc_flags & ALLOC_CPUSET) && + !__cpuset_zone_allowed(zone, gfp_mask)) + continue; + available = reclaimable = zone_reclaimable_pages(zone); available += zone_page_state_snapshot(zone, NR_FREE_PAGES); From 0ca0c24e3211586d44a31b3c4767fe3f43a008a7 Mon Sep 17 00:00:00 2001 From: Usama Arif Date: Fri, 23 Aug 2024 20:04:39 +0100 Subject: [PATCH 242/416] mm: store zero pages to be swapped out in a bitmap Patch series "mm: store zero pages to be swapped out in a bitmap", v8. As shown in the patch series that introduced the zswap same-filled optimization [1], 10-20% of the pages stored in zswap are same-filled. This is also observed across Meta's server fleet. By using VM counters in swap_writepage (not included in this patchseries) it was found that less than 1% of the same-filled pages to be swapped out are non-zero pages. For conventional swap setup (without zswap), rather than reading/writing these pages to flash resulting in increased I/O and flash wear, a bitmap can be used to mark these pages as zero at write time, and the pages can be filled at read time if the bit corresponding to the page is set. When using zswap with swap, this also means that a zswap_entry does not need to be allocated for zero filled pages resulting in memory savings which would offset the memory used for the bitmap. A similar attempt was made earlier in [2] where zswap would only track zero-filled pages instead of same-filled. This patchseries adds zero-filled pages optimization to swap (hence it can be used even if zswap is disabled) and removes the same-filled code from zswap (as only 1% of the same-filled pages are non-zero), simplifying code. 
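To put the bitmap cost into perspective (rough arithmetic, not from the patch
itself): the zeromap needs one bit per swap slot, so a 2 GiB swap device with
4 KiB pages has 2 GiB / 4 KiB = 524288 slots and therefore needs 524288 bits
= 64 KiB of bitmap, independent of how many pages are actually zero-filled.
That is small next to keeping per-entry zswap metadata for every same-filled
page the bitmap replaces.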
[1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/ [2] https://lore.kernel.org/lkml/20240325235018.2028408-1-yosryahmed@google.com/ This patch (of 2): Approximately 10-20% of pages to be swapped out are zero pages [1]. Rather than reading/writing these pages to flash resulting in increased I/O and flash wear, a bitmap can be used to mark these pages as zero at write time, and the pages can be filled at read time if the bit corresponding to the page is set. With this patch, NVMe writes in Meta server fleet decreased by almost 10% with conventional swap setup (zswap disabled). [1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/ Link: https://lkml.kernel.org/r/20240823190545.979059-1-usamaarif642@gmail.com Link: https://lkml.kernel.org/r/20240823190545.979059-2-usamaarif642@gmail.com Signed-off-by: Usama Arif Reviewed-by: Chengming Zhou Reviewed-by: Yosry Ahmed Reviewed-by: Nhat Pham Acked-by: Johannes Weiner Cc: David Hildenbrand Cc: Hugh Dickins Cc: Matthew Wilcox (Oracle) Cc: Shakeel Butt Cc: Usama Arif Cc: Andi Kleen Signed-off-by: Andrew Morton --- include/linux/swap.h | 1 + mm/page_io.c | 118 ++++++++++++++++++++++++++++++++++++++++++- mm/swapfile.c | 38 ++++++++++++-- 3 files changed, 151 insertions(+), 6 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 248db1dd7812..ca533b478c21 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -296,6 +296,7 @@ struct swap_info_struct { signed char type; /* strange name for an index */ unsigned int max; /* extent of the swap_map */ unsigned char *swap_map; /* vmalloc'ed array of usage counts */ + unsigned long *zeromap; /* kvmalloc'ed bitmap to track zero pages */ struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ struct list_head free_clusters; /* free clusters list */ struct list_head full_clusters; /* full clusters list */ diff --git a/mm/page_io.c b/mm/page_io.c index a00e2f615118..b6f1519d63b0 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -172,6 +172,80 @@ bad_bmap: goto out; } +static bool is_folio_zero_filled(struct folio *folio) +{ + unsigned int pos, last_pos; + unsigned long *data; + unsigned int i; + + last_pos = PAGE_SIZE / sizeof(*data) - 1; + for (i = 0; i < folio_nr_pages(folio); i++) { + data = kmap_local_folio(folio, i * PAGE_SIZE); + /* + * Check last word first, incase the page is zero-filled at + * the start and has non-zero data at the end, which is common + * in real-world workloads. + */ + if (data[last_pos]) { + kunmap_local(data); + return false; + } + for (pos = 0; pos < last_pos; pos++) { + if (data[pos]) { + kunmap_local(data); + return false; + } + } + kunmap_local(data); + } + + return true; +} + +static void swap_zeromap_folio_set(struct folio *folio) +{ + struct swap_info_struct *sis = swp_swap_info(folio->swap); + swp_entry_t entry; + unsigned int i; + + for (i = 0; i < folio_nr_pages(folio); i++) { + entry = page_swap_entry(folio_page(folio, i)); + set_bit(swp_offset(entry), sis->zeromap); + } +} + +static void swap_zeromap_folio_clear(struct folio *folio) +{ + struct swap_info_struct *sis = swp_swap_info(folio->swap); + swp_entry_t entry; + unsigned int i; + + for (i = 0; i < folio_nr_pages(folio); i++) { + entry = page_swap_entry(folio_page(folio, i)); + clear_bit(swp_offset(entry), sis->zeromap); + } +} + +/* + * Return the index of the first subpage which is not zero-filled + * according to swap_info_struct->zeromap. 
+ * If all pages are zero-filled according to zeromap, it will return + * folio_nr_pages(folio). + */ +static unsigned int swap_zeromap_folio_test(struct folio *folio) +{ + struct swap_info_struct *sis = swp_swap_info(folio->swap); + swp_entry_t entry; + unsigned int i; + + for (i = 0; i < folio_nr_pages(folio); i++) { + entry = page_swap_entry(folio_page(folio, i)); + if (!test_bit(swp_offset(entry), sis->zeromap)) + return i; + } + return i; +} + /* * We may have stale swap cache pages in memory: notice * them here and get rid of the unnecessary final write. @@ -195,6 +269,25 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) folio_unlock(folio); return ret; } + + /* + * Use a bitmap (zeromap) to avoid doing IO for zero-filled pages. + * The bits in zeromap are protected by the locked swapcache folio + * and atomic updates are used to protect against read-modify-write + * corruption due to other zero swap entries seeing concurrent updates. + */ + if (is_folio_zero_filled(folio)) { + swap_zeromap_folio_set(folio); + folio_unlock(folio); + return 0; + } else { + /* + * Clear bits this folio occupies in the zeromap to prevent + * zero data being read in from any previous zero writes that + * occupied the same swap entries. + */ + swap_zeromap_folio_clear(folio); + } if (zswap_store(folio)) { folio_unlock(folio); return 0; @@ -427,6 +520,26 @@ static void sio_read_complete(struct kiocb *iocb, long ret) mempool_free(sio, sio_pool); } +static bool swap_read_folio_zeromap(struct folio *folio) +{ + unsigned int idx = swap_zeromap_folio_test(folio); + + if (idx == 0) + return false; + + /* + * Swapping in a large folio that is partially in the zeromap is not + * currently handled. Return true without marking the folio uptodate so + * that an IO error is emitted (e.g. do_swap_page() will sigbus). + */ + if (WARN_ON_ONCE(idx < folio_nr_pages(folio))) + return true; + + folio_zero_range(folio, 0, folio_size(folio)); + folio_mark_uptodate(folio); + return true; +} + static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug) { struct swap_info_struct *sis = swp_swap_info(folio->swap); @@ -517,7 +630,10 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug) } delayacct_swapin_start(); - if (zswap_load(folio)) { + if (swap_read_folio_zeromap(folio)) { + folio_unlock(folio); + goto finish; + } else if (zswap_load(folio)) { folio_unlock(folio); goto finish; } diff --git a/mm/swapfile.c b/mm/swapfile.c index 03dc77457510..8e317c495bfb 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -940,6 +940,14 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, unsigned long begin = offset; unsigned long end = offset + nr_entries - 1; void (*swap_slot_free_notify)(struct block_device *, unsigned long); + unsigned int i; + + /* + * Use atomic clear_bit operations only on zeromap instead of non-atomic + * bitmap_clear to prevent adjacent bits corruption due to simultaneous writes. 
+ */ + for (i = 0; i < nr_entries; i++) + clear_bit(offset + i, si->zeromap); if (offset < si->lowest_bit) si->lowest_bit = offset; @@ -2609,7 +2617,8 @@ static int swap_node(struct swap_info_struct *si) static void setup_swap_info(struct swap_info_struct *si, int prio, unsigned char *swap_map, - struct swap_cluster_info *cluster_info) + struct swap_cluster_info *cluster_info, + unsigned long *zeromap) { int i; @@ -2634,6 +2643,7 @@ static void setup_swap_info(struct swap_info_struct *si, int prio, } si->swap_map = swap_map; si->cluster_info = cluster_info; + si->zeromap = zeromap; } static void _enable_swap_info(struct swap_info_struct *si) @@ -2662,11 +2672,12 @@ static void _enable_swap_info(struct swap_info_struct *si) static void enable_swap_info(struct swap_info_struct *si, int prio, unsigned char *swap_map, - struct swap_cluster_info *cluster_info) + struct swap_cluster_info *cluster_info, + unsigned long *zeromap) { spin_lock(&swap_lock); spin_lock(&si->lock); - setup_swap_info(si, prio, swap_map, cluster_info); + setup_swap_info(si, prio, swap_map, cluster_info, zeromap); spin_unlock(&si->lock); spin_unlock(&swap_lock); /* @@ -2684,7 +2695,7 @@ static void reinsert_swap_info(struct swap_info_struct *si) { spin_lock(&swap_lock); spin_lock(&si->lock); - setup_swap_info(si, si->prio, si->swap_map, si->cluster_info); + setup_swap_info(si, si->prio, si->swap_map, si->cluster_info, si->zeromap); _enable_swap_info(si); spin_unlock(&si->lock); spin_unlock(&swap_lock); @@ -2709,6 +2720,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) { struct swap_info_struct *p = NULL; unsigned char *swap_map; + unsigned long *zeromap; struct swap_cluster_info *cluster_info; struct file *swap_file, *victim; struct address_space *mapping; @@ -2831,6 +2843,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) p->max = 0; swap_map = p->swap_map; p->swap_map = NULL; + zeromap = p->zeromap; + p->zeromap = NULL; cluster_info = p->cluster_info; p->cluster_info = NULL; spin_unlock(&p->lock); @@ -2843,6 +2857,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) free_percpu(p->cluster_next_cpu); p->cluster_next_cpu = NULL; vfree(swap_map); + kvfree(zeromap); kvfree(cluster_info); /* Destroy swap account information */ swap_cgroup_swapoff(p->type); @@ -3340,6 +3355,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) sector_t span; unsigned long maxpages; unsigned char *swap_map = NULL; + unsigned long *zeromap = NULL; struct swap_cluster_info *cluster_info = NULL; struct page *page = NULL; struct inode *inode = NULL; @@ -3430,6 +3446,17 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) goto bad_swap_unlock_inode; } + /* + * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might + * be above MAX_PAGE_ORDER incase of a large swap file. + */ + zeromap = kvmalloc_array(BITS_TO_LONGS(maxpages), sizeof(long), + GFP_KERNEL | __GFP_ZERO); + if (!zeromap) { + error = -ENOMEM; + goto bad_swap_unlock_inode; + } + if (si->bdev && bdev_stable_writes(si->bdev)) si->flags |= SWP_STABLE_WRITES; @@ -3505,7 +3532,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) if (swap_flags & SWAP_FLAG_PREFER) prio = (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT; - enable_swap_info(si, prio, swap_map, cluster_info); + enable_swap_info(si, prio, swap_map, cluster_info, zeromap); pr_info("Adding %uk swap on %s. 
Priority:%d extents:%d across:%lluk %s%s%s%s\n", K(si->pages), name->name, si->prio, nr_extents, @@ -3540,6 +3567,7 @@ bad_swap: si->flags = 0; spin_unlock(&swap_lock); vfree(swap_map); + kvfree(zeromap); kvfree(cluster_info); if (inced_nr_rotate_swap) atomic_dec(&nr_rotate_swap); From 20a5532ffa53d6ecf41ded920a7b0ff9c65a7dcf Mon Sep 17 00:00:00 2001 From: Usama Arif Date: Fri, 23 Aug 2024 20:04:40 +0100 Subject: [PATCH 243/416] mm: remove code to handle same filled pages With an earlier commit to handle zero-filled pages in swap directly, and with only 1% of the same-filled pages being non-zero, zswap no longer needs to handle same-filled pages and can just work on compressed pages. Link: https://lkml.kernel.org/r/20240823190545.979059-3-usamaarif642@gmail.com Signed-off-by: Usama Arif Reviewed-by: Chengming Zhou Acked-by: Yosry Ahmed Reviewed-by: Nhat Pham Acked-by: Johannes Weiner Cc: Andi Kleen Cc: David Hildenbrand Cc: Hugh Dickins Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Shakeel Butt Signed-off-by: Andrew Morton --- mm/zswap.c | 85 +++++------------------------------------------------- 1 file changed, 8 insertions(+), 77 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index df66ab102d27..449914ea9919 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -44,8 +44,6 @@ **********************************/ /* The number of compressed pages currently stored in zswap */ atomic_t zswap_stored_pages = ATOMIC_INIT(0); -/* The number of same-value filled pages currently stored in zswap */ -static atomic_t zswap_same_filled_pages = ATOMIC_INIT(0); /* * The statistics below are not protected from concurrent access for @@ -185,8 +183,7 @@ static struct shrinker *zswap_shrinker; * * swpentry - associated swap entry, the offset indexes into the red-black tree * length - the length in bytes of the compressed page data. Needed during - * decompression. For a same value filled page length is 0, and both - * pool and lru are invalid and must be ignored. + * decompression. * referenced - true if the entry recently entered the zswap pool. Unset by the * writeback logic. The entry is only reclaimed by the writeback * logic if referenced is unset. See comments in the shrinker @@ -202,10 +199,7 @@ struct zswap_entry { unsigned int length; bool referenced; struct zswap_pool *pool; - union { - unsigned long handle; - unsigned long value; - }; + unsigned long handle; struct obj_cgroup *objcg; struct list_head lru; }; @@ -801,13 +795,9 @@ static void zswap_entry_cache_free(struct zswap_entry *entry) */ static void zswap_entry_free(struct zswap_entry *entry) { - if (!entry->length) - atomic_dec(&zswap_same_filled_pages); - else { - zswap_lru_del(&zswap_list_lru, entry); - zpool_free(entry->pool->zpool, entry->handle); - zswap_pool_put(entry->pool); - } + zswap_lru_del(&zswap_list_lru, entry); + zpool_free(entry->pool->zpool, entry->handle); + zswap_pool_put(entry->pool); if (entry->objcg) { obj_cgroup_uncharge_zswap(entry->objcg, entry->length); obj_cgroup_put(entry->objcg); @@ -1277,11 +1267,6 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker, * This ensures that the better zswap compresses memory, the fewer * pages we will evict to swap (as it will otherwise incur IO for * relatively small memory saving). - * - * The memory saving factor calculated here takes same-filled pages into - * account, but those are not freeable since they almost occupy no - * space. Hence, we may scale nr_freeable down a little bit more than we - * should if we have a lot of same-filled pages. 
*/ return mult_frac(nr_freeable, nr_backing, nr_stored); } @@ -1416,42 +1401,6 @@ resched: } while (zswap_total_pages() > thr); } -/********************************* -* same-filled functions -**********************************/ -static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value) -{ - unsigned long *data; - unsigned long val; - unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1; - bool ret = false; - - data = kmap_local_folio(folio, 0); - val = data[0]; - - if (val != data[last_pos]) - goto out; - - for (pos = 1; pos < last_pos; pos++) { - if (val != data[pos]) - goto out; - } - - *value = val; - ret = true; -out: - kunmap_local(data); - return ret; -} - -static void zswap_fill_folio(struct folio *folio, unsigned long value) -{ - unsigned long *data = kmap_local_folio(folio, 0); - - memset_l(data, value, PAGE_SIZE / sizeof(unsigned long)); - kunmap_local(data); -} - /********************************* * main API **********************************/ @@ -1463,7 +1412,6 @@ bool zswap_store(struct folio *folio) struct zswap_entry *entry, *old; struct obj_cgroup *objcg = NULL; struct mem_cgroup *memcg = NULL; - unsigned long value; VM_WARN_ON_ONCE(!folio_test_locked(folio)); VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); @@ -1496,13 +1444,6 @@ bool zswap_store(struct folio *folio) goto reject; } - if (zswap_is_folio_same_filled(folio, &value)) { - entry->length = 0; - entry->value = value; - atomic_inc(&zswap_same_filled_pages); - goto store_entry; - } - /* if entry is successfully added, it keeps the reference */ entry->pool = zswap_pool_current_get(); if (!entry->pool) @@ -1520,7 +1461,6 @@ bool zswap_store(struct folio *folio) if (!zswap_compress(folio, entry)) goto put_pool; -store_entry: entry->swpentry = swp; entry->objcg = objcg; entry->referenced = true; @@ -1569,13 +1509,9 @@ store_entry: return true; store_failed: - if (!entry->length) - atomic_dec(&zswap_same_filled_pages); - else { - zpool_free(entry->pool->zpool, entry->handle); + zpool_free(entry->pool->zpool, entry->handle); put_pool: - zswap_pool_put(entry->pool); - } + zswap_pool_put(entry->pool); freepage: zswap_entry_cache_free(entry); reject: @@ -1638,10 +1574,7 @@ bool zswap_load(struct folio *folio) if (!entry) return false; - if (entry->length) - zswap_decompress(entry, folio); - else - zswap_fill_folio(folio, entry->value); + zswap_decompress(entry, folio); count_vm_event(ZSWPIN); if (entry->objcg) @@ -1744,8 +1677,6 @@ static int zswap_debugfs_init(void) zswap_debugfs_root, NULL, &total_size_fops); debugfs_create_atomic_t("stored_pages", 0444, zswap_debugfs_root, &zswap_stored_pages); - debugfs_create_atomic_t("same_filled_pages", 0444, - zswap_debugfs_root, &zswap_same_filled_pages); return 0; } From fd06ce2ce4c1f93e3d5d7d2640672ccb1a94cce3 Mon Sep 17 00:00:00 2001 From: Mike Yuan Date: Fri, 23 Aug 2024 16:27:09 +0000 Subject: [PATCH 244/416] selftests: test_zswap: add test for hierarchical zswap.writeback MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Ensure that zswap.writeback check goes up the cgroup tree, i.e. is hierarchical. Create a subcgroup which has zswap.writeback set to 1, and the upper hierarchy's restrictions shall apply. 
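The hierarchy rule being exercised is that effective writeback is the logical
AND of the setting at every level from the cgroup up to the root, so a child
with memory.zswap.writeback=1 under a parent with 0 still may not write back.
A simplified sketch of such a walk (illustrative only; the helper name is
made up and this is not the exact in-kernel check):

static bool zswap_writeback_allowed(struct mem_cgroup *memcg)
{
	/* Any ancestor that disabled writeback vetoes it for the
	 * whole subtree. */
	for (; memcg; memcg = parent_mem_cgroup(memcg))
		if (!READ_ONCE(memcg->zswap_writeback))
			return false;
	return true;
}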
Link: https://lkml.kernel.org/r/20240823162506.12117-2-me@yhndnzj.com Signed-off-by: Mike Yuan Cc: Johannes Weiner Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Shakeel Butt Cc: Yosry Ahmed Signed-off-by: Andrew Morton --- tools/testing/selftests/cgroup/test_zswap.c | 79 +++++++++++++++------ 1 file changed, 56 insertions(+), 23 deletions(-) diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c index 190096017f80..40de679248b8 100644 --- a/tools/testing/selftests/cgroup/test_zswap.c +++ b/tools/testing/selftests/cgroup/test_zswap.c @@ -263,15 +263,13 @@ out: static int attempt_writeback(const char *cgroup, void *arg) { long pagesize = sysconf(_SC_PAGESIZE); - char *test_group = arg; size_t memsize = MB(4); char buf[pagesize]; long zswap_usage; - bool wb_enabled; + bool wb_enabled = *(bool *) arg; int ret = -1; char *mem; - wb_enabled = cg_read_long(test_group, "memory.zswap.writeback"); mem = (char *)malloc(memsize); if (!mem) return ret; @@ -288,12 +286,12 @@ static int attempt_writeback(const char *cgroup, void *arg) memcpy(&mem[i], buf, pagesize); /* Try and reclaim allocated memory */ - if (cg_write_numeric(test_group, "memory.reclaim", memsize)) { + if (cg_write_numeric(cgroup, "memory.reclaim", memsize)) { ksft_print_msg("Failed to reclaim all of the requested memory\n"); goto out; } - zswap_usage = cg_read_long(test_group, "memory.zswap.current"); + zswap_usage = cg_read_long(cgroup, "memory.zswap.current"); /* zswpin */ for (int i = 0; i < memsize; i += pagesize) { @@ -303,7 +301,7 @@ static int attempt_writeback(const char *cgroup, void *arg) } } - if (cg_write_numeric(test_group, "memory.zswap.max", zswap_usage/2)) + if (cg_write_numeric(cgroup, "memory.zswap.max", zswap_usage/2)) goto out; /* @@ -312,7 +310,7 @@ static int attempt_writeback(const char *cgroup, void *arg) * If writeback is disabled, memory reclaim will fail as zswap is limited and * it can't writeback to swap. */ - ret = cg_write_numeric(test_group, "memory.reclaim", memsize); + ret = cg_write_numeric(cgroup, "memory.reclaim", memsize); if (!wb_enabled) ret = (ret == -EAGAIN) ? 0 : -1; @@ -321,12 +319,41 @@ out: return ret; } +static int test_zswap_writeback_one(const char *cgroup, bool wb) +{ + long zswpwb_before, zswpwb_after; + + zswpwb_before = get_cg_wb_count(cgroup); + if (zswpwb_before != 0) { + ksft_print_msg("zswpwb_before = %ld instead of 0\n", zswpwb_before); + return -1; + } + + if (cg_run(cgroup, attempt_writeback, (void *) &wb)) + return -1; + + /* Verify that zswap writeback occurred only if writeback was enabled */ + zswpwb_after = get_cg_wb_count(cgroup); + if (zswpwb_after < 0) + return -1; + + if (wb != !!zswpwb_after) { + ksft_print_msg("zswpwb_after is %ld while wb is %s", + zswpwb_after, wb ? "enabled" : "disabled"); + return -1; + } + + return 0; +} + /* Test to verify the zswap writeback path */ static int test_zswap_writeback(const char *root, bool wb) { - long zswpwb_before, zswpwb_after; int ret = KSFT_FAIL; - char *test_group; + char *test_group, *test_group_child = NULL; + + if (cg_read_strcmp(root, "memory.zswap.writeback", "1")) + return KSFT_SKIP; test_group = cg_name(root, "zswap_writeback_test"); if (!test_group) @@ -336,29 +363,35 @@ static int test_zswap_writeback(const char *root, bool wb) if (cg_write(test_group, "memory.zswap.writeback", wb ? 
"1" : "0")) goto out; - zswpwb_before = get_cg_wb_count(test_group); - if (zswpwb_before != 0) { - ksft_print_msg("zswpwb_before = %ld instead of 0\n", zswpwb_before); - goto out; - } - - if (cg_run(test_group, attempt_writeback, (void *) test_group)) + if (test_zswap_writeback_one(test_group, wb)) goto out; - /* Verify that zswap writeback occurred only if writeback was enabled */ - zswpwb_after = get_cg_wb_count(test_group); - if (zswpwb_after < 0) + /* Reset memory.zswap.max to max (modified by attempt_writeback), and + * set up child cgroup, whose memory.zswap.writeback is hardcoded to 1. + * Thus, the parent's setting shall be what's in effect. */ + if (cg_write(test_group, "memory.zswap.max", "max")) + goto out; + if (cg_write(test_group, "cgroup.subtree_control", "+memory")) goto out; - if (wb != !!zswpwb_after) { - ksft_print_msg("zswpwb_after is %ld while wb is %s", - zswpwb_after, wb ? "enabled" : "disabled"); + test_group_child = cg_name(test_group, "zswap_writeback_test_child"); + if (!test_group_child) + goto out; + if (cg_create(test_group_child)) + goto out; + if (cg_write(test_group_child, "memory.zswap.writeback", "1")) + goto out; + + if (test_zswap_writeback_one(test_group_child, wb)) goto out; - } ret = KSFT_PASS; out: + if (test_group_child) { + cg_destroy(test_group_child); + free(test_group_child); + } cg_destroy(test_group); free(test_group); return ret; From 5a53623d0fe658b1b8699632ac8ed7dd0d9b2e97 Mon Sep 17 00:00:00 2001 From: Mike Yuan Date: Fri, 23 Aug 2024 16:27:11 +0000 Subject: [PATCH 245/416] Documentation/cgroup-v2: clarify that zswap.writeback is ignored if zswap is disabled MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit As discussed in [1], zswap-related settings natually lose their effect when zswap is disabled, specifically zswap.writeback here. Be explicit about this behavior. [1] https://lore.kernel.org/linux-kernel/CAKEwX=Mhbwhh-=xxCU-RjMXS_n=RpV3Gtznb2m_3JgL+jzz++g@mail.gmail.com/ [akpm@linux-foundation.org: fix/simplify text] Link: https://lkml.kernel.org/r/20240823162506.12117-3-me@yhndnzj.com Signed-off-by: Mike Yuan Cc: Johannes Weiner Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Nhat Pham Cc: Roman Gushchin Cc: Shakeel Butt Cc: Yosry Ahmed Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v2.rst | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index d3344218010c..3dad62b1e11c 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1755,6 +1755,8 @@ The following nested keys are defined. Note that this is subtly different from setting memory.swap.max to 0, as it still allows for pages to be written to the zswap pool. + This setting has no effect if zswap is disabled, and swapping + is allowed unless memory.swap.max is set to 0. memory.pressure A read-only nested-keyed file. From b7012d513f81959596a01083415597edb04cf509 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:41 -0400 Subject: [PATCH 246/416] mm/vma: correctly position vma_iterator in __split_vma() Patch series "Avoid MAP_FIXED gap exposure", v8. It is now possible to walk the vma tree using the rcu read locks and is beneficial to do so to reduce lock contention. Doing so while a MAP_FIXED mapping is executing means that a reader may see a gap in the vma tree that should never logically exist - and does not when using the mmap lock in read mode. 
The temporal gap exists because mmap_region() calls munmap() prior to installing the new mapping. This patch set stops rcu readers from seeing the temporal gap by splitting up the munmap() function into two parts. The first part prepares the vma tree for modifications by doing the necessary splits and tracks the vmas marked for removal in a side tree. The second part completes the munmapping of the vmas after the vma tree has been overwritten (either by a MAP_FIXED replacement vma or by a NULL in the munmap() case). Please note that rcu walkers will still be able to see a temporary state of split vmas that may be in the process of being removed, but the temporal gap will not be exposed. vma_start_write() are called on both parts of the split vma, so this state is detectable. If existing vmas have a vm_ops->close(), then they will be called prior to mapping the new vmas (and ptes are cleared out). Without calling ->close(), hugetlbfs tests fail (hugemmap06 specifically) due to resources still being marked as 'busy'. Unfortunately, calling the corresponding ->open() may not restore the state of the vmas, so it is safer to keep the existing failure scenario where a gap is inserted and never replaced. The failure scenario is in its own patch (0015) for traceability. This patch (of 21): The vma iterator may be left pointing to the newly created vma. This happens when inserting the new vma at the end of the old vma (!new_below). The incorrect position in the vma iterator is not exposed currently since the vma iterator is repositioned in the munmap path and is not reused in any of the other paths. This has limited impact in the current code, but is required for future changes. Link: https://lkml.kernel.org/r/20240830040101.822209-2-Liam.Howlett@oracle.com Fixes: b2b3b886738f ("mm: don't use __vma_adjust() in __split_vma()") Signed-off-by: Liam R. Howlett Reviewed-by: Suren Baghdasaryan Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Sidhartha Kumar Cc: Vlastimil Babka Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Paul Moore Signed-off-by: Andrew Morton --- mm/vma.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/mm/vma.c b/mm/vma.c index 5850f7c0949b..066de79b7b73 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -177,7 +177,7 @@ void unmap_region(struct mm_struct *mm, struct ma_state *mas, /* * __split_vma() bypasses sysctl_max_map_count checking. We use this where it * has already been checked or doesn't make sense to fail. - * VMA Iterator will point to the end VMA. + * VMA Iterator will point to the original VMA. */ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long addr, int new_below) @@ -246,6 +246,9 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, /* Success. */ if (new_below) vma_next(vmi); + else + vma_prev(vmi); + return 0; out_free_mpol: From 7e7b2370ed0551d2fa8e4bd6e4bbd603fef3cdcd Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:42 -0400 Subject: [PATCH 247/416] mm/vma: introduce abort_munmap_vmas() Extract clean up of failed munmap() operations from do_vmi_align_munmap(). This simplifies later patches in the series. It is worth noting that the mas_for_each() loop now has a different upper limit. This should not change the number of vmas visited for reattaching to the main vma tree (mm_mt), as all vmas are reattached in both scenarios. 
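Concretely, the upper-limit change noted above is the following (excerpted from
the hunk below):

	/* before: iterate the detached tree up to the munmap end address */
	mas_set(&mas_detach, 0);
	mas_for_each(&mas_detach, next, end)
		vma_mark_detached(next, false);

	/* after, in abort_munmap_vmas(): walk the entire detached tree */
	mas_set(mas_detach, 0);
	mas_for_each(mas_detach, vma, ULONG_MAX)
		vma_mark_detached(vma, false);

The detached tree only ever contains the vmas gathered for this munmap() call,
so both forms visit the same vmas.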
Link: https://lkml.kernel.org/r/20240830040101.822209-3-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Reviewed-by: Suren Baghdasaryan Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/vma.c | 22 +++++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index 066de79b7b73..58ecd447670d 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -668,6 +668,22 @@ again: validate_mm(mm); } +/* + * abort_munmap_vmas - Undo any munmap work and free resources + * + * Reattach any detached vmas and free up the maple tree used to track the vmas. + */ +static inline void abort_munmap_vmas(struct ma_state *mas_detach) +{ + struct vm_area_struct *vma; + + mas_set(mas_detach, 0); + mas_for_each(mas_detach, vma, ULONG_MAX) + vma_mark_detached(vma, false); + + __mt_destroy(mas_detach->tree); +} + /* * do_vmi_align_munmap() - munmap the aligned region from @start to @end. * @vmi: The vma iterator @@ -834,11 +850,7 @@ clear_tree_failed: userfaultfd_error: munmap_gather_failed: end_split_failed: - mas_set(&mas_detach, 0); - mas_for_each(&mas_detach, next, end) - vma_mark_detached(next, false); - - __mt_destroy(&mt_detach); + abort_munmap_vmas(&mas_detach); start_split_failed: map_count_exceeded: validate_mm(mm); From 01cf21e9e119575a5a334ec19c8b2ef0c5d44c3c Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:43 -0400 Subject: [PATCH 248/416] mm/vma: introduce vmi_complete_munmap_vmas() Extract all necessary operations that need to be completed after the vma maple tree is updated from a munmap() operation. Extracting this makes the later patch in the series easier to understand. Link: https://lkml.kernel.org/r/20240830040101.822209-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Reviewed-by: Suren Baghdasaryan Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/vma.c | 80 ++++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 55 insertions(+), 25 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index 58ecd447670d..3a2098464b8f 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -684,6 +684,58 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach) __mt_destroy(mas_detach->tree); } +/* + * vmi_complete_munmap_vmas() - Finish the munmap() operation + * @vmi: The vma iterator + * @vma: The first vma to be munmapped + * @mm: The mm struct + * @start: The start address + * @end: The end address + * @unlock: Unlock the mm or not + * @mas_detach: them maple state of the detached vma maple tree + * @locked_vm: The locked_vm count in the detached vmas + * + * This function updates the mm_struct, unmaps the region, frees the resources + * used for the munmap() and may downgrade the lock - if requested. Everything + * needed to be done once the vma maple tree is updated. 
+ */ +static void +vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma, + struct mm_struct *mm, unsigned long start, unsigned long end, + bool unlock, struct ma_state *mas_detach, + unsigned long locked_vm) +{ + struct vm_area_struct *prev, *next; + int count; + + count = mas_detach->index + 1; + mm->map_count -= count; + mm->locked_vm -= locked_vm; + if (unlock) + mmap_write_downgrade(mm); + + prev = vma_iter_prev_range(vmi); + next = vma_next(vmi); + if (next) + vma_iter_prev_range(vmi); + + /* + * We can free page tables without write-locking mmap_lock because VMAs + * were isolated before we downgraded mmap_lock. + */ + mas_set(mas_detach, 1); + unmap_region(mm, mas_detach, vma, prev, next, start, end, count, + !unlock); + /* Statistics and freeing VMAs */ + mas_set(mas_detach, 0); + remove_mt(mm, mas_detach); + validate_mm(mm); + if (unlock) + mmap_read_unlock(mm); + + __mt_destroy(mas_detach->tree); +} + /* * do_vmi_align_munmap() - munmap the aligned region from @start to @end. * @vmi: The vma iterator @@ -703,7 +755,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, struct mm_struct *mm, unsigned long start, unsigned long end, struct list_head *uf, bool unlock) { - struct vm_area_struct *prev, *next = NULL; + struct vm_area_struct *next = NULL; struct maple_tree mt_detach; int count = 0; int error = -ENOMEM; @@ -818,31 +870,9 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, goto clear_tree_failed; /* Point of no return */ - mm->locked_vm -= locked_vm; - mm->map_count -= count; - if (unlock) - mmap_write_downgrade(mm); + vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach, + locked_vm); - prev = vma_iter_prev_range(vmi); - next = vma_next(vmi); - if (next) - vma_iter_prev_range(vmi); - - /* - * We can free page tables without write-locking mmap_lock because VMAs - * were isolated before we downgraded mmap_lock. - */ - mas_set(&mas_detach, 1); - unmap_region(mm, &mas_detach, vma, prev, next, start, end, count, - !unlock); - /* Statistics and freeing VMAs */ - mas_set(&mas_detach, 0); - remove_mt(mm, &mas_detach); - validate_mm(mm); - if (unlock) - mmap_read_unlock(mm); - - __mt_destroy(&mt_detach); return 0; modify_vma_failed: From 6898c9039bc8e3027ae0fcd0f05fc2b82ccc8be0 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:44 -0400 Subject: [PATCH 249/416] mm/vma: extract the gathering of vmas from do_vmi_align_munmap() Create vmi_gather_munmap_vmas() to handle the gathering of vmas into a detached maple tree for removal later. Part of the gathering is the splitting of vmas that span the boundary. Link: https://lkml.kernel.org/r/20240830040101.822209-5-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/vma.c | 95 ++++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 62 insertions(+), 33 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index 3a2098464b8f..f691c1db5b12 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -737,32 +737,30 @@ vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma, } /* - * do_vmi_align_munmap() - munmap the aligned region from @start to @end. 
+ * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree + * for removal at a later date. Handles splitting first and last if necessary + * and marking the vmas as isolated. + * * @vmi: The vma iterator * @vma: The starting vm_area_struct * @mm: The mm_struct * @start: The aligned start address to munmap. * @end: The aligned end address to munmap. * @uf: The userfaultfd list_head - * @unlock: Set to true to drop the mmap_lock. unlocking only happens on - * success. + * @mas_detach: The maple state tracking the detached tree + * @locked_vm: a pointer to store the VM_LOCKED pages count. * - * Return: 0 on success and drops the lock if so directed, error and leaves the - * lock held otherwise. + * Return: 0 on success, -EPERM on mseal vmas, -ENOMEM otherwise */ -int -do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, +static int +vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma, struct mm_struct *mm, unsigned long start, - unsigned long end, struct list_head *uf, bool unlock) + unsigned long end, struct list_head *uf, + struct ma_state *mas_detach, unsigned long *locked_vm) { struct vm_area_struct *next = NULL; - struct maple_tree mt_detach; int count = 0; int error = -ENOMEM; - unsigned long locked_vm = 0; - MA_STATE(mas_detach, &mt_detach, 0, 0); - mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK); - mt_on_stack(mt_detach); /* * If we need to split any vma, do it now to save pain later. @@ -789,8 +787,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, goto start_split_failed; } - error = __split_vma(vmi, vma, start, 1); - if (error) + if (__split_vma(vmi, vma, start, 1)) goto start_split_failed; } @@ -807,20 +804,18 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, /* Does it split the end? */ if (next->vm_end > end) { - error = __split_vma(vmi, next, end, 0); - if (error) + if (__split_vma(vmi, next, end, 0)) goto end_split_failed; } vma_start_write(next); - mas_set(&mas_detach, count); - error = mas_store_gfp(&mas_detach, next, GFP_KERNEL); - if (error) + mas_set(mas_detach, count++); + if (mas_store_gfp(mas_detach, next, GFP_KERNEL)) goto munmap_gather_failed; + vma_mark_detached(next, true); if (next->vm_flags & VM_LOCKED) - locked_vm += vma_pages(next); + *locked_vm += vma_pages(next); - count++; if (unlikely(uf)) { /* * If userfaultfd_unmap_prep returns an error the vmas @@ -831,9 +826,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, * split, despite we could. This is unlikely enough * failure that it's not worth optimizing it for. */ - error = userfaultfd_unmap_prep(next, start, end, uf); - - if (error) + if (userfaultfd_unmap_prep(next, start, end, uf)) goto userfaultfd_error; } #ifdef CONFIG_DEBUG_VM_MAPLE_TREE @@ -845,7 +838,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, #if defined(CONFIG_DEBUG_VM_MAPLE_TREE) /* Make sure no VMAs are about to be lost. 
*/ { - MA_STATE(test, &mt_detach, 0, 0); + MA_STATE(test, mas_detach->tree, 0, 0); struct vm_area_struct *vma_mas, *vma_test; int test_count = 0; @@ -865,6 +858,48 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, while (vma_iter_addr(vmi) > start) vma_iter_prev_range(vmi); + return 0; + +userfaultfd_error: +munmap_gather_failed: +end_split_failed: +modify_vma_failed: + abort_munmap_vmas(mas_detach); +start_split_failed: +map_count_exceeded: + return error; +} + +/* + * do_vmi_align_munmap() - munmap the aligned region from @start to @end. + * @vmi: The vma iterator + * @vma: The starting vm_area_struct + * @mm: The mm_struct + * @start: The aligned start address to munmap. + * @end: The aligned end address to munmap. + * @uf: The userfaultfd list_head + * @unlock: Set to true to drop the mmap_lock. unlocking only happens on + * success. + * + * Return: 0 on success and drops the lock if so directed, error and leaves the + * lock held otherwise. + */ +int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, + struct mm_struct *mm, unsigned long start, unsigned long end, + struct list_head *uf, bool unlock) +{ + struct maple_tree mt_detach; + MA_STATE(mas_detach, &mt_detach, 0, 0); + mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK); + mt_on_stack(mt_detach); + int error; + unsigned long locked_vm = 0; + + error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf, + &mas_detach, &locked_vm); + if (error) + goto gather_failed; + error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL); if (error) goto clear_tree_failed; @@ -872,17 +907,11 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, /* Point of no return */ vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach, locked_vm); - return 0; -modify_vma_failed: clear_tree_failed: -userfaultfd_error: -munmap_gather_failed: -end_split_failed: abort_munmap_vmas(&mas_detach); -start_split_failed: -map_count_exceeded: +gather_failed: validate_mm(mm); return error; } From dba14840905f9ecaad0b3a261e4d7a88120c8c7a Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:45 -0400 Subject: [PATCH 250/416] mm/vma: introduce vma_munmap_struct for use in munmap operations Use a structure to pass along all the necessary information and counters involved in removing vmas from the mm_struct. Update vmi_ function names to vms_ to indicate the first argument type change. Link: https://lkml.kernel.org/r/20240830040101.822209-6-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Suren Baghdasaryan Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/vma.c | 140 +++++++++++++++++++++++++++++-------------------------- mm/vma.h | 16 +++++++ 2 files changed, 90 insertions(+), 66 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index f691c1db5b12..f24b52a87458 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -80,6 +80,32 @@ static void init_multi_vma_prep(struct vma_prepare *vp, } +/* + * init_vma_munmap() - Initializer wrapper for vma_munmap_struct + * @vms: The vma munmap struct + * @vmi: The vma iterator + * @vma: The first vm_area_struct to munmap + * @start: The aligned start address to munmap + * @end: The aligned end address to munmap + * @uf: The userfaultfd list_head + * @unlock: Unlock after the operation. 
Only unlocked on success + */ +static inline void init_vma_munmap(struct vma_munmap_struct *vms, + struct vma_iterator *vmi, struct vm_area_struct *vma, + unsigned long start, unsigned long end, struct list_head *uf, + bool unlock) +{ + vms->vmi = vmi; + vms->vma = vma; + vms->mm = vma->vm_mm; + vms->start = start; + vms->end = end; + vms->unlock = unlock; + vms->uf = uf; + vms->vma_count = 0; + vms->nr_pages = vms->locked_vm = 0; +} + /* * Return true if we can merge this (vm_flags,anon_vma,file,vm_pgoff) * in front of (at a lower virtual address and file offset than) the vma. @@ -685,81 +711,62 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach) } /* - * vmi_complete_munmap_vmas() - Finish the munmap() operation - * @vmi: The vma iterator - * @vma: The first vma to be munmapped - * @mm: The mm struct - * @start: The start address - * @end: The end address - * @unlock: Unlock the mm or not - * @mas_detach: them maple state of the detached vma maple tree - * @locked_vm: The locked_vm count in the detached vmas + * vms_complete_munmap_vmas() - Finish the munmap() operation + * @vms: The vma munmap struct + * @mas_detach: The maple state of the detached vmas * - * This function updates the mm_struct, unmaps the region, frees the resources + * This updates the mm_struct, unmaps the region, frees the resources * used for the munmap() and may downgrade the lock - if requested. Everything * needed to be done once the vma maple tree is updated. */ -static void -vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma, - struct mm_struct *mm, unsigned long start, unsigned long end, - bool unlock, struct ma_state *mas_detach, - unsigned long locked_vm) +static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, + struct ma_state *mas_detach) { struct vm_area_struct *prev, *next; - int count; + struct mm_struct *mm; - count = mas_detach->index + 1; - mm->map_count -= count; - mm->locked_vm -= locked_vm; - if (unlock) + mm = vms->mm; + mm->map_count -= vms->vma_count; + mm->locked_vm -= vms->locked_vm; + if (vms->unlock) mmap_write_downgrade(mm); - prev = vma_iter_prev_range(vmi); - next = vma_next(vmi); + prev = vma_iter_prev_range(vms->vmi); + next = vma_next(vms->vmi); if (next) - vma_iter_prev_range(vmi); + vma_iter_prev_range(vms->vmi); /* * We can free page tables without write-locking mmap_lock because VMAs * were isolated before we downgraded mmap_lock. */ mas_set(mas_detach, 1); - unmap_region(mm, mas_detach, vma, prev, next, start, end, count, - !unlock); + unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end, + vms->vma_count, !vms->unlock); /* Statistics and freeing VMAs */ mas_set(mas_detach, 0); remove_mt(mm, mas_detach); validate_mm(mm); - if (unlock) + if (vms->unlock) mmap_read_unlock(mm); __mt_destroy(mas_detach->tree); } /* - * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree + * vms_gather_munmap_vmas() - Put all VMAs within a range into a maple tree * for removal at a later date. Handles splitting first and last if necessary * and marking the vmas as isolated. * - * @vmi: The vma iterator - * @vma: The starting vm_area_struct - * @mm: The mm_struct - * @start: The aligned start address to munmap. - * @end: The aligned end address to munmap. - * @uf: The userfaultfd list_head + * @vms: The vma munmap struct * @mas_detach: The maple state tracking the detached tree - * @locked_vm: a pointer to store the VM_LOCKED pages count. 
* * Return: 0 on success, -EPERM on mseal vmas, -ENOMEM otherwise */ -static int -vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma, - struct mm_struct *mm, unsigned long start, - unsigned long end, struct list_head *uf, - struct ma_state *mas_detach, unsigned long *locked_vm) +static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, + struct ma_state *mas_detach) { struct vm_area_struct *next = NULL; - int count = 0; int error = -ENOMEM; /* @@ -771,23 +778,24 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma, */ /* Does it split the first one? */ - if (start > vma->vm_start) { + if (vms->start > vms->vma->vm_start) { /* * Make sure that map_count on return from munmap() will * not exceed its limit; but let map_count go just above * its limit temporarily, to help free resources as expected. */ - if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count) + if (vms->end < vms->vma->vm_end && + vms->mm->map_count >= sysctl_max_map_count) goto map_count_exceeded; /* Don't bother splitting the VMA if we can't unmap it anyway */ - if (!can_modify_vma(vma)) { + if (!can_modify_vma(vms->vma)) { error = -EPERM; goto start_split_failed; } - if (__split_vma(vmi, vma, start, 1)) + if (__split_vma(vms->vmi, vms->vma, vms->start, 1)) goto start_split_failed; } @@ -795,7 +803,7 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma, * Detach a range of VMAs from the mm. Using next as a temp variable as * it is always overwritten. */ - next = vma; + next = vms->vma; do { if (!can_modify_vma(next)) { error = -EPERM; @@ -803,20 +811,20 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma, } /* Does it split the end? */ - if (next->vm_end > end) { - if (__split_vma(vmi, next, end, 0)) + if (next->vm_end > vms->end) { + if (__split_vma(vms->vmi, next, vms->end, 0)) goto end_split_failed; } vma_start_write(next); - mas_set(mas_detach, count++); + mas_set(mas_detach, vms->vma_count++); if (mas_store_gfp(mas_detach, next, GFP_KERNEL)) goto munmap_gather_failed; vma_mark_detached(next, true); if (next->vm_flags & VM_LOCKED) - *locked_vm += vma_pages(next); + vms->locked_vm += vma_pages(next); - if (unlikely(uf)) { + if (unlikely(vms->uf)) { /* * If userfaultfd_unmap_prep returns an error the vmas * will remain split, but userland will get a @@ -826,14 +834,15 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma, * split, despite we could. This is unlikely enough * failure that it's not worth optimizing it for. */ - if (userfaultfd_unmap_prep(next, start, end, uf)) + if (userfaultfd_unmap_prep(next, vms->start, vms->end, + vms->uf)) goto userfaultfd_error; } #ifdef CONFIG_DEBUG_VM_MAPLE_TREE - BUG_ON(next->vm_start < start); - BUG_ON(next->vm_start > end); + BUG_ON(next->vm_start < vms->start); + BUG_ON(next->vm_start > vms->end); #endif - } for_each_vma_range(*vmi, next, end); + } for_each_vma_range(*(vms->vmi), next, vms->end); #if defined(CONFIG_DEBUG_VM_MAPLE_TREE) /* Make sure no VMAs are about to be lost. 
*/ @@ -842,21 +851,21 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma, struct vm_area_struct *vma_mas, *vma_test; int test_count = 0; - vma_iter_set(vmi, start); + vma_iter_set(vms->vmi, vms->start); rcu_read_lock(); - vma_test = mas_find(&test, count - 1); - for_each_vma_range(*vmi, vma_mas, end) { + vma_test = mas_find(&test, vms->vma_count - 1); + for_each_vma_range(*(vms->vmi), vma_mas, vms->end) { BUG_ON(vma_mas != vma_test); test_count++; - vma_test = mas_next(&test, count - 1); + vma_test = mas_next(&test, vms->vma_count - 1); } rcu_read_unlock(); - BUG_ON(count != test_count); + BUG_ON(vms->vma_count != test_count); } #endif - while (vma_iter_addr(vmi) > start) - vma_iter_prev_range(vmi); + while (vma_iter_addr(vms->vmi) > vms->start) + vma_iter_prev_range(vms->vmi); return 0; @@ -892,11 +901,11 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, MA_STATE(mas_detach, &mt_detach, 0, 0); mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK); mt_on_stack(mt_detach); + struct vma_munmap_struct vms; int error; - unsigned long locked_vm = 0; - error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf, - &mas_detach, &locked_vm); + init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock); + error = vms_gather_munmap_vmas(&vms, &mas_detach); if (error) goto gather_failed; @@ -905,8 +914,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, goto clear_tree_failed; /* Point of no return */ - vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach, - locked_vm); + vms_complete_munmap_vmas(&vms, &mas_detach); return 0; clear_tree_failed: diff --git a/mm/vma.h b/mm/vma.h index da31d0f62157..cb67acf59012 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -26,6 +26,22 @@ struct unlink_vma_file_batch { struct vm_area_struct *vmas[8]; }; +/* + * vma munmap operation + */ +struct vma_munmap_struct { + struct vma_iterator *vmi; + struct mm_struct *mm; + struct vm_area_struct *vma; /* The first vma to munmap */ + struct list_head *uf; /* Userfaultfd list_head */ + unsigned long start; /* Aligned start addr (inclusive) */ + unsigned long end; /* Aligned end addr (exclusive) */ + int vma_count; /* Number of vmas that will be removed */ + unsigned long nr_pages; /* Number of pages being removed */ + unsigned long locked_vm; /* Number of locked pages */ + bool unlock; /* Unlock after the munmap */ +}; + #ifdef CONFIG_DEBUG_VM_MAPLE_TREE void validate_mm(struct mm_struct *mm); #else From 17f1ae9b40c6b03760ca937c53e7f0d38c2613a2 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:46 -0400 Subject: [PATCH 251/416] mm/vma: change munmap to use vma_munmap_struct() for accounting and surrounding vmas Clean up the code by changing the munmap operation to use a structure for the accounting and munmap variables. Since remove_mt() is only called in one location and the contents will be reduced to almost nothing. The remains of the function can be added to vms_complete_munmap_vmas(). Link: https://lkml.kernel.org/r/20240830040101.822209-7-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Reviewed-by: Suren Baghdasaryan Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. 
McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/vma.c | 83 +++++++++++++++++++++++++++++--------------------------- mm/vma.h | 6 ++++ 2 files changed, 49 insertions(+), 40 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index f24b52a87458..6d042cd46cdb 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -103,7 +103,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms, vms->unlock = unlock; vms->uf = uf; vms->vma_count = 0; - vms->nr_pages = vms->locked_vm = 0; + vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0; + vms->exec_vm = vms->stack_vm = vms->data_vm = 0; } /* @@ -299,30 +300,6 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, return __split_vma(vmi, vma, addr, new_below); } -/* - * Ok - we have the memory areas we should free on a maple tree so release them, - * and do the vma updates. - * - * Called with the mm semaphore held. - */ -static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas) -{ - unsigned long nr_accounted = 0; - struct vm_area_struct *vma; - - /* Update high watermark before we lower total_vm */ - update_hiwater_vm(mm); - mas_for_each(mas, vma, ULONG_MAX) { - long nrpages = vma_pages(vma); - - if (vma->vm_flags & VM_ACCOUNT) - nr_accounted += nrpages; - vm_stat_account(mm, vma->vm_flags, -nrpages); - remove_vma(vma, false); - } - vm_unacct_memory(nr_accounted); -} - /* * init_vma_prep() - Initializer wrapper for vma_prepare struct * @vp: The vma_prepare struct @@ -722,7 +699,7 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach) static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, struct ma_state *mas_detach) { - struct vm_area_struct *prev, *next; + struct vm_area_struct *vma; struct mm_struct *mm; mm = vms->mm; @@ -731,21 +708,31 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, if (vms->unlock) mmap_write_downgrade(mm); - prev = vma_iter_prev_range(vms->vmi); - next = vma_next(vms->vmi); - if (next) - vma_iter_prev_range(vms->vmi); - /* * We can free page tables without write-locking mmap_lock because VMAs * were isolated before we downgraded mmap_lock. */ mas_set(mas_detach, 1); - unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end, - vms->vma_count, !vms->unlock); - /* Statistics and freeing VMAs */ + unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next, + vms->start, vms->end, vms->vma_count, !vms->unlock); + /* Update high watermark before we lower total_vm */ + update_hiwater_vm(mm); + /* Stat accounting */ + WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm) - vms->nr_pages); + /* Paranoid bookkeeping */ + VM_WARN_ON(vms->exec_vm > mm->exec_vm); + VM_WARN_ON(vms->stack_vm > mm->stack_vm); + VM_WARN_ON(vms->data_vm > mm->data_vm); + mm->exec_vm -= vms->exec_vm; + mm->stack_vm -= vms->stack_vm; + mm->data_vm -= vms->data_vm; + + /* Remove and clean up vmas */ mas_set(mas_detach, 0); - remove_mt(mm, mas_detach); + mas_for_each(mas_detach, vma, ULONG_MAX) + remove_vma(vma, false); + + vm_unacct_memory(vms->nr_accounted); validate_mm(mm); if (vms->unlock) mmap_read_unlock(mm); @@ -798,18 +785,19 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, if (__split_vma(vms->vmi, vms->vma, vms->start, 1)) goto start_split_failed; } + vms->prev = vma_prev(vms->vmi); /* * Detach a range of VMAs from the mm. Using next as a temp variable as * it is always overwritten. 
*/ - next = vms->vma; - do { + for_each_vma_range(*(vms->vmi), next, vms->end) { + long nrpages; + if (!can_modify_vma(next)) { error = -EPERM; goto modify_vma_failed; } - /* Does it split the end? */ if (next->vm_end > vms->end) { if (__split_vma(vms->vmi, next, vms->end, 0)) @@ -821,8 +809,21 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, goto munmap_gather_failed; vma_mark_detached(next, true); + nrpages = vma_pages(next); + + vms->nr_pages += nrpages; if (next->vm_flags & VM_LOCKED) - vms->locked_vm += vma_pages(next); + vms->locked_vm += nrpages; + + if (next->vm_flags & VM_ACCOUNT) + vms->nr_accounted += nrpages; + + if (is_exec_mapping(next->vm_flags)) + vms->exec_vm += nrpages; + else if (is_stack_mapping(next->vm_flags)) + vms->stack_vm += nrpages; + else if (is_data_mapping(next->vm_flags)) + vms->data_vm += nrpages; if (unlikely(vms->uf)) { /* @@ -842,7 +843,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, BUG_ON(next->vm_start < vms->start); BUG_ON(next->vm_start > vms->end); #endif - } for_each_vma_range(*(vms->vmi), next, vms->end); + } + + vms->next = vma_next(vms->vmi); #if defined(CONFIG_DEBUG_VM_MAPLE_TREE) /* Make sure no VMAs are about to be lost. */ diff --git a/mm/vma.h b/mm/vma.h index cb67acf59012..cbf55e0e0c4f 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -33,12 +33,18 @@ struct vma_munmap_struct { struct vma_iterator *vmi; struct mm_struct *mm; struct vm_area_struct *vma; /* The first vma to munmap */ + struct vm_area_struct *prev; /* vma before the munmap area */ + struct vm_area_struct *next; /* vma after the munmap area */ struct list_head *uf; /* Userfaultfd list_head */ unsigned long start; /* Aligned start addr (inclusive) */ unsigned long end; /* Aligned end addr (exclusive) */ int vma_count; /* Number of vmas that will be removed */ unsigned long nr_pages; /* Number of pages being removed */ unsigned long locked_vm; /* Number of locked pages */ + unsigned long nr_accounted; /* Number of VM_ACCOUNT pages */ + unsigned long exec_vm; + unsigned long stack_vm; + unsigned long data_vm; bool unlock; /* Unlock after the munmap */ }; From 89b2d2a57eb97eec2e782976844995f4dd189998 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:47 -0400 Subject: [PATCH 252/416] mm/vma: extract validate_mm() from vma_complete() vma_complete() will need to be called during an unsafe time to call validate_mm(). Extract the call in all places now so that only one location can be modified in the next change. Link: https://lkml.kernel.org/r/20240830040101.822209-8-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. 
McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mmap.c | 1 + mm/vma.c | 5 ++++- 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/mm/mmap.c b/mm/mmap.c index 548bc45a27bf..dce1cc74ecdb 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1796,6 +1796,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, vma_iter_store(vmi, vma); vma_complete(&vp, vmi, mm); + validate_mm(mm); khugepaged_enter_vma(vma, flags); goto out; } diff --git a/mm/vma.c b/mm/vma.c index 6d042cd46cdb..4e08c1654bdd 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -269,6 +269,7 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, /* vma_complete stores the new vma */ vma_complete(&vp, vmi, vma->vm_mm); + validate_mm(vma->vm_mm); /* Success. */ if (new_below) @@ -548,6 +549,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, vma_iter_store(vmi, vma); vma_complete(&vp, vmi, vma->vm_mm); + validate_mm(vma->vm_mm); return 0; nomem: @@ -589,6 +591,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, vma_iter_clear(vmi); vma_set_range(vma, start, end, pgoff); vma_complete(&vp, vmi, vma->vm_mm); + validate_mm(vma->vm_mm); return 0; } @@ -668,7 +671,6 @@ again: } if (vp->insert && vp->file) uprobe_mmap(vp->insert); - validate_mm(mm); } /* @@ -1197,6 +1199,7 @@ static struct vm_area_struct } vma_complete(&vp, vmi, mm); + validate_mm(mm); khugepaged_enter_vma(res, vm_flags); return res; From c7c0c3c30f4ecb2d5a9869166e52ad23c0b4ec89 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:48 -0400 Subject: [PATCH 253/416] mm/vma: inline munmap operation in mmap_region() mmap_region is already passed sanitized addr and len, so change the call to do_vmi_munmap() to do_vmi_align_munmap() and inline the other checks. The inlining of the function and checks is an intermediate step in the series so future patches are easier to follow. Link: https://lkml.kernel.org/r/20240830040101.822209-9-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mmap.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index dce1cc74ecdb..ec72f05b05f2 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1388,12 +1388,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr, return -ENOMEM; } - /* Unmap any existing mapping in the area */ - error = do_vmi_munmap(&vmi, mm, addr, len, uf, false); - if (error == -EPERM) - return error; - else if (error) - return -ENOMEM; + /* Find the first overlapping VMA */ + vma = vma_find(&vmi, end); + if (vma) { + /* Unmap any existing mapping in the area */ + error = do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false); + if (error) + return error; + vma = NULL; + } /* * Private writable mapping: check memory availability From 9014b230d88d7fc505ada416163b151be70b649a Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:49 -0400 Subject: [PATCH 254/416] mm/vma: expand mmap_region() munmap call Open code the do_vmi_align_munmap() call so that it can be broken up later in the series. This requires exposing a few more vma operations. 
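For reference, the open-coded sequence in mmap_region() ends up looking roughly
like this (condensed from the hunk below; error handling trimmed):

	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);

	/* Gather the vmas to be removed: split boundaries, mark detached */
	error = vms_gather_munmap_vmas(&vms, &mas_detach);

	/* Remove the old mappings from the vma tree */
	error = vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL);

	/* Unmap, free and account the detached vmas */
	vms_complete_munmap_vmas(&vms, &mas_detach);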
Link: https://lkml.kernel.org/r/20240830040101.822209-10-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mmap.c | 26 ++++++++++++++++++++++---- mm/vma.c | 31 ++----------------------------- mm/vma.h | 33 +++++++++++++++++++++++++++++++++ 3 files changed, 57 insertions(+), 33 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index ec72f05b05f2..84cb4b1df4a2 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1366,6 +1366,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr, struct vm_area_struct *next, *prev, *merge; pgoff_t pglen = len >> PAGE_SHIFT; unsigned long charged = 0; + struct vma_munmap_struct vms; + struct ma_state mas_detach; + struct maple_tree mt_detach; unsigned long end = addr + len; unsigned long merge_start = addr, merge_end = end; bool writable_file_mapping = false; @@ -1391,11 +1394,28 @@ unsigned long mmap_region(struct file *file, unsigned long addr, /* Find the first overlapping VMA */ vma = vma_find(&vmi, end); if (vma) { - /* Unmap any existing mapping in the area */ - error = do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false); + mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK); + mt_on_stack(mt_detach); + mas_init(&mas_detach, &mt_detach, /* addr = */ 0); + init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false); + /* Prepare to unmap any existing mapping in the area */ + error = vms_gather_munmap_vmas(&vms, &mas_detach); if (error) return error; + + /* Remove any existing mappings from the vma tree */ + if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL)) + return -ENOMEM; + + /* Unmap any existing mapping in the area */ + vms_complete_munmap_vmas(&vms, &mas_detach); + next = vms.next; + prev = vms.prev; + vma_prev(&vmi); vma = NULL; + } else { + next = vma_next(&vmi); + prev = vma_prev(&vmi); } /* @@ -1408,8 +1428,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vm_flags |= VM_ACCOUNT; } - next = vma_next(&vmi); - prev = vma_prev(&vmi); if (vm_flags & VM_SPECIAL) { if (prev) vma_iter_next_range(&vmi); diff --git a/mm/vma.c b/mm/vma.c index 4e08c1654bdd..fc425eb34bf7 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -80,33 +80,6 @@ static void init_multi_vma_prep(struct vma_prepare *vp, } -/* - * init_vma_munmap() - Initializer wrapper for vma_munmap_struct - * @vms: The vma munmap struct - * @vmi: The vma iterator - * @vma: The first vm_area_struct to munmap - * @start: The aligned start address to munmap - * @end: The aligned end address to munmap - * @uf: The userfaultfd list_head - * @unlock: Unlock after the operation. Only unlocked on success - */ -static inline void init_vma_munmap(struct vma_munmap_struct *vms, - struct vma_iterator *vmi, struct vm_area_struct *vma, - unsigned long start, unsigned long end, struct list_head *uf, - bool unlock) -{ - vms->vmi = vmi; - vms->vma = vma; - vms->mm = vma->vm_mm; - vms->start = start; - vms->end = end; - vms->unlock = unlock; - vms->uf = uf; - vms->vma_count = 0; - vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0; - vms->exec_vm = vms->stack_vm = vms->data_vm = 0; -} - /* * Return true if we can merge this (vm_flags,anon_vma,file,vm_pgoff) * in front of (at a lower virtual address and file offset than) the vma. 
@@ -698,7 +671,7 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach) * used for the munmap() and may downgrade the lock - if requested. Everything * needed to be done once the vma maple tree is updated. */ -static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, +void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, struct ma_state *mas_detach) { struct vm_area_struct *vma; @@ -752,7 +725,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, * * Return: 0 on success, -EPERM on mseal vmas, -ENOMEM otherwise */ -static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, +int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, struct ma_state *mas_detach) { struct vm_area_struct *next = NULL; diff --git a/mm/vma.h b/mm/vma.h index cbf55e0e0c4f..e78b24d1cf83 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -78,6 +78,39 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long start, unsigned long end, pgoff_t pgoff); +/* + * init_vma_munmap() - Initializer wrapper for vma_munmap_struct + * @vms: The vma munmap struct + * @vmi: The vma iterator + * @vma: The first vm_area_struct to munmap + * @start: The aligned start address to munmap + * @end: The aligned end address to munmap + * @uf: The userfaultfd list_head + * @unlock: Unlock after the operation. Only unlocked on success + */ +static inline void init_vma_munmap(struct vma_munmap_struct *vms, + struct vma_iterator *vmi, struct vm_area_struct *vma, + unsigned long start, unsigned long end, struct list_head *uf, + bool unlock) +{ + vms->vmi = vmi; + vms->vma = vma; + vms->mm = vma->vm_mm; + vms->start = start; + vms->end = end; + vms->unlock = unlock; + vms->uf = uf; + vms->vma_count = 0; + vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0; + vms->exec_vm = vms->stack_vm = vms->data_vm = 0; +} + +int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, + struct ma_state *mas_detach); + +void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, + struct ma_state *mas_detach); + int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, struct mm_struct *mm, unsigned long start, From 58e60f8284276d46709a5768dfd6d0e57f04cf45 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:50 -0400 Subject: [PATCH 255/416] mm/vma: support vma == NULL in init_vma_munmap() Adding support for a NULL vma means the init_vma_munmap() can be initialized for a less error-prone process when calling vms_complete_munmap_vmas() later on. Link: https://lkml.kernel.org/r/20240830040101.822209-11-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. 
McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mmap.c | 2 +- mm/vma.h | 11 ++++++++--- 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 84cb4b1df4a2..ac348ae933ba 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1393,11 +1393,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr, /* Find the first overlapping VMA */ vma = vma_find(&vmi, end); + init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false); if (vma) { mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK); mt_on_stack(mt_detach); mas_init(&mas_detach, &mt_detach, /* addr = */ 0); - init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false); /* Prepare to unmap any existing mapping in the area */ error = vms_gather_munmap_vmas(&vms, &mas_detach); if (error) diff --git a/mm/vma.h b/mm/vma.h index e78b24d1cf83..0e214bbf443e 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -95,9 +95,14 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms, { vms->vmi = vmi; vms->vma = vma; - vms->mm = vma->vm_mm; - vms->start = start; - vms->end = end; + if (vma) { + vms->mm = vma->vm_mm; + vms->start = start; + vms->end = end; + } else { + vms->mm = NULL; + vms->start = vms->end = 0; + } vms->unlock = unlock; vms->uf = uf; vms->vma_count = 0; From d744f4acb81ae2f2c33bce71da1f65be32ed1d65 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:51 -0400 Subject: [PATCH 256/416] mm/mmap: reposition vma iterator in mmap_region() Instead of moving (or leaving) the vma iterator pointing at the previous vma, leave it pointing at the insert location. Pointing the vma iterator at the insert location allows for a cleaner walk of the vma tree for MAP_FIXED and the no expansion cases. The vma_prev() call in the case of merging the previous vma is equivalent to vma_iter_prev_range(), since the vma iterator will be pointing to the location just before the previous vma. This change needs to export abort_munmap_vmas() from mm/vma. Link: https://lkml.kernel.org/r/20240830040101.822209-12-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. 
McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mmap.c | 40 +++++++++++++++++++++++----------------- mm/vma.c | 16 ---------------- mm/vma.h | 16 ++++++++++++++++ 3 files changed, 39 insertions(+), 33 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index ac348ae933ba..08cf9199f314 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1401,21 +1401,23 @@ unsigned long mmap_region(struct file *file, unsigned long addr, /* Prepare to unmap any existing mapping in the area */ error = vms_gather_munmap_vmas(&vms, &mas_detach); if (error) - return error; + goto gather_failed; /* Remove any existing mappings from the vma tree */ - if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL)) - return -ENOMEM; + error = vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL); + if (error) + goto clear_tree_failed; /* Unmap any existing mapping in the area */ vms_complete_munmap_vmas(&vms, &mas_detach); next = vms.next; prev = vms.prev; - vma_prev(&vmi); vma = NULL; } else { next = vma_next(&vmi); prev = vma_prev(&vmi); + if (prev) + vma_iter_next_range(&vmi); } /* @@ -1428,11 +1430,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vm_flags |= VM_ACCOUNT; } - if (vm_flags & VM_SPECIAL) { - if (prev) - vma_iter_next_range(&vmi); + if (vm_flags & VM_SPECIAL) goto cannot_expand; - } /* Attempt to expand an old mapping */ /* Check next */ @@ -1453,19 +1452,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr, merge_start = prev->vm_start; vma = prev; vm_pgoff = prev->vm_pgoff; - } else if (prev) { - vma_iter_next_range(&vmi); + vma_prev(&vmi); /* Equivalent to going to the previous range */ } - /* Actually expand, if possible */ - if (vma && - !vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) { - khugepaged_enter_vma(vma, vm_flags); - goto expanded; + if (vma) { + /* Actually expand, if possible */ + if (!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) { + khugepaged_enter_vma(vma, vm_flags); + goto expanded; + } + + /* If the expand fails, then reposition the vma iterator */ + if (unlikely(vma == prev)) + vma_iter_set(&vmi, addr); } - if (vma == prev) - vma_iter_set(&vmi, addr); cannot_expand: /* @@ -1624,6 +1625,11 @@ free_vma: unacct_error: if (charged) vm_unacct_memory(charged); + +clear_tree_failed: + if (vms.vma_count) + abort_munmap_vmas(&mas_detach); +gather_failed: validate_mm(mm); return error; } diff --git a/mm/vma.c b/mm/vma.c index fc425eb34bf7..5e0ed5d63877 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -646,22 +646,6 @@ again: uprobe_mmap(vp->insert); } -/* - * abort_munmap_vmas - Undo any munmap work and free resources - * - * Reattach any detached vmas and free up the maple tree used to track the vmas. - */ -static inline void abort_munmap_vmas(struct ma_state *mas_detach) -{ - struct vm_area_struct *vma; - - mas_set(mas_detach, 0); - mas_for_each(mas_detach, vma, ULONG_MAX) - vma_mark_detached(vma, false); - - __mt_destroy(mas_detach->tree); -} - /* * vms_complete_munmap_vmas() - Finish the munmap() operation * @vms: The vma munmap struct diff --git a/mm/vma.h b/mm/vma.h index 0e214bbf443e..c85fc7c888a8 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -116,6 +116,22 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, struct ma_state *mas_detach); +/* + * abort_munmap_vmas - Undo any munmap work and free resources + * + * Reattach any detached vmas and free up the maple tree used to track the vmas. 
+ */ +static inline void abort_munmap_vmas(struct ma_state *mas_detach) +{ + struct vm_area_struct *vma; + + mas_set(mas_detach, 0); + mas_for_each(mas_detach, vma, ULONG_MAX) + vma_mark_detached(vma, false); + + __mt_destroy(mas_detach->tree); +} + int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, struct mm_struct *mm, unsigned long start, From 9c3ebeda8fb5a8e9e82ab9364ec3d4b80cd0ec3d Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:52 -0400 Subject: [PATCH 257/416] mm/vma: track start and end for munmap in vma_munmap_struct Set the start and end address for munmap when the prev and next are gathered. This is needed to avoid incorrect addresses being used during the vms_complete_munmap_vmas() function if the prev/next vma are expanded. Add a new helper vms_complete_pte_clear(), which is needed later and will avoid growing the argument list to unmap_region() beyond the 9 it already has. Link: https://lkml.kernel.org/r/20240830040101.822209-13-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/vma.c | 32 +++++++++++++++++++++++++------- mm/vma.h | 6 ++++++ 2 files changed, 31 insertions(+), 7 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index 5e0ed5d63877..c71dda026191 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -646,6 +646,26 @@ again: uprobe_mmap(vp->insert); } +static void vms_complete_pte_clear(struct vma_munmap_struct *vms, + struct ma_state *mas_detach, bool mm_wr_locked) +{ + struct mmu_gather tlb; + + /* + * We can free page tables without write-locking mmap_lock because VMAs + * were isolated before we downgraded mmap_lock. + */ + mas_set(mas_detach, 1); + lru_add_drain(); + tlb_gather_mmu(&tlb, vms->mm); + update_hiwater_rss(vms->mm); + unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, vms->vma_count, mm_wr_locked); + mas_set(mas_detach, 1); + /* start and end may be different if there is no prev or next vma. */ + free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked); + tlb_finish_mmu(&tlb); +} + /* * vms_complete_munmap_vmas() - Finish the munmap() operation * @vms: The vma munmap struct @@ -667,13 +687,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, if (vms->unlock) mmap_write_downgrade(mm); - /* - * We can free page tables without write-locking mmap_lock because VMAs - * were isolated before we downgraded mmap_lock. - */ - mas_set(mas_detach, 1); - unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next, - vms->start, vms->end, vms->vma_count, !vms->unlock); + vms_complete_pte_clear(vms, mas_detach, !vms->unlock); /* Update high watermark before we lower total_vm */ update_hiwater_vm(mm); /* Stat accounting */ @@ -745,6 +759,8 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, goto start_split_failed; } vms->prev = vma_prev(vms->vmi); + if (vms->prev) + vms->unmap_start = vms->prev->vm_end; /* * Detach a range of VMAs from the mm. Using next as a temp variable as @@ -805,6 +821,8 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, } vms->next = vma_next(vms->vmi); + if (vms->next) + vms->unmap_end = vms->next->vm_start; #if defined(CONFIG_DEBUG_VM_MAPLE_TREE) /* Make sure no VMAs are about to be lost. 
*/ diff --git a/mm/vma.h b/mm/vma.h index c85fc7c888a8..8afaa661224d 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -38,6 +38,8 @@ struct vma_munmap_struct { struct list_head *uf; /* Userfaultfd list_head */ unsigned long start; /* Aligned start addr (inclusive) */ unsigned long end; /* Aligned end addr (exclusive) */ + unsigned long unmap_start; /* Unmap PTE start */ + unsigned long unmap_end; /* Unmap PTE end */ int vma_count; /* Number of vmas that will be removed */ unsigned long nr_pages; /* Number of pages being removed */ unsigned long locked_vm; /* Number of locked pages */ @@ -78,6 +80,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long start, unsigned long end, pgoff_t pgoff); +#ifdef CONFIG_MMU /* * init_vma_munmap() - Initializer wrapper for vma_munmap_struct * @vms: The vma munmap struct @@ -108,7 +111,10 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms, vms->vma_count = 0; vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0; vms->exec_vm = vms->stack_vm = vms->data_vm = 0; + vms->unmap_start = FIRST_USER_ADDRESS; + vms->unmap_end = USER_PGTABLES_CEILING; } +#endif int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, struct ma_state *mas_detach); From 94f59ea591f17d5fb77f68e820b27522596a7e9e Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:53 -0400 Subject: [PATCH 258/416] mm: clean up unmap_region() argument list With the only caller to unmap_region() being the error path of mmap_region(), the argument list can be significantly reduced. Link: https://lkml.kernel.org/r/20240830040101.822209-14-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mmap.c | 3 +-- mm/vma.c | 17 ++++++++--------- mm/vma.h | 6 ++---- 3 files changed, 11 insertions(+), 15 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 08cf9199f314..304dc085533a 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1615,8 +1615,7 @@ unmap_and_free_vma: vma_iter_set(&vmi, vma->vm_end); /* Undo any partial mapping done by a device driver. */ - unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start, - vma->vm_end, vma->vm_end, true); + unmap_region(&vmi.mas, vma, prev, next); } if (writable_file_mapping) mapping_unmap_writable(file->f_mapping); diff --git a/mm/vma.c b/mm/vma.c index c71dda026191..83c5c46c67b9 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -155,22 +155,21 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable) * * Called with the mm semaphore held. 
*/ -void unmap_region(struct mm_struct *mm, struct ma_state *mas, - struct vm_area_struct *vma, struct vm_area_struct *prev, - struct vm_area_struct *next, unsigned long start, - unsigned long end, unsigned long tree_end, bool mm_wr_locked) +void unmap_region(struct ma_state *mas, struct vm_area_struct *vma, + struct vm_area_struct *prev, struct vm_area_struct *next) { + struct mm_struct *mm = vma->vm_mm; struct mmu_gather tlb; - unsigned long mt_start = mas->index; lru_add_drain(); tlb_gather_mmu(&tlb, mm); update_hiwater_rss(mm); - unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked); - mas_set(mas, mt_start); + unmap_vmas(&tlb, mas, vma, vma->vm_start, vma->vm_end, vma->vm_end, + /* mm_wr_locked = */ true); + mas_set(mas, vma->vm_end); free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, - next ? next->vm_start : USER_PGTABLES_CEILING, - mm_wr_locked); + next ? next->vm_start : USER_PGTABLES_CEILING, + /* mm_wr_locked = */ true); tlb_finish_mmu(&tlb); } diff --git a/mm/vma.h b/mm/vma.h index 8afaa661224d..82ba48174341 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -149,10 +149,8 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, void remove_vma(struct vm_area_struct *vma, bool unreachable); -void unmap_region(struct mm_struct *mm, struct ma_state *mas, - struct vm_area_struct *vma, struct vm_area_struct *prev, - struct vm_area_struct *next, unsigned long start, - unsigned long end, unsigned long tree_end, bool mm_wr_locked); +void unmap_region(struct ma_state *mas, struct vm_area_struct *vma, + struct vm_area_struct *prev, struct vm_area_struct *next); /* Required by mmap_region(). */ bool From f8d112a4e657c65c888e6b8a8435ef61a66e4ab8 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:54 -0400 Subject: [PATCH 259/416] mm/mmap: avoid zeroing vma tree in mmap_region() Instead of zeroing the vma tree and then overwriting the area, let the area be overwritten and then clean up the gathered vmas using vms_complete_munmap_vmas(). To ensure locking is downgraded correctly, the mm is set regardless of MAP_FIXED or not (NULL vma). If a driver is mapping over an existing vma, then clear the ptes before the call_mmap() invocation. This is done using the vms_clean_up_area() helper. If there is a close vm_ops, that must also be called to ensure any cleanup is done before mapping over the area. This also means that calling open has been added to the abort of an unmap operation, for now. Since vm_ops->open() and vm_ops->close() do not always undo each other (state cleanup may exist in ->close() that is lost forever), the code cannot be left in this way, but that change has been isolated to another commit to make this point very obvious for traceability. Temporarily keep track of the number of pages that will be removed and reduce the charged amount. This also drops the validate_mm() call in the vma_expand() function. It is necessary to drop the validate as it would fail since the mm map_count would be incorrect during a vma expansion, prior to the cleanup from vms_complete_munmap_vmas(). Clean up the error handling of the vms_gather_munmap_vmas() by calling the verification within the function. Link: https://lkml.kernel.org/r/20240830040101.822209-15-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E.
McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mmap.c | 57 ++++++++++++++++++++++++++----------------------------- mm/vma.c | 54 ++++++++++++++++++++++++++++++++++++++++------------ mm/vma.h | 22 +++++++++++++++------ 3 files changed, 85 insertions(+), 48 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 304dc085533a..405e0432c78e 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1373,23 +1373,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long merge_start = addr, merge_end = end; bool writable_file_mapping = false; pgoff_t vm_pgoff; - int error; + int error = -ENOMEM; VMA_ITERATOR(vmi, mm, addr); + unsigned long nr_pages, nr_accounted; - /* Check against address space limit. */ - if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) { - unsigned long nr_pages; + nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted); - /* - * MAP_FIXED may remove pages of mappings that intersects with - * requested mapping. Account for the pages it would unmap. - */ - nr_pages = count_vma_pages_range(mm, addr, end); - - if (!may_expand_vm(mm, vm_flags, - (len >> PAGE_SHIFT) - nr_pages)) - return -ENOMEM; - } + /* + * Check against address space limit. + * MAP_FIXED may remove pages of mappings that intersects with requested + * mapping. Account for the pages it would unmap. + */ + if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages)) + return -ENOMEM; /* Find the first overlapping VMA */ vma = vma_find(&vmi, end); @@ -1403,13 +1399,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr, if (error) goto gather_failed; - /* Remove any existing mappings from the vma tree */ - error = vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL); - if (error) - goto clear_tree_failed; - - /* Unmap any existing mapping in the area */ - vms_complete_munmap_vmas(&vms, &mas_detach); next = vms.next; prev = vms.prev; vma = NULL; @@ -1425,8 +1414,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr, */ if (accountable_mapping(file, vm_flags)) { charged = len >> PAGE_SHIFT; + charged -= nr_accounted; if (security_vm_enough_memory_mm(mm, charged)) - return -ENOMEM; + goto abort_munmap; + vms.nr_accounted = 0; vm_flags |= VM_ACCOUNT; } @@ -1475,10 +1466,8 @@ cannot_expand: * not unmapped, but the maps are removed from the list. */ vma = vm_area_alloc(mm); - if (!vma) { - error = -ENOMEM; + if (!vma) goto unacct_error; - } vma_iter_config(&vmi, addr, end); vma_set_range(vma, addr, end, pgoff); @@ -1487,6 +1476,11 @@ cannot_expand: if (file) { vma->vm_file = get_file(file); + /* + * call_mmap() may map PTE, so ensure there are no existing PTEs + * call the vm_ops close function if one exists. 
+ */ + vms_clean_up_area(&vms, &mas_detach, true); error = call_mmap(file, vma); if (error) goto unmap_and_free_vma; @@ -1577,6 +1571,9 @@ unmap_writable: expanded: perf_event_mmap(vma); + /* Unmap any existing mapping in the area */ + vms_complete_munmap_vmas(&vms, &mas_detach); + vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT); if (vm_flags & VM_LOCKED) { if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) || @@ -1605,7 +1602,7 @@ expanded: return addr; close_and_free_vma: - if (file && vma->vm_ops && vma->vm_ops->close) + if (file && !vms.closed_vm_ops && vma->vm_ops && vma->vm_ops->close) vma->vm_ops->close(vma); if (file || vma->vm_file) { @@ -1625,9 +1622,9 @@ unacct_error: if (charged) vm_unacct_memory(charged); -clear_tree_failed: - if (vms.vma_count) - abort_munmap_vmas(&mas_detach); +abort_munmap: + if (vms.nr_pages) + abort_munmap_vmas(&mas_detach, vms.closed_vm_ops); gather_failed: validate_mm(mm); return error; @@ -1960,7 +1957,7 @@ void exit_mmap(struct mm_struct *mm) do { if (vma->vm_flags & VM_ACCOUNT) nr_accounted += vma_pages(vma); - remove_vma(vma, true); + remove_vma(vma, /* unreachable = */ true, /* closed = */ false); count++; cond_resched(); vma = vma_next(&vmi); diff --git a/mm/vma.c b/mm/vma.c index 83c5c46c67b9..648c58da8ad4 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -136,10 +136,10 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags, /* * Close a vm structure and free it. */ -void remove_vma(struct vm_area_struct *vma, bool unreachable) +void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed) { might_sleep(); - if (vma->vm_ops && vma->vm_ops->close) + if (!closed && vma->vm_ops && vma->vm_ops->close) vma->vm_ops->close(vma); if (vma->vm_file) fput(vma->vm_file); @@ -521,7 +521,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, vma_iter_store(vmi, vma); vma_complete(&vp, vmi, vma->vm_mm); - validate_mm(vma->vm_mm); return 0; nomem: @@ -645,11 +644,14 @@ again: uprobe_mmap(vp->insert); } -static void vms_complete_pte_clear(struct vma_munmap_struct *vms, - struct ma_state *mas_detach, bool mm_wr_locked) +static inline void vms_clear_ptes(struct vma_munmap_struct *vms, + struct ma_state *mas_detach, bool mm_wr_locked) { struct mmu_gather tlb; + if (!vms->clear_ptes) /* Nothing to do */ + return; + /* * We can free page tables without write-locking mmap_lock because VMAs * were isolated before we downgraded mmap_lock. @@ -658,11 +660,31 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms, lru_add_drain(); tlb_gather_mmu(&tlb, vms->mm); update_hiwater_rss(vms->mm); - unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, vms->vma_count, mm_wr_locked); + unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, + vms->vma_count, mm_wr_locked); + mas_set(mas_detach, 1); /* start and end may be different if there is no prev or next vma. 
*/ - free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked); + free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, + vms->unmap_end, mm_wr_locked); tlb_finish_mmu(&tlb); + vms->clear_ptes = false; +} + +void vms_clean_up_area(struct vma_munmap_struct *vms, + struct ma_state *mas_detach, bool mm_wr_locked) +{ + struct vm_area_struct *vma; + + if (!vms->nr_pages) + return; + + vms_clear_ptes(vms, mas_detach, mm_wr_locked); + mas_set(mas_detach, 0); + mas_for_each(mas_detach, vma, ULONG_MAX) + if (vma->vm_ops && vma->vm_ops->close) + vma->vm_ops->close(vma); + vms->closed_vm_ops = true; } /* @@ -686,7 +708,10 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, if (vms->unlock) mmap_write_downgrade(mm); - vms_complete_pte_clear(vms, mas_detach, !vms->unlock); + if (!vms->nr_pages) + return; + + vms_clear_ptes(vms, mas_detach, !vms->unlock); /* Update high watermark before we lower total_vm */ update_hiwater_vm(mm); /* Stat accounting */ @@ -702,7 +727,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, /* Remove and clean up vmas */ mas_set(mas_detach, 0); mas_for_each(mas_detach, vma, ULONG_MAX) - remove_vma(vma, false); + remove_vma(vma, /* = */ false, vms->closed_vm_ops); vm_unacct_memory(vms->nr_accounted); validate_mm(mm); @@ -846,13 +871,14 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, while (vma_iter_addr(vms->vmi) > vms->start) vma_iter_prev_range(vms->vmi); + vms->clear_ptes = true; return 0; userfaultfd_error: munmap_gather_failed: end_split_failed: modify_vma_failed: - abort_munmap_vmas(mas_detach); + abort_munmap_vmas(mas_detach, /* closed = */ false); start_split_failed: map_count_exceeded: return error; @@ -897,7 +923,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, return 0; clear_tree_failed: - abort_munmap_vmas(&mas_detach); + abort_munmap_vmas(&mas_detach, /* closed = */ false); gather_failed: validate_mm(mm); return error; @@ -1615,17 +1641,21 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) } unsigned long count_vma_pages_range(struct mm_struct *mm, - unsigned long addr, unsigned long end) + unsigned long addr, unsigned long end, + unsigned long *nr_accounted) { VMA_ITERATOR(vmi, mm, addr); struct vm_area_struct *vma; unsigned long nr_pages = 0; + *nr_accounted = 0; for_each_vma_range(vmi, vma, end) { unsigned long vm_start = max(addr, vma->vm_start); unsigned long vm_end = min(end, vma->vm_end); nr_pages += PHYS_PFN(vm_end - vm_start); + if (vma->vm_flags & VM_ACCOUNT) + *nr_accounted += PHYS_PFN(vm_end - vm_start); } return nr_pages; diff --git a/mm/vma.h b/mm/vma.h index 82ba48174341..64b44f5a0a11 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -48,6 +48,8 @@ struct vma_munmap_struct { unsigned long stack_vm; unsigned long data_vm; bool unlock; /* Unlock after the munmap */ + bool clear_ptes; /* If there are outstanding PTE to be cleared */ + bool closed_vm_ops; /* call_mmap() was encountered, so vmas may be closed */ }; #ifdef CONFIG_DEBUG_VM_MAPLE_TREE @@ -96,14 +98,13 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms, unsigned long start, unsigned long end, struct list_head *uf, bool unlock) { + vms->mm = current->mm; vms->vmi = vmi; vms->vma = vma; if (vma) { - vms->mm = vma->vm_mm; vms->start = start; vms->end = end; } else { - vms->mm = NULL; vms->start = vms->end = 0; } vms->unlock = unlock; @@ -113,6 +114,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms, vms->exec_vm = vms->stack_vm = 
vms->data_vm = 0; vms->unmap_start = FIRST_USER_ADDRESS; vms->unmap_end = USER_PGTABLES_CEILING; + vms->clear_ptes = false; + vms->closed_vm_ops = false; } #endif @@ -122,18 +125,24 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, struct ma_state *mas_detach); +void vms_clean_up_area(struct vma_munmap_struct *vms, + struct ma_state *mas_detach, bool mm_wr_locked); + /* * abort_munmap_vmas - Undo any munmap work and free resources * * Reattach any detached vmas and free up the maple tree used to track the vmas. */ -static inline void abort_munmap_vmas(struct ma_state *mas_detach) +static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed) { struct vm_area_struct *vma; mas_set(mas_detach, 0); - mas_for_each(mas_detach, vma, ULONG_MAX) + mas_for_each(mas_detach, vma, ULONG_MAX) { vma_mark_detached(vma, false); + if (closed && vma->vm_ops && vma->vm_ops->open) + vma->vm_ops->open(vma); + } __mt_destroy(mas_detach->tree); } @@ -147,7 +156,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, unsigned long start, size_t len, struct list_head *uf, bool unlock); -void remove_vma(struct vm_area_struct *vma, bool unreachable); +void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed); void unmap_region(struct ma_state *mas, struct vm_area_struct *vma, struct vm_area_struct *prev, struct vm_area_struct *next); @@ -261,7 +270,8 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); int mm_take_all_locks(struct mm_struct *mm); void mm_drop_all_locks(struct mm_struct *mm); unsigned long count_vma_pages_range(struct mm_struct *mm, - unsigned long addr, unsigned long end); + unsigned long addr, unsigned long end, + unsigned long *nr_accounted); static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) { From 4f87153e82c4906e917d273ab7accd0d540aab35 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:55 -0400 Subject: [PATCH 260/416] mm: change failure of MAP_FIXED to restoring the gap on failure Prior to call_mmap(), the vmas that will be replaced need to clear the way for what may happen in the call_mmap(). This clean up work includes clearing the ptes and calling the close() vm_ops. Some users do more setup than can be restored by calling the vm_ops open() function. It is safer to store the gap in the vma tree in these cases. That is to say that the failure scenario that existed before the MAP_FIXED gap exposure is restored as it is safer than trying to undo a partial mapping. Since abort_munmap_vmas() is only reattaching vmas with this change, the function is renamed to reattach_vmas(). There is also a secondary failure that may occur if there is not enough memory to store the gap. In this case, the vmas are reattached and resources freed. If the system cannot complete the call_mmap() and fails to allocate with GFP_KERNEL, then the system will print a warning about the failure. [lorenzo.stoakes@oracle.com: fix off-by-one error in vms_abort_munmap_vmas()] Link: https://lkml.kernel.org/r/52ee7eb3-955c-4ade-b5f0-28fed8ba3d0b@lucifer.local Link: https://lkml.kernel.org/r/20240830040101.822209-16-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Signed-off-by: Lorenzo Stoakes Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. 
McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mmap.c | 3 +-- mm/vma.c | 4 +-- mm/vma.h | 80 ++++++++++++++++++++++++++++++++++++++++--------------- 3 files changed, 61 insertions(+), 26 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 405e0432c78e..e1e5c78b6c3c 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1623,8 +1623,7 @@ unacct_error: vm_unacct_memory(charged); abort_munmap: - if (vms.nr_pages) - abort_munmap_vmas(&mas_detach, vms.closed_vm_ops); + vms_abort_munmap_vmas(&vms, &mas_detach); gather_failed: validate_mm(mm); return error; diff --git a/mm/vma.c b/mm/vma.c index 648c58da8ad4..d2d71d659d1e 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -878,7 +878,7 @@ userfaultfd_error: munmap_gather_failed: end_split_failed: modify_vma_failed: - abort_munmap_vmas(mas_detach, /* closed = */ false); + reattach_vmas(mas_detach); start_split_failed: map_count_exceeded: return error; @@ -923,7 +923,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, return 0; clear_tree_failed: - abort_munmap_vmas(&mas_detach, /* closed = */ false); + reattach_vmas(&mas_detach); gather_failed: validate_mm(mm); return error; diff --git a/mm/vma.h b/mm/vma.h index 64b44f5a0a11..37b4b5d00b99 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -82,6 +82,22 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long start, unsigned long end, pgoff_t pgoff); +static inline int vma_iter_store_gfp(struct vma_iterator *vmi, + struct vm_area_struct *vma, gfp_t gfp) + +{ + if (vmi->mas.status != ma_start && + ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start))) + vma_iter_invalidate(vmi); + + __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1); + mas_store_gfp(&vmi->mas, vma, gfp); + if (unlikely(mas_is_err(&vmi->mas))) + return -ENOMEM; + + return 0; +} + #ifdef CONFIG_MMU /* * init_vma_munmap() - Initializer wrapper for vma_munmap_struct @@ -129,24 +145,60 @@ void vms_clean_up_area(struct vma_munmap_struct *vms, struct ma_state *mas_detach, bool mm_wr_locked); /* - * abort_munmap_vmas - Undo any munmap work and free resources + * reattach_vmas() - Undo any munmap work and free resources + * @mas_detach: The maple state with the detached maple tree * * Reattach any detached vmas and free up the maple tree used to track the vmas. */ -static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed) +static inline void reattach_vmas(struct ma_state *mas_detach) { struct vm_area_struct *vma; mas_set(mas_detach, 0); - mas_for_each(mas_detach, vma, ULONG_MAX) { + mas_for_each(mas_detach, vma, ULONG_MAX) vma_mark_detached(vma, false); - if (closed && vma->vm_ops && vma->vm_ops->open) - vma->vm_ops->open(vma); - } __mt_destroy(mas_detach->tree); } +/* + * vms_abort_munmap_vmas() - Undo as much as possible from an aborted munmap() + * operation. + * @vms: The vma unmap structure + * @mas_detach: The maple state with the detached maple tree + * + * Reattach any detached vmas, free up the maple tree used to track the vmas. + * If that's not possible because the ptes are cleared (and vm_ops->closed() may + * have been called), then a NULL is written over the vmas and the vmas are + * removed (munmap() completed). 
+ */ +static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms, + struct ma_state *mas_detach) +{ + struct ma_state *mas = &vms->vmi->mas; + if (!vms->nr_pages) + return; + + if (vms->clear_ptes) + return reattach_vmas(mas_detach); + + /* + * Aborting cannot just call the vm_ops open() because they are often + * not symmetrical and state data has been lost. Resort to the old + * failure method of leaving a gap where the MAP_FIXED mapping failed. + */ + mas_set_range(mas, vms->start, vms->end - 1); + if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) { + pr_warn_once("%s: (%d) Unable to abort munmap() operation\n", + current->comm, current->pid); + /* Leaving vmas detached and in-tree may hamper recovery */ + reattach_vmas(mas_detach); + } else { + /* Clean up the insertion of the unfortunate gap */ + vms_complete_munmap_vmas(vms, mas_detach); + } +} + int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, struct mm_struct *mm, unsigned long start, @@ -299,22 +351,6 @@ static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi, return mas_prev(&vmi->mas, min); } -static inline int vma_iter_store_gfp(struct vma_iterator *vmi, - struct vm_area_struct *vma, gfp_t gfp) -{ - if (vmi->mas.status != ma_start && - ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start))) - vma_iter_invalidate(vmi); - - __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1); - mas_store_gfp(&vmi->mas, vma, gfp); - if (unlikely(mas_is_err(&vmi->mas))) - return -ENOMEM; - - return 0; -} - - /* * These three helpers classifies VMAs for virtual memory accounting. */ From 5972d97c44dc00fd9436a885182adef00ae8cb22 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:56 -0400 Subject: [PATCH 261/416] mm/mmap: use PHYS_PFN in mmap_region() Instead of shifting the length by PAGE_SIZE, use PHYS_PFN. Also use the existing local variable everywhere instead of some of the time. Link: https://lkml.kernel.org/r/20240830040101.822209-17-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Reviewed-by: Suren Baghdasaryan Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mmap.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index e1e5c78b6c3c..cd09dd164e85 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1364,7 +1364,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, struct mm_struct *mm = current->mm; struct vm_area_struct *vma = NULL; struct vm_area_struct *next, *prev, *merge; - pgoff_t pglen = len >> PAGE_SHIFT; + pgoff_t pglen = PHYS_PFN(len); unsigned long charged = 0; struct vma_munmap_struct vms; struct ma_state mas_detach; @@ -1384,7 +1384,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, * MAP_FIXED may remove pages of mappings that intersects with requested * mapping. Account for the pages it would unmap. 
*/ - if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages)) + if (!may_expand_vm(mm, vm_flags, pglen - nr_pages)) return -ENOMEM; /* Find the first overlapping VMA */ @@ -1413,7 +1413,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, * Private writable mapping: check memory availability */ if (accountable_mapping(file, vm_flags)) { - charged = len >> PAGE_SHIFT; + charged = pglen; charged -= nr_accounted; if (security_vm_enough_memory_mm(mm, charged)) goto abort_munmap; @@ -1574,14 +1574,14 @@ expanded: /* Unmap any existing mapping in the area */ vms_complete_munmap_vmas(&vms, &mas_detach); - vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT); + vm_stat_account(mm, vm_flags, pglen); if (vm_flags & VM_LOCKED) { if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) || is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm)) vm_flags_clear(vma, VM_LOCKED_MASK); else - mm->locked_vm += (len >> PAGE_SHIFT); + mm->locked_vm += pglen; } if (file) From 13d77e0133908721f7da093ffd3169a92bae8b11 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:57 -0400 Subject: [PATCH 262/416] mm/mmap: use vms accounted pages in mmap_region() Change from nr_pages variable to vms.nr_accounted for the charged pages calculation. This is necessary for a future patch. This also avoids checking security_vm_enough_memory_mm() if the amount of memory won't change. Link: https://lkml.kernel.org/r/20240830040101.822209-18-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Kees Cook Reviewed-by: Lorenzo Stoakes Reviewed-by: Suren Baghdasaryan Acked-by: Paul Moore [LSM] Cc: Kees Cook Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Sidhartha Kumar Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mmap.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index cd09dd164e85..4faadc54e89d 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1414,9 +1414,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr, */ if (accountable_mapping(file, vm_flags)) { charged = pglen; - charged -= nr_accounted; - if (security_vm_enough_memory_mm(mm, charged)) + charged -= vms.nr_accounted; + if (charged && security_vm_enough_memory_mm(mm, charged)) goto abort_munmap; + vms.nr_accounted = 0; vm_flags |= VM_ACCOUNT; } From 63fc66f5b6b18f39269a66cf34d8cb7a24fbfe88 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:00:58 -0400 Subject: [PATCH 263/416] ipc/shm, mm: drop do_vma_munmap() The do_vma_munmap() wrapper existed for callers that didn't have a vma iterator and needed to check the vma mseal status prior to calling the underlying munmap(). All callers now use a vma iterator and since the mseal check has been moved to do_vmi_align_munmap() and the vmas are aligned, this function can just be called instead. do_vmi_align_munmap() can no longer be static as ipc/shm is using it and it is exported via the mm.h header. Link: https://lkml.kernel.org/r/20240830040101.822209-19-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. 
McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/mm.h | 6 +++--- ipc/shm.c | 8 ++++---- mm/mmap.c | 33 ++++++--------------------------- mm/vma.c | 12 ++++++------ mm/vma.h | 4 +--- 5 files changed, 20 insertions(+), 43 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 6f6053d7ea6f..b0ff06d18c71 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3287,14 +3287,14 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr, extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, unsigned long start, size_t len, struct list_head *uf, bool unlock); +int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, + struct mm_struct *mm, unsigned long start, + unsigned long end, struct list_head *uf, bool unlock); extern int do_munmap(struct mm_struct *, unsigned long, size_t, struct list_head *uf); extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior); #ifdef CONFIG_MMU -extern int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, - unsigned long start, unsigned long end, - struct list_head *uf, bool unlock); extern int __mm_populate(unsigned long addr, unsigned long len, int ignore_errors); static inline void mm_populate(unsigned long addr, unsigned long len) diff --git a/ipc/shm.c b/ipc/shm.c index 3e3071252dac..99564c870084 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -1778,8 +1778,8 @@ long ksys_shmdt(char __user *shmaddr) */ file = vma->vm_file; size = i_size_read(file_inode(vma->vm_file)); - do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end, - NULL, false); + do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start, + vma->vm_end, NULL, false); /* * We discovered the size of the shm segment, so * break out of here and fall through to the next @@ -1803,8 +1803,8 @@ long ksys_shmdt(char __user *shmaddr) if ((vma->vm_ops == &shm_vm_ops) && ((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) && (vma->vm_file == file)) { - do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end, - NULL, false); + do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start, + vma->vm_end, NULL, false); } vma = vma_next(&vmi); diff --git a/mm/mmap.c b/mm/mmap.c index 4faadc54e89d..6485f2300692 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -169,11 +169,12 @@ SYSCALL_DEFINE1(brk, unsigned long, brk) goto out; /* mapping intersects with an existing non-brk vma. */ /* * mm->brk must be protected by write mmap_lock. - * do_vma_munmap() will drop the lock on success, so update it - * before calling do_vma_munmap(). + * do_vmi_align_munmap() will drop the lock on success, so + * update it before calling do_vma_munmap(). */ mm->brk = brk; - if (do_vma_munmap(&vmi, brkvma, newbrk, oldbrk, &uf, true)) + if (do_vmi_align_munmap(&vmi, brkvma, mm, newbrk, oldbrk, &uf, + /* unlock = */ true)) goto out; goto success_unlocked; @@ -1479,9 +1480,9 @@ cannot_expand: vma->vm_file = get_file(file); /* * call_mmap() may map PTE, so ensure there are no existing PTEs - * call the vm_ops close function if one exists. + * and call the vm_ops close function if one exists. */ - vms_clean_up_area(&vms, &mas_detach, true); + vms_clean_up_area(&vms, &mas_detach); error = call_mmap(file, vma); if (error) goto unmap_and_free_vma; @@ -1744,28 +1745,6 @@ out: return ret; } -/* - * do_vma_munmap() - Unmap a full or partial vma. 
- * @vmi: The vma iterator pointing at the vma - * @vma: The first vma to be munmapped - * @start: the start of the address to unmap - * @end: The end of the address to unmap - * @uf: The userfaultfd list_head - * @unlock: Drop the lock on success - * - * unmaps a VMA mapping when the vma iterator is already in position. - * Does not handle alignment. - * - * Return: 0 on success drops the lock of so directed, error on failure and will - * still hold the lock. - */ -int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, - unsigned long start, unsigned long end, struct list_head *uf, - bool unlock) -{ - return do_vmi_align_munmap(vmi, vma, vma->vm_mm, start, end, uf, unlock); -} - /* * do_brk_flags() - Increase the brk vma if the flags match. * @vmi: The vma iterator diff --git a/mm/vma.c b/mm/vma.c index d2d71d659d1e..713b2196d351 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -658,8 +658,8 @@ static inline void vms_clear_ptes(struct vma_munmap_struct *vms, */ mas_set(mas_detach, 1); lru_add_drain(); - tlb_gather_mmu(&tlb, vms->mm); - update_hiwater_rss(vms->mm); + tlb_gather_mmu(&tlb, vms->vma->vm_mm); + update_hiwater_rss(vms->vma->vm_mm); unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, vms->vma_count, mm_wr_locked); @@ -672,14 +672,14 @@ static inline void vms_clear_ptes(struct vma_munmap_struct *vms, } void vms_clean_up_area(struct vma_munmap_struct *vms, - struct ma_state *mas_detach, bool mm_wr_locked) + struct ma_state *mas_detach) { struct vm_area_struct *vma; if (!vms->nr_pages) return; - vms_clear_ptes(vms, mas_detach, mm_wr_locked); + vms_clear_ptes(vms, mas_detach, true); mas_set(mas_detach, 0); mas_for_each(mas_detach, vma, ULONG_MAX) if (vma->vm_ops && vma->vm_ops->close) @@ -702,7 +702,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, struct vm_area_struct *vma; struct mm_struct *mm; - mm = vms->mm; + mm = current->mm; mm->map_count -= vms->vma_count; mm->locked_vm -= vms->locked_vm; if (vms->unlock) @@ -770,7 +770,7 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, * its limit temporarily, to help free resources as expected. */ if (vms->end < vms->vma->vm_end && - vms->mm->map_count >= sysctl_max_map_count) + vms->vma->vm_mm->map_count >= sysctl_max_map_count) goto map_count_exceeded; /* Don't bother splitting the VMA if we can't unmap it anyway */ diff --git a/mm/vma.h b/mm/vma.h index 37b4b5d00b99..b59d470cc223 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -31,7 +31,6 @@ struct unlink_vma_file_batch { */ struct vma_munmap_struct { struct vma_iterator *vmi; - struct mm_struct *mm; struct vm_area_struct *vma; /* The first vma to munmap */ struct vm_area_struct *prev; /* vma before the munmap area */ struct vm_area_struct *next; /* vma after the munmap area */ @@ -114,7 +113,6 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms, unsigned long start, unsigned long end, struct list_head *uf, bool unlock) { - vms->mm = current->mm; vms->vmi = vmi; vms->vma = vma; if (vma) { @@ -142,7 +140,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, struct ma_state *mas_detach); void vms_clean_up_area(struct vma_munmap_struct *vms, - struct ma_state *mas_detach, bool mm_wr_locked); + struct ma_state *mas_detach); /* * reattach_vmas() - Undo any munmap work and free resources From 224c1c702c08ca4d874690991f02e5b08c816e5b Mon Sep 17 00:00:00 2001 From: "Liam R. 
Howlett" Date: Fri, 30 Aug 2024 00:00:59 -0400 Subject: [PATCH 264/416] mm: move may_expand_vm() check in mmap_region() The may_expand_vm() check requires the count of the pages within the munmap range. Since this is needed for accounting and obtained later, the reodering of ma_expand_vm() to later in the call stack, after the vma munmap struct (vms) is initialised and the gather stage is potentially run, will allow for a single loop over the vmas. The gather sage does not commit any work and so everything can be undone in the case of a failure. The MAP_FIXED page count is available after the vms_gather_munmap_vmas() call, so use it instead of looping over the vmas twice. Link: https://lkml.kernel.org/r/20240830040101.822209-20-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mmap.c | 15 ++++----------- mm/vma.c | 21 --------------------- mm/vma.h | 3 --- 3 files changed, 4 insertions(+), 35 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 6485f2300692..fc726c4e98be 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1376,17 +1376,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr, pgoff_t vm_pgoff; int error = -ENOMEM; VMA_ITERATOR(vmi, mm, addr); - unsigned long nr_pages, nr_accounted; - - nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted); - - /* - * Check against address space limit. - * MAP_FIXED may remove pages of mappings that intersects with requested - * mapping. Account for the pages it would unmap. - */ - if (!may_expand_vm(mm, vm_flags, pglen - nr_pages)) - return -ENOMEM; /* Find the first overlapping VMA */ vma = vma_find(&vmi, end); @@ -1410,6 +1399,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vma_iter_next_range(&vmi); } + /* Check against address space limit. 
*/ + if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages)) + goto abort_munmap; + /* * Private writable mapping: check memory availability */ diff --git a/mm/vma.c b/mm/vma.c index 713b2196d351..3350fe2088bc 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -1640,27 +1640,6 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot) return vma_fs_can_writeback(vma); } -unsigned long count_vma_pages_range(struct mm_struct *mm, - unsigned long addr, unsigned long end, - unsigned long *nr_accounted) -{ - VMA_ITERATOR(vmi, mm, addr); - struct vm_area_struct *vma; - unsigned long nr_pages = 0; - - *nr_accounted = 0; - for_each_vma_range(vmi, vma, end) { - unsigned long vm_start = max(addr, vma->vm_start); - unsigned long vm_end = min(end, vma->vm_end); - - nr_pages += PHYS_PFN(vm_end - vm_start); - if (vma->vm_flags & VM_ACCOUNT) - *nr_accounted += PHYS_PFN(vm_end - vm_start); - } - - return nr_pages; -} - static DEFINE_MUTEX(mm_all_locks_mutex); static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma) diff --git a/mm/vma.h b/mm/vma.h index b59d470cc223..45fbc56bc0b0 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -319,9 +319,6 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot); int mm_take_all_locks(struct mm_struct *mm); void mm_drop_all_locks(struct mm_struct *mm); -unsigned long count_vma_pages_range(struct mm_struct *mm, - unsigned long addr, unsigned long end, - unsigned long *nr_accounted); static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma) { From 20831cd6f814eade5e2fffce41352d8e8de1e764 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:01:00 -0400 Subject: [PATCH 265/416] mm/vma: drop incorrect comment from vms_gather_munmap_vmas() The comment has been outdated since 6b73cff239e52 ("mm: change munmap splitting order and move_vma()"). The move_vma() was altered to fix the fragile state of the accounting since then. Link: https://lkml.kernel.org/r/20240830040101.822209-21-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/vma.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index 3350fe2088bc..1736bb237b2c 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -755,13 +755,8 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, /* * If we need to split any vma, do it now to save pain later. - * - * Note: mremap's move_vma VM_ACCOUNT handling assumes a partially - * unmapped vm_area_struct will remain in use: so lower split_vma - * places tmp vma above, and higher split_vma places tmp vma below. + * Does it split the first one? */ - - /* Does it split the first one? */ if (vms->start > vms->vma->vm_start) { /* From 723e1e8b7756a552f4d6ddc8047bffe452187617 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 30 Aug 2024 00:01:01 -0400 Subject: [PATCH 266/416] mm/vma.h: optimise vma_munmap_struct The vma_munmap_struct has a hole of 4 bytes and pushes the struct to three cachelines. Relocating the three booleans upwards allows for the struct to only use two cachelines (as reported by pahole on amd64). 
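The packing effect is easy to reproduce outside the kernel with a minimal userspace sketch, assuming an LP64 target such as x86-64; the struct and field names below merely mirror vma_munmap_struct (the pointer members are collapsed into a plain array), so this is an illustration of the layout rule, not the kernel definition. It prints 136 bytes for the old ordering and 128 for the new one. The pahole output for the real structure, before and after the move, follows.

/* layout_sketch.c: where the three bools sit decides whether a 4-byte
 * hole follows the lone int member on LP64. */
#include <stdbool.h>
#include <stdio.h>

struct old_layout {
	void *ptrs[5];                            /* vmi, vma, prev, next, uf */
	unsigned long start, end, unmap_start, unmap_end;
	int vma_count;                            /* 4-byte hole follows here */
	unsigned long nr_pages, locked_vm, nr_accounted;
	unsigned long exec_vm, stack_vm, data_vm;
	bool unlock, clear_ptes, closed_vm_ops;   /* trailing: pads out to 136 */
};

struct new_layout {
	void *ptrs[5];
	unsigned long start, end, unmap_start, unmap_end;
	int vma_count;
	bool unlock, clear_ptes, closed_vm_ops;   /* packed into the old hole */
	unsigned long nr_pages, locked_vm, nr_accounted;
	unsigned long exec_vm, stack_vm, data_vm;
};

int main(void)
{
	printf("old: %zu bytes\n", sizeof(struct old_layout)); /* 136: 3 cachelines */
	printf("new: %zu bytes\n", sizeof(struct new_layout)); /* 128: 2 cachelines */
	return 0;
}
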
Before: struct vma_munmap_struct { struct vma_iterator * vmi; /* 0 8 */ struct vm_area_struct * vma; /* 8 8 */ struct vm_area_struct * prev; /* 16 8 */ struct vm_area_struct * next; /* 24 8 */ struct list_head * uf; /* 32 8 */ long unsigned int start; /* 40 8 */ long unsigned int end; /* 48 8 */ long unsigned int unmap_start; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ long unsigned int unmap_end; /* 64 8 */ int vma_count; /* 72 4 */ /* XXX 4 bytes hole, try to pack */ long unsigned int nr_pages; /* 80 8 */ long unsigned int locked_vm; /* 88 8 */ long unsigned int nr_accounted; /* 96 8 */ long unsigned int exec_vm; /* 104 8 */ long unsigned int stack_vm; /* 112 8 */ long unsigned int data_vm; /* 120 8 */ /* --- cacheline 2 boundary (128 bytes) --- */ bool unlock; /* 128 1 */ bool clear_ptes; /* 129 1 */ bool closed_vm_ops; /* 130 1 */ /* size: 136, cachelines: 3, members: 19 */ /* sum members: 127, holes: 1, sum holes: 4 */ /* padding: 5 */ /* last cacheline: 8 bytes */ }; After: struct vma_munmap_struct { struct vma_iterator * vmi; /* 0 8 */ struct vm_area_struct * vma; /* 8 8 */ struct vm_area_struct * prev; /* 16 8 */ struct vm_area_struct * next; /* 24 8 */ struct list_head * uf; /* 32 8 */ long unsigned int start; /* 40 8 */ long unsigned int end; /* 48 8 */ long unsigned int unmap_start; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ long unsigned int unmap_end; /* 64 8 */ int vma_count; /* 72 4 */ bool unlock; /* 76 1 */ bool clear_ptes; /* 77 1 */ bool closed_vm_ops; /* 78 1 */ /* XXX 1 byte hole, try to pack */ long unsigned int nr_pages; /* 80 8 */ long unsigned int locked_vm; /* 88 8 */ long unsigned int nr_accounted; /* 96 8 */ long unsigned int exec_vm; /* 104 8 */ long unsigned int stack_vm; /* 112 8 */ long unsigned int data_vm; /* 120 8 */ /* size: 128, cachelines: 2, members: 19 */ /* sum members: 127, holes: 1, sum holes: 1 */ }; Link: https://lkml.kernel.org/r/20240830040101.822209-22-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Mark Brown Cc: Matthew Wilcox Cc: "Paul E. 
McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/vma.h | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/mm/vma.h b/mm/vma.h index 45fbc56bc0b0..2a3a5c89a33b 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -40,15 +40,16 @@ struct vma_munmap_struct { unsigned long unmap_start; /* Unmap PTE start */ unsigned long unmap_end; /* Unmap PTE end */ int vma_count; /* Number of vmas that will be removed */ + bool unlock; /* Unlock after the munmap */ + bool clear_ptes; /* If there are outstanding PTE to be cleared */ + bool closed_vm_ops; /* call_mmap() was encountered, so vmas may be closed */ + /* 1 byte hole */ unsigned long nr_pages; /* Number of pages being removed */ unsigned long locked_vm; /* Number of locked pages */ unsigned long nr_accounted; /* Number of VM_ACCOUNT pages */ unsigned long exec_vm; unsigned long stack_vm; unsigned long data_vm; - bool unlock; /* Unlock after the munmap */ - bool clear_ptes; /* If there are outstanding PTE to be cleared */ - bool closed_vm_ops; /* call_mmap() was encountered, so vmas may be closed */ }; #ifdef CONFIG_DEBUG_VM_MAPLE_TREE From 4e52a60ac5c079000add146ca39b0edcc30f74cd Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 30 Aug 2024 19:10:13 +0100 Subject: [PATCH 267/416] tools: improve vma test Makefile Patch series "mm: remove vma_merge()", v3. The infamous vma_merge() function has been the cause of a great deal of pain, bugs and confusion for a very long time. It is subtle, contains many corner cases, tries to do far too much and is as a result very fragile. The fact that the function requires there to be a numbering system to cover each possible eventuality with references to each in the many branches of its implementation as to which case you are looking at speaks to all this. Some of this complexity is inherent - unfortunately there is no getting away from the need to figure out precisely how to execute the merge, whether we need to remove VMAs, whether it is safe to do so, what constitutes a mergeable VMA and so on. However, a lot of the complexity is not inherent but instead a product of the function's 'organic' development. Liam has gone to great lengths to improve the situation as a part of his maple tree implementation, greatly improving the readability of the code, and Vlastimil and myself have additionally gone to lengths to try to improve things further. However, with the availability of userland VMA testing, it now becomes possible to perform a rather more significant refactoring while maintaining confidence in its correct operation. An attempt was previously made by Vlastimil [0] to eliminate vma_merge(), however it was rather - brutal - and an astute reader might refer to the date of that patch for insight as to its intent. This series instead divides merge operations into two natural kinds - merges which occur when a NEW vma is being added to the address space, and merges which occur when a vma is being MODIFIED. Happily, the vma_expand() function introduced by Liam, which has the capacity for also deleting a subsequent VMA, covers each of the NEW vma cases. By abstracting the actual final commit of changes to a VMA to its own function, commit_merge() and writing a wrapper around vma_expand() for new VMA cases vma_merge_new_range(), we can avoid having to use vma_merge() for these instances altogether. 
By doing so we are also able to then de-duplicate all existing merge logic in mmap_region() and do_brk_flags() and have everything invoke this new function, so we universally take the same approach to merging new VMAs. Having done so, we can then completely rework vma_merge() into vma_merge_existing_range() and use this for the instances where a merge is proposed for a region of an existing VMA. This eliminates vma_merge() and its numbered cases and instead divides things into logical cases - merge both, merge left, merge right (the latter 2 being either partial or full merges). The code is heavily annotated with ASCII diagrams and greatly simplified in comparison to the existing vma_merge() function. Having made this change, we take the opportunity to address an issue with merging VMAs possessing a vm_ops->close() hook - commit 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test") and commit fc0c8f9089c2 ("mm, mmap: fix vma_merge() case 7 with vma_ops->close") make efforts to relax how we handle these, making assumptions about which VMAs might end up deleted (and thus, if possessing a vm_ops->close() hook, cannot be). This refactor means we do not need to guess, so instead explicitly only disallow merge in instances where a VMA with a vm_ops->close() hook would be deleted (and try a smaller merge in cases where this is possible). In addition to these changes, we introduce a new vma_merge_struct abstraction to allow VMA merge state to be threaded through the operation neatly. There is heavy unit testing provided for all merge functionality, added prior to the refactoring, allowing for before/after testing. The vm_ops->close() change also introduces exhaustive testing to demonstrate that this functions as expected, and in addition to this the reproduction code from commit fc0c8f9089c2 ("mm, mmap: fix vma_merge() case 7 with vma_ops->close") was tested and confirmed passing. [0]:https://lore.kernel.org/linux-mm/20240401192623.18575-2-vbabka@suse.cz/ This patch (of 10): Have vma.o depend on its source dependencies explicitly, as previously these were simply being ignored as existing object files were up to date. This now correctly re-triggers the build if mm/ source is changed as well as local source code. Also set clean as a phony rule. Link: https://lkml.kernel.org/r/cover.1725040657.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/e3ea58f08364ae5432c9a074de0195a7c7e0b04a.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Liam R. Howlett Cc: Mark Brown Cc: Vlastimil Babka Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Matthew Wilcox Cc: "Paul E. 
McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- tools/testing/vma/Makefile | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/tools/testing/vma/Makefile b/tools/testing/vma/Makefile index bfc905d222cf..860fd2311dcc 100644 --- a/tools/testing/vma/Makefile +++ b/tools/testing/vma/Makefile @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-or-later -.PHONY: default +.PHONY: default clean default: vma @@ -9,7 +9,9 @@ include ../shared/shared.mk OFILES = $(SHARED_OFILES) vma.o maple-shim.o TARGETS = vma -vma: $(OFILES) vma_internal.h ../../../mm/vma.c ../../../mm/vma.h +vma.o: vma.c vma_internal.h ../../../mm/vma.c ../../../mm/vma.h + +vma: $(OFILES) $(CC) $(CFLAGS) -o $@ $(OFILES) $(LDLIBS) clean: From 955db39676b6de84283b370d03683171b67dceb3 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 30 Aug 2024 19:10:14 +0100 Subject: [PATCH 268/416] tools: add VMA merge tests Add a variety of VMA merge unit tests to assert that the behaviour of VMA merge is correct at an abstract level and VMAs are merged or not merged as expected. These are intentionally added _before_ we start refactoring vma_merge() in order that we can continually assert correctness throughout the rest of the series. In order to reduce churn going forward, we backport the vma_merge_struct data type to the test code which we introduce and use in a future commit, and add wrappers around the merge new and existing VMA cases. Link: https://lkml.kernel.org/r/1c7a0b43cfad2c511a6b1b52f3507696478ff51a.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Liam R. Howlett Cc: Mark Brown Cc: Vlastimil Babka Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- tools/testing/vma/vma.c | 1282 +++++++++++++++++++++++++++++- tools/testing/vma/vma_internal.h | 45 +- 2 files changed, 1317 insertions(+), 10 deletions(-) diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 48e033c60d87..71bd30d5da81 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -7,13 +7,43 @@ #include "maple-shared.h" #include "vma_internal.h" +/* Include so header guard set. */ +#include "../../../mm/vma.h" + +static bool fail_prealloc; + +/* Then override vma_iter_prealloc() so we can choose to fail it. */ +#define vma_iter_prealloc(vmi, vma) \ + (fail_prealloc ? -ENOMEM : mas_preallocate(&(vmi)->mas, (vma), GFP_KERNEL)) + /* * Directly import the VMA implementation here. Our vma_internal.h wrapper * provides userland-equivalent functionality for everything vma.c uses. */ #include "../../../mm/vma.c" +/* + * Temporarily forward-ported from a future in which vmg's are used for merging. + */ +struct vma_merge_struct { + struct mm_struct *mm; + struct vma_iterator *vmi; + pgoff_t pgoff; + struct vm_area_struct *prev; + struct vm_area_struct *next; /* Modified by vma_merge(). */ + struct vm_area_struct *vma; /* Either a new VMA or the one being modified. 
*/ + unsigned long start; + unsigned long end; + unsigned long flags; + struct file *file; + struct anon_vma *anon_vma; + struct mempolicy *policy; + struct vm_userfaultfd_ctx uffd_ctx; + struct anon_vma_name *anon_name; +}; + const struct vm_operations_struct vma_dummy_vm_ops; +static struct anon_vma dummy_anon_vma; #define ASSERT_TRUE(_expr) \ do { \ @@ -28,6 +58,14 @@ const struct vm_operations_struct vma_dummy_vm_ops; #define ASSERT_EQ(_val1, _val2) ASSERT_TRUE((_val1) == (_val2)) #define ASSERT_NE(_val1, _val2) ASSERT_TRUE((_val1) != (_val2)) +static struct task_struct __current; + +struct task_struct *get_current(void) +{ + return &__current; +} + +/* Helper function to simply allocate a VMA. */ static struct vm_area_struct *alloc_vma(struct mm_struct *mm, unsigned long start, unsigned long end, @@ -47,22 +85,201 @@ static struct vm_area_struct *alloc_vma(struct mm_struct *mm, return ret; } +/* Helper function to allocate a VMA and link it to the tree. */ +static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm, + unsigned long start, + unsigned long end, + pgoff_t pgoff, + vm_flags_t flags) +{ + struct vm_area_struct *vma = alloc_vma(mm, start, end, pgoff, flags); + + if (vma == NULL) + return NULL; + + if (vma_link(mm, vma)) { + vm_area_free(vma); + return NULL; + } + + /* + * Reset this counter which we use to track whether writes have + * begun. Linking to the tree will have caused this to be incremented, + * which means we will get a false positive otherwise. + */ + vma->vm_lock_seq = -1; + + return vma; +} + +/* Helper function which provides a wrapper around a merge new VMA operation. */ +static struct vm_area_struct *merge_new(struct vma_merge_struct *vmg) +{ + /* vma_merge() needs a VMA to determine mm, anon_vma, and file. */ + struct vm_area_struct dummy = { + .vm_mm = vmg->mm, + .vm_flags = vmg->flags, + .anon_vma = vmg->anon_vma, + .vm_file = vmg->file, + }; + + /* + * For convenience, get prev and next VMAs. Which the new VMA operation + * requires. + */ + vmg->next = vma_next(vmg->vmi); + vmg->prev = vma_prev(vmg->vmi); + + vma_iter_set(vmg->vmi, vmg->start); + return vma_merge_new_vma(vmg->vmi, vmg->prev, &dummy, vmg->start, + vmg->end, vmg->pgoff); +} + +/* + * Helper function which provides a wrapper around a merge existing VMA + * operation. + */ +static struct vm_area_struct *merge_existing(struct vma_merge_struct *vmg) +{ + /* vma_merge() needs a VMA to determine mm, anon_vma, and file. */ + struct vm_area_struct dummy = { + .vm_mm = vmg->mm, + .vm_flags = vmg->flags, + .anon_vma = vmg->anon_vma, + .vm_file = vmg->file, + }; + + return vma_merge(vmg->vmi, vmg->prev, &dummy, vmg->start, vmg->end, + vmg->flags, vmg->pgoff, vmg->policy, vmg->uffd_ctx, + vmg->anon_name); +} + +/* + * Helper function which provides a wrapper around the expansion of an existing + * VMA. + */ +static int expand_existing(struct vma_merge_struct *vmg) +{ + return vma_expand(vmg->vmi, vmg->vma, vmg->start, vmg->end, vmg->pgoff, + vmg->next); +} + +/* + * Helper function to reset merge state the associated VMA iterator to a + * specified new range. + */ +static void vmg_set_range(struct vma_merge_struct *vmg, unsigned long start, + unsigned long end, pgoff_t pgoff, vm_flags_t flags) +{ + vma_iter_set(vmg->vmi, start); + + vmg->prev = NULL; + vmg->next = NULL; + vmg->vma = NULL; + + vmg->start = start; + vmg->end = end; + vmg->pgoff = pgoff; + vmg->flags = flags; +} + +/* + * Helper function to try to merge a new VMA. 
+ * + * Update vmg and the iterator for it and try to merge, otherwise allocate a new + * VMA, link it to the maple tree and return it. + */ +static struct vm_area_struct *try_merge_new_vma(struct mm_struct *mm, + struct vma_merge_struct *vmg, + unsigned long start, unsigned long end, + pgoff_t pgoff, vm_flags_t flags, + bool *was_merged) +{ + struct vm_area_struct *merged; + + vmg_set_range(vmg, start, end, pgoff, flags); + + merged = merge_new(vmg); + if (merged) { + *was_merged = true; + return merged; + } + + *was_merged = false; + return alloc_and_link_vma(mm, start, end, pgoff, flags); +} + +/* + * Helper function to reset the dummy anon_vma to indicate it has not been + * duplicated. + */ +static void reset_dummy_anon_vma(void) +{ + dummy_anon_vma.was_cloned = false; + dummy_anon_vma.was_unlinked = false; +} + +/* + * Helper function to remove all VMAs and destroy the maple tree associated with + * a virtual address space. Returns a count of VMAs in the tree. + */ +static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi) +{ + struct vm_area_struct *vma; + int count = 0; + + fail_prealloc = false; + reset_dummy_anon_vma(); + + vma_iter_set(vmi, 0); + for_each_vma(*vmi, vma) { + vm_area_free(vma); + count++; + } + + mtree_destroy(&mm->mm_mt); + mm->map_count = 0; + return count; +} + +/* Helper function to determine if VMA has had vma_start_write() performed. */ +static bool vma_write_started(struct vm_area_struct *vma) +{ + int seq = vma->vm_lock_seq; + + /* We reset after each check. */ + vma->vm_lock_seq = -1; + + /* The vma_start_write() stub simply increments this value. */ + return seq > -1; +} + +/* Helper function providing a dummy vm_ops->close() method.*/ +static void dummy_close(struct vm_area_struct *) +{ +} + static bool test_simple_merge(void) { struct vm_area_struct *vma; unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; struct vm_area_struct *vma_left = alloc_vma(&mm, 0, 0x1000, 0, flags); - struct vm_area_struct *vma_middle = alloc_vma(&mm, 0x1000, 0x2000, 1, flags); struct vm_area_struct *vma_right = alloc_vma(&mm, 0x2000, 0x3000, 2, flags); VMA_ITERATOR(vmi, &mm, 0x1000); + struct vma_merge_struct vmg = { + .mm = &mm, + .vmi = &vmi, + .start = 0x1000, + .end = 0x2000, + .flags = flags, + .pgoff = 1, + }; ASSERT_FALSE(vma_link(&mm, vma_left)); - ASSERT_FALSE(vma_link(&mm, vma_middle)); ASSERT_FALSE(vma_link(&mm, vma_right)); - vma = vma_merge_new_vma(&vmi, vma_left, vma_middle, 0x1000, - 0x2000, 1); + vma = merge_new(&vmg); ASSERT_NE(vma, NULL); ASSERT_EQ(vma->vm_start, 0); @@ -142,10 +359,17 @@ static bool test_simple_expand(void) struct mm_struct mm = {}; struct vm_area_struct *vma = alloc_vma(&mm, 0, 0x1000, 0, flags); VMA_ITERATOR(vmi, &mm, 0); + struct vma_merge_struct vmg = { + .vmi = &vmi, + .vma = vma, + .start = 0, + .end = 0x3000, + .pgoff = 0, + }; ASSERT_FALSE(vma_link(&mm, vma)); - ASSERT_FALSE(vma_expand(&vmi, vma, 0, 0x3000, 0, NULL)); + ASSERT_FALSE(expand_existing(&vmg)); ASSERT_EQ(vma->vm_start, 0); ASSERT_EQ(vma->vm_end, 0x3000); @@ -178,6 +402,1042 @@ static bool test_simple_shrink(void) return true; } +static bool test_merge_new(void) +{ + unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + struct mm_struct mm = {}; + VMA_ITERATOR(vmi, &mm, 0); + struct vma_merge_struct vmg = { + .mm = &mm, + .vmi = &vmi, + }; + struct anon_vma_chain dummy_anon_vma_chain_a = { + .anon_vma = &dummy_anon_vma, + }; + struct anon_vma_chain dummy_anon_vma_chain_b = { + .anon_vma = 
&dummy_anon_vma, + }; + struct anon_vma_chain dummy_anon_vma_chain_c = { + .anon_vma = &dummy_anon_vma, + }; + struct anon_vma_chain dummy_anon_vma_chain_d = { + .anon_vma = &dummy_anon_vma, + }; + int count; + struct vm_area_struct *vma, *vma_a, *vma_b, *vma_c, *vma_d; + bool merged; + + /* + * 0123456789abc + * AA B CC + */ + vma_a = alloc_and_link_vma(&mm, 0, 0x2000, 0, flags); + ASSERT_NE(vma_a, NULL); + /* We give each VMA a single avc so we can test anon_vma duplication. */ + INIT_LIST_HEAD(&vma_a->anon_vma_chain); + list_add(&dummy_anon_vma_chain_a.same_vma, &vma_a->anon_vma_chain); + + vma_b = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, flags); + ASSERT_NE(vma_b, NULL); + INIT_LIST_HEAD(&vma_b->anon_vma_chain); + list_add(&dummy_anon_vma_chain_b.same_vma, &vma_b->anon_vma_chain); + + vma_c = alloc_and_link_vma(&mm, 0xb000, 0xc000, 0xb, flags); + ASSERT_NE(vma_c, NULL); + INIT_LIST_HEAD(&vma_c->anon_vma_chain); + list_add(&dummy_anon_vma_chain_c.same_vma, &vma_c->anon_vma_chain); + + /* + * NO merge. + * + * 0123456789abc + * AA B ** CC + */ + vma_d = try_merge_new_vma(&mm, &vmg, 0x7000, 0x9000, 7, flags, &merged); + ASSERT_NE(vma_d, NULL); + INIT_LIST_HEAD(&vma_d->anon_vma_chain); + list_add(&dummy_anon_vma_chain_d.same_vma, &vma_d->anon_vma_chain); + ASSERT_FALSE(merged); + ASSERT_EQ(mm.map_count, 4); + + /* + * Merge BOTH sides. + * + * 0123456789abc + * AA*B DD CC + */ + vma_b->anon_vma = &dummy_anon_vma; + vma = try_merge_new_vma(&mm, &vmg, 0x2000, 0x3000, 2, flags, &merged); + ASSERT_EQ(vma, vma_a); + /* Merge with A, delete B. */ + ASSERT_TRUE(merged); + ASSERT_EQ(vma->vm_start, 0); + ASSERT_EQ(vma->vm_end, 0x4000); + ASSERT_EQ(vma->vm_pgoff, 0); + ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); + ASSERT_TRUE(vma_write_started(vma)); + ASSERT_EQ(mm.map_count, 3); + + /* + * Merge to PREVIOUS VMA. + * + * 0123456789abc + * AAAA* DD CC + */ + vma = try_merge_new_vma(&mm, &vmg, 0x4000, 0x5000, 4, flags, &merged); + ASSERT_EQ(vma, vma_a); + /* Extend A. */ + ASSERT_TRUE(merged); + ASSERT_EQ(vma->vm_start, 0); + ASSERT_EQ(vma->vm_end, 0x5000); + ASSERT_EQ(vma->vm_pgoff, 0); + ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); + ASSERT_TRUE(vma_write_started(vma)); + ASSERT_EQ(mm.map_count, 3); + + /* + * Merge to NEXT VMA. + * + * 0123456789abc + * AAAAA *DD CC + */ + vma_d->anon_vma = &dummy_anon_vma; + vma = try_merge_new_vma(&mm, &vmg, 0x6000, 0x7000, 6, flags, &merged); + ASSERT_EQ(vma, vma_d); + /* Prepend. */ + ASSERT_TRUE(merged); + ASSERT_EQ(vma->vm_start, 0x6000); + ASSERT_EQ(vma->vm_end, 0x9000); + ASSERT_EQ(vma->vm_pgoff, 6); + ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); + ASSERT_TRUE(vma_write_started(vma)); + ASSERT_EQ(mm.map_count, 3); + + /* + * Merge BOTH sides. + * + * 0123456789abc + * AAAAA*DDD CC + */ + vma = try_merge_new_vma(&mm, &vmg, 0x5000, 0x6000, 5, flags, &merged); + ASSERT_EQ(vma, vma_a); + /* Merge with A, delete D. */ + ASSERT_TRUE(merged); + ASSERT_EQ(vma->vm_start, 0); + ASSERT_EQ(vma->vm_end, 0x9000); + ASSERT_EQ(vma->vm_pgoff, 0); + ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); + ASSERT_TRUE(vma_write_started(vma)); + ASSERT_EQ(mm.map_count, 2); + + /* + * Merge to NEXT VMA. + * + * 0123456789abc + * AAAAAAAAA *CC + */ + vma_c->anon_vma = &dummy_anon_vma; + vma = try_merge_new_vma(&mm, &vmg, 0xa000, 0xb000, 0xa, flags, &merged); + ASSERT_EQ(vma, vma_c); + /* Prepend C. 
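The new range 0xa000-0xb000 lies immediately before C (0xb000-0xc000) but is not adjacent to A, which ends at 0x9000, so C is extended downwards.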
*/ + ASSERT_TRUE(merged); + ASSERT_EQ(vma->vm_start, 0xa000); + ASSERT_EQ(vma->vm_end, 0xc000); + ASSERT_EQ(vma->vm_pgoff, 0xa); + ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); + ASSERT_TRUE(vma_write_started(vma)); + ASSERT_EQ(mm.map_count, 2); + + /* + * Merge BOTH sides. + * + * 0123456789abc + * AAAAAAAAA*CCC + */ + vma = try_merge_new_vma(&mm, &vmg, 0x9000, 0xa000, 0x9, flags, &merged); + ASSERT_EQ(vma, vma_a); + /* Extend A and delete C. */ + ASSERT_TRUE(merged); + ASSERT_EQ(vma->vm_start, 0); + ASSERT_EQ(vma->vm_end, 0xc000); + ASSERT_EQ(vma->vm_pgoff, 0); + ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); + ASSERT_TRUE(vma_write_started(vma)); + ASSERT_EQ(mm.map_count, 1); + + /* + * Final state. + * + * 0123456789abc + * AAAAAAAAAAAAA + */ + + count = 0; + vma_iter_set(&vmi, 0); + for_each_vma(vmi, vma) { + ASSERT_NE(vma, NULL); + ASSERT_EQ(vma->vm_start, 0); + ASSERT_EQ(vma->vm_end, 0xc000); + ASSERT_EQ(vma->vm_pgoff, 0); + ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); + + vm_area_free(vma); + count++; + } + + /* Should only have one VMA left (though freed) after all is done.*/ + ASSERT_EQ(count, 1); + + mtree_destroy(&mm.mm_mt); + return true; +} + +static bool test_vma_merge_special_flags(void) +{ + unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + struct mm_struct mm = {}; + VMA_ITERATOR(vmi, &mm, 0); + struct vma_merge_struct vmg = { + .mm = &mm, + .vmi = &vmi, + }; + vm_flags_t special_flags[] = { VM_IO, VM_DONTEXPAND, VM_PFNMAP, VM_MIXEDMAP }; + vm_flags_t all_special_flags = 0; + int i; + struct vm_area_struct *vma_left, *vma; + + /* Make sure there aren't new VM_SPECIAL flags. */ + for (i = 0; i < ARRAY_SIZE(special_flags); i++) { + all_special_flags |= special_flags[i]; + } + ASSERT_EQ(all_special_flags, VM_SPECIAL); + + /* + * 01234 + * AAA + */ + vma_left = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + ASSERT_NE(vma_left, NULL); + + /* 1. Set up new VMA with special flag that would otherwise merge. */ + + /* + * 01234 + * AAA* + * + * This should merge if not for the VM_SPECIAL flag. + */ + vmg_set_range(&vmg, 0x3000, 0x4000, 3, flags); + for (i = 0; i < ARRAY_SIZE(special_flags); i++) { + vm_flags_t special_flag = special_flags[i]; + + vma_left->__vm_flags = flags | special_flag; + vmg.flags = flags | special_flag; + vma = merge_new(&vmg); + ASSERT_EQ(vma, NULL); + } + + /* 2. Modify VMA with special flag that would otherwise merge. */ + + /* + * 01234 + * AAAB + * + * Create a VMA to modify. + */ + vma = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, flags); + ASSERT_NE(vma, NULL); + vmg.vma = vma; + + for (i = 0; i < ARRAY_SIZE(special_flags); i++) { + vm_flags_t special_flag = special_flags[i]; + + vma_left->__vm_flags = flags | special_flag; + vmg.flags = flags | special_flag; + vma = merge_existing(&vmg); + ASSERT_EQ(vma, NULL); + } + + cleanup_mm(&mm, &vmi); + return true; +} + +static bool test_vma_merge_with_close(void) +{ + unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + struct mm_struct mm = {}; + VMA_ITERATOR(vmi, &mm, 0); + struct vma_merge_struct vmg = { + .mm = &mm, + .vmi = &vmi, + }; + const struct vm_operations_struct vm_ops = { + .close = dummy_close, + }; + struct vm_area_struct *vma_next = + alloc_and_link_vma(&mm, 0x2000, 0x3000, 2, flags); + struct vm_area_struct *vma; + + /* + * When we merge VMAs we sometimes have to delete others as part of the + * operation. 
+ * + * Considering the two possible adjacent VMAs to which a VMA can be + * merged: + * + * [ prev ][ vma ][ next ] + * + * In no case will we need to delete prev. If the operation is + * mergeable, then prev will be extended with one or both of vma and + * next deleted. + * + * As a result, during initial mergeability checks, only + * can_vma_merge_before() (which implies the VMA being merged with is + * 'next' as shown above) bothers to check to see whether the next VMA + * has a vm_ops->close() callback that will need to be called when + * removed. + * + * If it does, then we cannot merge as the resources that the close() + * operation potentially clears down are tied only to the existing VMA + * range and we have no way of extending those to the nearly merged one. + * + * We must consider two scenarios: + * + * A. + * + * vm_ops->close: - - !NULL + * [ prev ][ vma ][ next ] + * + * Where prev may or may not be present/mergeable. + * + * This is picked up by a specific check in can_vma_merge_before(). + * + * B. + * + * vm_ops->close: - !NULL + * [ prev ][ vma ] + * + * Where prev and vma are present and mergeable. + * + * This is picked up by a specific check in the modified VMA merge. + * + * IMPORTANT NOTE: We make the assumption that the following case: + * + * - !NULL NULL + * [ prev ][ vma ][ next ] + * + * Cannot occur, because vma->vm_ops being the same implies the same + * vma->vm_file, and therefore this would mean that next->vm_ops->close + * would be set too, and thus scenario A would pick this up. + */ + + ASSERT_NE(vma_next, NULL); + + /* + * SCENARIO A + * + * 0123 + * *N + */ + + /* Make the next VMA have a close() callback. */ + vma_next->vm_ops = &vm_ops; + + /* Our proposed VMA has characteristics that would otherwise be merged. */ + vmg_set_range(&vmg, 0x1000, 0x2000, 1, flags); + + /* The next VMA having a close() operator should cause the merge to fail.*/ + ASSERT_EQ(merge_new(&vmg), NULL); + + /* Now create the VMA so we can merge via modified flags */ + vmg_set_range(&vmg, 0x1000, 0x2000, 1, flags); + vma = alloc_and_link_vma(&mm, 0x1000, 0x2000, 1, flags); + vmg.vma = vma; + + /* + * The VMA being modified in a way that would otherwise merge should + * also fail. + */ + ASSERT_EQ(merge_existing(&vmg), NULL); + + /* SCENARIO B + * + * 0123 + * P* + * + * In order for this scenario to trigger, the VMA currently being + * modified must also have a .close(). + */ + + /* Reset VMG state. */ + vmg_set_range(&vmg, 0x1000, 0x2000, 1, flags); + /* + * Make next unmergeable, and don't let the scenario A check pick this + * up, we want to reproduce scenario B only. + */ + vma_next->vm_ops = NULL; + vma_next->__vm_flags &= ~VM_MAYWRITE; + /* Allocate prev. */ + vmg.prev = alloc_and_link_vma(&mm, 0, 0x1000, 0, flags); + /* Assign a vm_ops->close() function to VMA explicitly. */ + vma->vm_ops = &vm_ops; + vmg.vma = vma; + /* Make sure merge does not occur. 
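With vma itself now carrying a vm_ops->close() hook, the modified-VMA path (scenario B) must refuse to merge it into prev.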
*/ + ASSERT_EQ(merge_existing(&vmg), NULL); + + cleanup_mm(&mm, &vmi); + return true; +} + +static bool test_vma_merge_new_with_close(void) +{ + unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + struct mm_struct mm = {}; + VMA_ITERATOR(vmi, &mm, 0); + struct vma_merge_struct vmg = { + .mm = &mm, + .vmi = &vmi, + }; + struct vm_area_struct *vma_prev = alloc_and_link_vma(&mm, 0, 0x2000, 0, flags); + struct vm_area_struct *vma_next = alloc_and_link_vma(&mm, 0x5000, 0x7000, 5, flags); + const struct vm_operations_struct vm_ops = { + .close = dummy_close, + }; + struct vm_area_struct *vma; + + /* + * We should allow the partial merge of a proposed new VMA if the + * surrounding VMAs have vm_ops->close() hooks (but are otherwise + * compatible), e.g.: + * + * New VMA + * A v-------v B + * |-----| |-----| + * close close + * + * Since the rule is to not DELETE a VMA with a close operation, this + * should be permitted, only rather than expanding A and deleting B, we + * should simply expand A and leave B intact, e.g.: + * + * New VMA + * A B + * |------------||-----| + * close close + */ + + /* Have prev and next have a vm_ops->close() hook. */ + vma_prev->vm_ops = &vm_ops; + vma_next->vm_ops = &vm_ops; + + vmg_set_range(&vmg, 0x2000, 0x5000, 2, flags); + vma = merge_new(&vmg); + ASSERT_NE(vma, NULL); + ASSERT_EQ(vma->vm_start, 0); + ASSERT_EQ(vma->vm_end, 0x5000); + ASSERT_EQ(vma->vm_pgoff, 0); + ASSERT_EQ(vma->vm_ops, &vm_ops); + ASSERT_TRUE(vma_write_started(vma)); + ASSERT_EQ(mm.map_count, 2); + + cleanup_mm(&mm, &vmi); + return true; +} + +static bool test_merge_existing(void) +{ + unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + struct mm_struct mm = {}; + VMA_ITERATOR(vmi, &mm, 0); + struct vm_area_struct *vma, *vma_prev, *vma_next; + struct vma_merge_struct vmg = { + .mm = &mm, + .vmi = &vmi, + }; + + /* + * Merge right case - partial span. + * + * <-> + * 0123456789 + * VVVVNNN + * -> + * 0123456789 + * VNNNNNN + */ + vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, flags); + vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, flags); + vmg_set_range(&vmg, 0x3000, 0x6000, 3, flags); + vmg.vma = vma; + vmg.prev = vma; + vma->anon_vma = &dummy_anon_vma; + ASSERT_EQ(merge_existing(&vmg), vma_next); + ASSERT_EQ(vma_next->vm_start, 0x3000); + ASSERT_EQ(vma_next->vm_end, 0x9000); + ASSERT_EQ(vma_next->vm_pgoff, 3); + ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma); + ASSERT_EQ(vma->vm_start, 0x2000); + ASSERT_EQ(vma->vm_end, 0x3000); + ASSERT_EQ(vma->vm_pgoff, 2); + ASSERT_TRUE(vma_write_started(vma)); + ASSERT_TRUE(vma_write_started(vma_next)); + ASSERT_EQ(mm.map_count, 2); + + /* Clear down and reset. */ + ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); + + /* + * Merge right case - full span. + * + * <--> + * 0123456789 + * VVVVNNN + * -> + * 0123456789 + * NNNNNNN + */ + vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, flags); + vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, flags); + vmg_set_range(&vmg, 0x2000, 0x6000, 2, flags); + vmg.vma = vma; + vma->anon_vma = &dummy_anon_vma; + ASSERT_EQ(merge_existing(&vmg), vma_next); + ASSERT_EQ(vma_next->vm_start, 0x2000); + ASSERT_EQ(vma_next->vm_end, 0x9000); + ASSERT_EQ(vma_next->vm_pgoff, 2); + ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma); + ASSERT_TRUE(vma_write_started(vma_next)); + ASSERT_EQ(mm.map_count, 1); + + /* Clear down and reset. We should have deleted vma. */ + ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); + + /* + * Merge left case - partial span. 
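+ * Only the start of the VMA merges into prev; the remainder survives as a shrunken VMA.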
+ * + * <-> + * 0123456789 + * PPPVVVV + * -> + * 0123456789 + * PPPPPPV + */ + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, flags); + vmg_set_range(&vmg, 0x3000, 0x6000, 3, flags); + vmg.prev = vma_prev; + vmg.vma = vma; + vma->anon_vma = &dummy_anon_vma; + + ASSERT_EQ(merge_existing(&vmg), vma_prev); + ASSERT_EQ(vma_prev->vm_start, 0); + ASSERT_EQ(vma_prev->vm_end, 0x6000); + ASSERT_EQ(vma_prev->vm_pgoff, 0); + ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); + ASSERT_EQ(vma->vm_start, 0x6000); + ASSERT_EQ(vma->vm_end, 0x7000); + ASSERT_EQ(vma->vm_pgoff, 6); + ASSERT_TRUE(vma_write_started(vma_prev)); + ASSERT_TRUE(vma_write_started(vma)); + ASSERT_EQ(mm.map_count, 2); + + /* Clear down and reset. */ + ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); + + /* + * Merge left case - full span. + * + * <--> + * 0123456789 + * PPPVVVV + * -> + * 0123456789 + * PPPPPPP + */ + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, flags); + vmg_set_range(&vmg, 0x3000, 0x7000, 3, flags); + vmg.prev = vma_prev; + vmg.vma = vma; + vma->anon_vma = &dummy_anon_vma; + ASSERT_EQ(merge_existing(&vmg), vma_prev); + ASSERT_EQ(vma_prev->vm_start, 0); + ASSERT_EQ(vma_prev->vm_end, 0x7000); + ASSERT_EQ(vma_prev->vm_pgoff, 0); + ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); + ASSERT_TRUE(vma_write_started(vma_prev)); + ASSERT_EQ(mm.map_count, 1); + + /* Clear down and reset. We should have deleted vma. */ + ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); + + /* + * Merge both case. + * + * <--> + * 0123456789 + * PPPVVVVNNN + * -> + * 0123456789 + * PPPPPPPPPP + */ + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, flags); + vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, flags); + vmg_set_range(&vmg, 0x3000, 0x7000, 3, flags); + vmg.prev = vma_prev; + vmg.vma = vma; + vma->anon_vma = &dummy_anon_vma; + ASSERT_EQ(merge_existing(&vmg), vma_prev); + ASSERT_EQ(vma_prev->vm_start, 0); + ASSERT_EQ(vma_prev->vm_end, 0x9000); + ASSERT_EQ(vma_prev->vm_pgoff, 0); + ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); + ASSERT_TRUE(vma_write_started(vma_prev)); + ASSERT_EQ(mm.map_count, 1); + + /* Clear down and reset. We should have deleted prev and next. */ + ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); + + /* + * Non-merge ranges. the modified VMA merge operation assumes that the + * caller always specifies ranges within the input VMA so we need only + * examine these cases. 
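+ * None of the ranges below reaches either edge of the VMA (0x3000 or 0x8000), so there is no adjacency with prev or next and each attempt should fail.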
+ * + * - + * - + * - + * <-> + * <> + * <> + * 0123456789a + * PPPVVVVVNNN + */ + + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, flags); + vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, flags); + + vmg_set_range(&vmg, 0x4000, 0x5000, 4, flags); + vmg.prev = vma; + vmg.vma = vma; + ASSERT_EQ(merge_existing(&vmg), NULL); + + vmg_set_range(&vmg, 0x5000, 0x6000, 5, flags); + vmg.prev = vma; + vmg.vma = vma; + ASSERT_EQ(merge_existing(&vmg), NULL); + + vmg_set_range(&vmg, 0x6000, 0x7000, 6, flags); + vmg.prev = vma; + vmg.vma = vma; + ASSERT_EQ(merge_existing(&vmg), NULL); + + vmg_set_range(&vmg, 0x4000, 0x7000, 4, flags); + vmg.prev = vma; + vmg.vma = vma; + ASSERT_EQ(merge_existing(&vmg), NULL); + + vmg_set_range(&vmg, 0x4000, 0x6000, 4, flags); + vmg.prev = vma; + vmg.vma = vma; + ASSERT_EQ(merge_existing(&vmg), NULL); + + vmg_set_range(&vmg, 0x5000, 0x6000, 5, flags); + vmg.prev = vma; + vmg.vma = vma; + ASSERT_EQ(merge_existing(&vmg), NULL); + + ASSERT_EQ(cleanup_mm(&mm, &vmi), 3); + + return true; +} + +static bool test_anon_vma_non_mergeable(void) +{ + unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + struct mm_struct mm = {}; + VMA_ITERATOR(vmi, &mm, 0); + struct vm_area_struct *vma, *vma_prev, *vma_next; + struct vma_merge_struct vmg = { + .mm = &mm, + .vmi = &vmi, + }; + struct anon_vma_chain dummy_anon_vma_chain1 = { + .anon_vma = &dummy_anon_vma, + }; + struct anon_vma_chain dummy_anon_vma_chain2 = { + .anon_vma = &dummy_anon_vma, + }; + + /* + * In the case of modified VMA merge, merging both left and right VMAs + * but where prev and next have incompatible anon_vma objects, we revert + * to a merge of prev and VMA: + * + * <--> + * 0123456789 + * PPPVVVVNNN + * -> + * 0123456789 + * PPPPPPPNNN + */ + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, flags); + vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, flags); + + /* + * Give both prev and next single anon_vma_chain fields, so they will + * merge with the NULL vmg->anon_vma. + * + * However, when prev is compared to next, the merge should fail. + */ + + INIT_LIST_HEAD(&vma_prev->anon_vma_chain); + list_add(&dummy_anon_vma_chain1.same_vma, &vma_prev->anon_vma_chain); + ASSERT_TRUE(list_is_singular(&vma_prev->anon_vma_chain)); + vma_prev->anon_vma = &dummy_anon_vma; + ASSERT_TRUE(is_mergeable_anon_vma(NULL, vma_prev->anon_vma, vma_prev)); + + INIT_LIST_HEAD(&vma_next->anon_vma_chain); + list_add(&dummy_anon_vma_chain2.same_vma, &vma_next->anon_vma_chain); + ASSERT_TRUE(list_is_singular(&vma_next->anon_vma_chain)); + vma_next->anon_vma = (struct anon_vma *)2; + ASSERT_TRUE(is_mergeable_anon_vma(NULL, vma_next->anon_vma, vma_next)); + + ASSERT_FALSE(is_mergeable_anon_vma(vma_prev->anon_vma, vma_next->anon_vma, NULL)); + + vmg_set_range(&vmg, 0x3000, 0x7000, 3, flags); + vmg.prev = vma_prev; + vmg.vma = vma; + + ASSERT_EQ(merge_existing(&vmg), vma_prev); + ASSERT_EQ(vma_prev->vm_start, 0); + ASSERT_EQ(vma_prev->vm_end, 0x7000); + ASSERT_EQ(vma_prev->vm_pgoff, 0); + ASSERT_TRUE(vma_write_started(vma_prev)); + ASSERT_FALSE(vma_write_started(vma_next)); + + /* Clear down and reset. */ + ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); + + /* + * Now consider the new VMA case. This is equivalent, only adding a new + * VMA in a gap between prev and next. 
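+ * Since prev's and next's anon_vmas are incompatible with one another, the new range should again merge with prev only, leaving next untouched.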
+ * + * <--> + * 0123456789 + * PPP****NNN + * -> + * 0123456789 + * PPPPPPPNNN + */ + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, flags); + + INIT_LIST_HEAD(&vma_prev->anon_vma_chain); + list_add(&dummy_anon_vma_chain1.same_vma, &vma_prev->anon_vma_chain); + vma_prev->anon_vma = (struct anon_vma *)1; + + INIT_LIST_HEAD(&vma_next->anon_vma_chain); + list_add(&dummy_anon_vma_chain2.same_vma, &vma_next->anon_vma_chain); + vma_next->anon_vma = (struct anon_vma *)2; + + vmg_set_range(&vmg, 0x3000, 0x7000, 3, flags); + vmg.prev = vma_prev; + + ASSERT_EQ(merge_new(&vmg), vma_prev); + ASSERT_EQ(vma_prev->vm_start, 0); + ASSERT_EQ(vma_prev->vm_end, 0x7000); + ASSERT_EQ(vma_prev->vm_pgoff, 0); + ASSERT_TRUE(vma_write_started(vma_prev)); + ASSERT_FALSE(vma_write_started(vma_next)); + + /* Final cleanup. */ + ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); + + return true; +} + +static bool test_dup_anon_vma(void) +{ + unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + struct mm_struct mm = {}; + VMA_ITERATOR(vmi, &mm, 0); + struct vma_merge_struct vmg = { + .mm = &mm, + .vmi = &vmi, + }; + struct anon_vma_chain dummy_anon_vma_chain = { + .anon_vma = &dummy_anon_vma, + }; + struct vm_area_struct *vma_prev, *vma_next, *vma; + + reset_dummy_anon_vma(); + + /* + * Expanding a VMA to delete the next one duplicates next's anon_vma and + * assigns it to the expanded VMA. + * + * This covers new VMA merging, as these operations amount to a VMA + * expand. + */ + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma_next = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma_next->anon_vma = &dummy_anon_vma; + + vmg_set_range(&vmg, 0, 0x5000, 0, flags); + vmg.vma = vma_prev; + vmg.next = vma_next; + + ASSERT_EQ(expand_existing(&vmg), 0); + + /* Will have been cloned. */ + ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); + ASSERT_TRUE(vma_prev->anon_vma->was_cloned); + + /* Cleanup ready for next run. */ + cleanup_mm(&mm, &vmi); + + /* + * next has anon_vma, we assign to prev. + * + * |<----->| + * |-------*********-------| + * prev vma next + * extend delete delete + */ + + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma_next = alloc_and_link_vma(&mm, 0x5000, 0x8000, 5, flags); + + /* Initialise avc so mergeability check passes. */ + INIT_LIST_HEAD(&vma_next->anon_vma_chain); + list_add(&dummy_anon_vma_chain.same_vma, &vma_next->anon_vma_chain); + + vma_next->anon_vma = &dummy_anon_vma; + vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg.prev = vma_prev; + vmg.vma = vma; + + ASSERT_EQ(merge_existing(&vmg), vma_prev); + + ASSERT_EQ(vma_prev->vm_start, 0); + ASSERT_EQ(vma_prev->vm_end, 0x8000); + + ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); + ASSERT_TRUE(vma_prev->anon_vma->was_cloned); + + cleanup_mm(&mm, &vmi); + + /* + * vma has anon_vma, we assign to prev.
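+ * As in the previous case both vma and next are deleted, but here the anon_vma propagated to prev originates from vma rather than next.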
+ * + * |<----->| + * |-------*********-------| + * prev vma next + * extend delete delete + */ + + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma_next = alloc_and_link_vma(&mm, 0x5000, 0x8000, 5, flags); + + vma->anon_vma = &dummy_anon_vma; + vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg.prev = vma_prev; + vmg.vma = vma; + + ASSERT_EQ(merge_existing(&vmg), vma_prev); + + ASSERT_EQ(vma_prev->vm_start, 0); + ASSERT_EQ(vma_prev->vm_end, 0x8000); + + ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); + ASSERT_TRUE(vma_prev->anon_vma->was_cloned); + + cleanup_mm(&mm, &vmi); + + /* + * vma has anon_vma, we assign to prev. + * + * |<----->| + * |-------************* + * prev vma + * extend shrink/delete + */ + + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, flags); + + vma->anon_vma = &dummy_anon_vma; + vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg.prev = vma_prev; + vmg.vma = vma; + + ASSERT_EQ(merge_existing(&vmg), vma_prev); + + ASSERT_EQ(vma_prev->vm_start, 0); + ASSERT_EQ(vma_prev->vm_end, 0x5000); + + ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); + ASSERT_TRUE(vma_prev->anon_vma->was_cloned); + + cleanup_mm(&mm, &vmi); + + /* + * vma has anon_vma, we assign to next. + * + * |<----->| + * *************-------| + * vma next + * shrink/delete extend + */ + + vma = alloc_and_link_vma(&mm, 0, 0x5000, 0, flags); + vma_next = alloc_and_link_vma(&mm, 0x5000, 0x8000, 5, flags); + + vma->anon_vma = &dummy_anon_vma; + vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg.prev = vma; + vmg.vma = vma; + + ASSERT_EQ(merge_existing(&vmg), vma_next); + + ASSERT_EQ(vma_next->vm_start, 0x3000); + ASSERT_EQ(vma_next->vm_end, 0x8000); + + ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma); + ASSERT_TRUE(vma_next->anon_vma->was_cloned); + + cleanup_mm(&mm, &vmi); + return true; +} + +static bool test_vmi_prealloc_fail(void) +{ + unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + struct mm_struct mm = {}; + VMA_ITERATOR(vmi, &mm, 0); + struct vma_merge_struct vmg = { + .mm = &mm, + .vmi = &vmi, + }; + struct vm_area_struct *vma_prev, *vma; + + /* + * We are merging vma into prev, with vma possessing an anon_vma, which + * will be duplicated. We cause the vmi preallocation to fail and assert + * the duplicated anon_vma is unlinked. + */ + + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma->anon_vma = &dummy_anon_vma; + + vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg.prev = vma_prev; + vmg.vma = vma; + + fail_prealloc = true; + + /* This will cause the merge to fail. */ + ASSERT_EQ(merge_existing(&vmg), NULL); + /* We will already have assigned the anon_vma. */ + ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); + /* And it was both cloned and unlinked. */ + ASSERT_TRUE(dummy_anon_vma.was_cloned); + ASSERT_TRUE(dummy_anon_vma.was_unlinked); + + cleanup_mm(&mm, &vmi); /* Resets fail_prealloc too. */ + + /* + * We repeat the same operation for expanding a VMA, which is what new + * VMA merging ultimately uses too. This asserts that unlinking is + * performed in this case too. 
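+ * As with the merge case we expect -ENOMEM, with the cloned anon_vma again flagged as unlinked.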
+ */ + + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma->anon_vma = &dummy_anon_vma; + + vmg_set_range(&vmg, 0, 0x5000, 3, flags); + vmg.vma = vma_prev; + vmg.next = vma; + + fail_prealloc = true; + ASSERT_EQ(expand_existing(&vmg), -ENOMEM); + + ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); + ASSERT_TRUE(dummy_anon_vma.was_cloned); + ASSERT_TRUE(dummy_anon_vma.was_unlinked); + + cleanup_mm(&mm, &vmi); + return true; +} + +static bool test_merge_extend(void) +{ + unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + struct mm_struct mm = {}; + VMA_ITERATOR(vmi, &mm, 0x1000); + struct vm_area_struct *vma; + + vma = alloc_and_link_vma(&mm, 0, 0x1000, 0, flags); + alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, flags); + + /* + * Extend a VMA into the gap between itself and the following VMA. + * This should result in a merge. + * + * <-> + * * * + * + */ + + ASSERT_EQ(vma_merge_extend(&vmi, vma, 0x2000), vma); + ASSERT_EQ(vma->vm_start, 0); + ASSERT_EQ(vma->vm_end, 0x4000); + ASSERT_EQ(vma->vm_pgoff, 0); + ASSERT_TRUE(vma_write_started(vma)); + ASSERT_EQ(mm.map_count, 1); + + cleanup_mm(&mm, &vmi); + return true; +} + +static bool test_copy_vma(void) +{ + unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + struct mm_struct mm = {}; + bool need_locks = false; + VMA_ITERATOR(vmi, &mm, 0); + struct vm_area_struct *vma, *vma_new, *vma_next; + + /* Move backwards and do not merge. */ + + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma_new = copy_vma(&vma, 0, 0x2000, 0, &need_locks); + + ASSERT_NE(vma_new, vma); + ASSERT_EQ(vma_new->vm_start, 0); + ASSERT_EQ(vma_new->vm_end, 0x2000); + ASSERT_EQ(vma_new->vm_pgoff, 0); + + cleanup_mm(&mm, &vmi); + + /* Move a VMA into position next to another and merge the two. */ + + vma = alloc_and_link_vma(&mm, 0, 0x2000, 0, flags); + vma_next = alloc_and_link_vma(&mm, 0x6000, 0x8000, 6, flags); + vma_new = copy_vma(&vma, 0x4000, 0x2000, 4, &need_locks); + + ASSERT_EQ(vma_new, vma_next); + + cleanup_mm(&mm, &vmi); + return true; +} + int main(void) { int num_tests = 0, num_fail = 0; @@ -193,11 +1453,23 @@ int main(void) } \ } while (0) + /* Very simple tests to kick the tyres. */ TEST(simple_merge); TEST(simple_modify); TEST(simple_expand); TEST(simple_shrink); + TEST(merge_new); + TEST(vma_merge_special_flags); + TEST(vma_merge_with_close); + TEST(vma_merge_new_with_close); + TEST(merge_existing); + TEST(anon_vma_non_mergeable); + TEST(dup_anon_vma); + TEST(vmi_prealloc_fail); + TEST(merge_extend); + TEST(copy_vma); + #undef TEST printf("%d tests run, %d passed, %d failed.\n", diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 093560e5b2ac..a3c262c6eb73 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -81,8 +81,6 @@ #define AS_MM_ALL_LOCKS 2 -#define current NULL - /* We hardcode this for now. */ #define sysctl_max_map_count 0x1000000UL @@ -92,6 +90,12 @@ typedef struct pgprot { pgprotval_t pgprot; } pgprot_t; typedef unsigned long vm_flags_t; typedef __bitwise unsigned int vm_fault_t; +/* + * The shared stubs do not implement this, it amounts to an fprintf(STDERR,...) + * either way :) + */ +#define pr_warn_once pr_err + typedef struct refcount_struct { atomic_t refs; } refcount_t; @@ -100,9 +104,30 @@ struct kref { refcount_t refcount; }; +/* + * Define the task command name length as enum, then it can be visible to + * BPF programs. 
+ */ +enum { + TASK_COMM_LEN = 16, +}; + +struct task_struct { + char comm[TASK_COMM_LEN]; + pid_t pid; + struct mm_struct *mm; +}; + +struct task_struct *get_current(void); +#define current get_current() + struct anon_vma { struct anon_vma *root; struct rb_root_cached rb_root; + + /* Test fields. */ + bool was_cloned; + bool was_unlinked; }; struct anon_vma_chain { @@ -682,13 +707,21 @@ static inline int vma_dup_policy(struct vm_area_struct *, struct vm_area_struct return 0; } -static inline int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *) +static inline int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) { + /* For testing purposes. We indicate that an anon_vma has been cloned. */ + if (src->anon_vma != NULL) { + dst->anon_vma = src->anon_vma; + dst->anon_vma->was_cloned = true; + } + return 0; } -static inline void vma_start_write(struct vm_area_struct *) +static inline void vma_start_write(struct vm_area_struct *vma) { + /* Used to indicate to tests that a write operation has begun. */ + vma->vm_lock_seq++; } static inline void vma_adjust_trans_huge(struct vm_area_struct *vma, @@ -759,8 +792,10 @@ static inline void vma_assert_write_locked(struct vm_area_struct *) { } -static inline void unlink_anon_vmas(struct vm_area_struct *) +static inline void unlink_anon_vmas(struct vm_area_struct *vma) { + /* For testing purposes, indicate that the anon_vma was unlinked. */ + vma->anon_vma->was_unlinked = true; } static inline void anon_vma_unlock_write(struct anon_vma *) From 2f1c6611b0a89afcb8641471af5f223c9caa01e0 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 30 Aug 2024 19:10:15 +0100 Subject: [PATCH 269/416] mm: introduce vma_merge_struct and abstract vma_merge(),vma_modify() Rather than passing around huge numbers of parameters to numerous helper functions, abstract them into a single struct that we thread through the operation, the vma_merge_struct ('vmg'). Adjust vma_merge() and vma_modify() to accept this parameter, as well as predicate functions can_vma_merge_before(), can_vma_merge_after(), and the vma_modify_...() helper functions. Also introduce VMG_STATE() and VMG_VMA_STATE() helper macros to allow for easy vmg declaration. We additionally remove the requirement that vma_merge() is passed a VMA object representing the candidate new VMA. Previously it used this to obtain the mm_struct, file and anon_vma properties of the proposed range (a rather confusing state of affairs), which are now provided by the vmg directly. We also remove the pgoff calculation previously performed in vma_modify(), and instead calculate this in VMG_VMA_STATE() via the vma_pgoff_offset() helper. Link: https://lkml.kernel.org/r/a955aad09d81329f6fbeb636b2dd10cde7b73dab.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Liam R. Howlett Cc: Mark Brown Cc: Vlastimil Babka Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Matthew Wilcox Cc: "Paul E.
McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/mmap.c | 70 ++++++++------ mm/vma.c | 207 ++++++++++++++++++++++++---------------- mm/vma.h | 125 ++++++++++++++---------- tools/testing/vma/vma.c | 43 +-------- 4 files changed, 242 insertions(+), 203 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index fc726c4e98be..ca9c6939638b 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1373,10 +1373,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long end = addr + len; unsigned long merge_start = addr, merge_end = end; bool writable_file_mapping = false; - pgoff_t vm_pgoff; int error = -ENOMEM; VMA_ITERATOR(vmi, mm, addr); + VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff); + vmg.file = file; /* Find the first overlapping VMA */ vma = vma_find(&vmi, end); init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false); @@ -1389,12 +1390,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr, if (error) goto gather_failed; - next = vms.next; - prev = vms.prev; + next = vmg.next = vms.next; + prev = vmg.prev = vms.prev; vma = NULL; } else { - next = vma_next(&vmi); - prev = vma_prev(&vmi); + next = vmg.next = vma_next(&vmi); + prev = vmg.prev = vma_prev(&vmi); if (prev) vma_iter_next_range(&vmi); } @@ -1414,6 +1415,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vms.nr_accounted = 0; vm_flags |= VM_ACCOUNT; + vmg.flags = vm_flags; } if (vm_flags & VM_SPECIAL) @@ -1422,28 +1424,31 @@ unsigned long mmap_region(struct file *file, unsigned long addr, /* Attempt to expand an old mapping */ /* Check next */ if (next && next->vm_start == end && !vma_policy(next) && - can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen, - NULL_VM_UFFD_CTX, NULL)) { + can_vma_merge_before(&vmg)) { merge_end = next->vm_end; vma = next; - vm_pgoff = next->vm_pgoff - pglen; + vmg.pgoff = next->vm_pgoff - pglen; + /* + * We set this here so if we will merge with the previous VMA in + * the code below, can_vma_merge_after() ensures anon_vma + * compatibility between prev and next. + */ + vmg.anon_vma = vma->anon_vma; + vmg.uffd_ctx = vma->vm_userfaultfd_ctx; } /* Check prev */ if (prev && prev->vm_end == addr && !vma_policy(prev) && - (vma ? can_vma_merge_after(prev, vm_flags, vma->anon_vma, file, - pgoff, vma->vm_userfaultfd_ctx, NULL) : - can_vma_merge_after(prev, vm_flags, NULL, file, pgoff, - NULL_VM_UFFD_CTX, NULL))) { + can_vma_merge_after(&vmg)) { merge_start = prev->vm_start; vma = prev; - vm_pgoff = prev->vm_pgoff; + vmg.pgoff = prev->vm_pgoff; vma_prev(&vmi); /* Equivalent to going to the previous range */ } if (vma) { /* Actually expand, if possible */ - if (!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) { + if (!vma_expand(&vmi, vma, merge_start, merge_end, vmg.pgoff, next)) { khugepaged_enter_vma(vma, vm_flags); goto expanded; } @@ -1774,26 +1779,29 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, * Expand the existing vma if possible; Note that singular lists do not * occur after forking, so the expand will only happen on new VMAs. 
*/ - if (vma && vma->vm_end == addr && !vma_policy(vma) && - can_vma_merge_after(vma, flags, NULL, NULL, - addr >> PAGE_SHIFT, NULL_VM_UFFD_CTX, NULL)) { - vma_iter_config(vmi, vma->vm_start, addr + len); - if (vma_iter_prealloc(vmi, vma)) - goto unacct_fail; + if (vma && vma->vm_end == addr && !vma_policy(vma)) { + VMG_STATE(vmg, mm, vmi, addr, addr + len, flags, PHYS_PFN(addr)); - vma_start_write(vma); + vmg.prev = vma; + if (can_vma_merge_after(&vmg)) { + vma_iter_config(vmi, vma->vm_start, addr + len); + if (vma_iter_prealloc(vmi, vma)) + goto unacct_fail; - init_vma_prep(&vp, vma); - vma_prepare(&vp); - vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0); - vma->vm_end = addr + len; - vm_flags_set(vma, VM_SOFTDIRTY); - vma_iter_store(vmi, vma); + vma_start_write(vma); - vma_complete(&vp, vmi, mm); - validate_mm(mm); - khugepaged_enter_vma(vma, flags); - goto out; + init_vma_prep(&vp, vma); + vma_prepare(&vp); + vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0); + vma->vm_end = addr + len; + vm_flags_set(vma, VM_SOFTDIRTY); + vma_iter_store(vmi, vma); + + vma_complete(&vp, vmi, mm); + validate_mm(mm); + khugepaged_enter_vma(vma, flags); + goto out; + } } if (vma) diff --git a/mm/vma.c b/mm/vma.c index 1736bb237b2c..6be645240f07 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -7,16 +7,18 @@ #include "vma_internal.h" #include "vma.h" -/* - * If the vma has a ->close operation then the driver probably needs to release - * per-vma resources, so we don't attempt to merge those if the caller indicates - * the current vma may be removed as part of the merge. - */ -static inline bool is_mergeable_vma(struct vm_area_struct *vma, - struct file *file, unsigned long vm_flags, - struct vm_userfaultfd_ctx vm_userfaultfd_ctx, - struct anon_vma_name *anon_name, bool may_remove_vma) +static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_next) { + struct vm_area_struct *vma = merge_next ? vmg->next : vmg->prev; + /* + * If the vma has a ->close operation then the driver probably needs to + * release per-vma resources, so we don't attempt to merge those if the + * caller indicates the current vma may be removed as part of the merge, + * which is the case if we are attempting to merge the next VMA into + * this one. + */ + bool may_remove_vma = merge_next; + /* * VM_SOFTDIRTY should not prevent from VMA merging, if we * match the flags but dirty bit -- the caller should mark @@ -25,15 +27,15 @@ static inline bool is_mergeable_vma(struct vm_area_struct *vma, * the kernel to generate new VMAs when old one could be * extended instead. */ - if ((vma->vm_flags ^ vm_flags) & ~VM_SOFTDIRTY) + if ((vma->vm_flags ^ vmg->flags) & ~VM_SOFTDIRTY) return false; - if (vma->vm_file != file) + if (vma->vm_file != vmg->file) return false; if (may_remove_vma && vma->vm_ops && vma->vm_ops->close) return false; - if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx)) + if (!is_mergeable_vm_userfaultfd_ctx(vma, vmg->uffd_ctx)) return false; - if (!anon_vma_name_eq(anon_vma_name(vma), anon_name)) + if (!anon_vma_name_eq(anon_vma_name(vma), vmg->anon_name)) return false; return true; } @@ -94,16 +96,16 @@ static void init_multi_vma_prep(struct vma_prepare *vp, * We assume the vma may be removed as part of the merge. 
*/ bool -can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags, - struct anon_vma *anon_vma, struct file *file, - pgoff_t vm_pgoff, struct vm_userfaultfd_ctx vm_userfaultfd_ctx, - struct anon_vma_name *anon_name) +can_vma_merge_before(struct vma_merge_struct *vmg) { - if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name, true) && - is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) { - if (vma->vm_pgoff == vm_pgoff) + pgoff_t pglen = PHYS_PFN(vmg->end - vmg->start); + + if (is_mergeable_vma(vmg, /* merge_next = */ true) && + is_mergeable_anon_vma(vmg->anon_vma, vmg->next->anon_vma, vmg->next)) { + if (vmg->next->vm_pgoff == vmg->pgoff + pglen) return true; } + return false; } @@ -116,18 +118,11 @@ can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags, * * We assume that vma is not removed as part of the merge. */ -bool -can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags, - struct anon_vma *anon_vma, struct file *file, - pgoff_t vm_pgoff, struct vm_userfaultfd_ctx vm_userfaultfd_ctx, - struct anon_vma_name *anon_name) +bool can_vma_merge_after(struct vma_merge_struct *vmg) { - if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name, false) && - is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) { - pgoff_t vm_pglen; - - vm_pglen = vma_pages(vma); - if (vma->vm_pgoff + vm_pglen == vm_pgoff) + if (is_mergeable_vma(vmg, /* merge_next = */ false) && + is_mergeable_anon_vma(vmg->anon_vma, vmg->prev->anon_vma, vmg->prev)) { + if (vmg->prev->vm_pgoff + vma_pages(vmg->prev) == vmg->pgoff) return true; } return false; @@ -1017,16 +1012,10 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, * **** is not represented - it will be merged and the vma containing the * area is returned, or the function will return NULL */ -static struct vm_area_struct -*vma_merge(struct vma_iterator *vmi, struct vm_area_struct *prev, - struct vm_area_struct *src, unsigned long addr, unsigned long end, - unsigned long vm_flags, pgoff_t pgoff, struct mempolicy *policy, - struct vm_userfaultfd_ctx vm_userfaultfd_ctx, - struct anon_vma_name *anon_name) +static struct vm_area_struct *vma_merge(struct vma_merge_struct *vmg) { - struct mm_struct *mm = src->vm_mm; - struct anon_vma *anon_vma = src->anon_vma; - struct file *file = src->vm_file; + struct mm_struct *mm = vmg->mm; + struct vm_area_struct *prev = vmg->prev; struct vm_area_struct *curr, *next, *res; struct vm_area_struct *vma, *adjust, *remove, *remove2; struct vm_area_struct *anon_dup = NULL; @@ -1036,16 +1025,18 @@ static struct vm_area_struct bool merge_prev = false; bool merge_next = false; bool vma_expanded = false; + unsigned long addr = vmg->start; + unsigned long end = vmg->end; unsigned long vma_start = addr; unsigned long vma_end = end; - pgoff_t pglen = (end - addr) >> PAGE_SHIFT; + pgoff_t pglen = PHYS_PFN(end - addr); long adj_start = 0; /* * We later require that vma->vm_flags == vm_flags, * so this tests vma->vm_flags & VM_SPECIAL, too. */ - if (vm_flags & VM_SPECIAL) + if (vmg->flags & VM_SPECIAL) return NULL; /* Does the input range span an existing VMA? 
(cases 5 - 8) */ @@ -1053,27 +1044,26 @@ static struct vm_area_struct if (!curr || /* cases 1 - 4 */ end == curr->vm_end) /* cases 6 - 8, adjacent VMA */ - next = vma_lookup(mm, end); + next = vmg->next = vma_lookup(mm, end); else - next = NULL; /* case 5 */ + next = vmg->next = NULL; /* case 5 */ if (prev) { vma_start = prev->vm_start; vma_pgoff = prev->vm_pgoff; /* Can we merge the predecessor? */ - if (addr == prev->vm_end && mpol_equal(vma_policy(prev), policy) - && can_vma_merge_after(prev, vm_flags, anon_vma, file, - pgoff, vm_userfaultfd_ctx, anon_name)) { + if (addr == prev->vm_end && mpol_equal(vma_policy(prev), vmg->policy) + && can_vma_merge_after(vmg)) { + merge_prev = true; - vma_prev(vmi); + vma_prev(vmg->vmi); } } /* Can we merge the successor? */ - if (next && mpol_equal(policy, vma_policy(next)) && - can_vma_merge_before(next, vm_flags, anon_vma, file, pgoff+pglen, - vm_userfaultfd_ctx, anon_name)) { + if (next && mpol_equal(vmg->policy, vma_policy(next)) && + can_vma_merge_before(vmg)) { merge_next = true; } @@ -1164,13 +1154,13 @@ static struct vm_area_struct vma_expanded = true; if (vma_expanded) { - vma_iter_config(vmi, vma_start, vma_end); + vma_iter_config(vmg->vmi, vma_start, vma_end); } else { - vma_iter_config(vmi, adjust->vm_start + adj_start, + vma_iter_config(vmg->vmi, adjust->vm_start + adj_start, adjust->vm_end); } - if (vma_iter_prealloc(vmi, vma)) + if (vma_iter_prealloc(vmg->vmi, vma)) goto prealloc_fail; init_multi_vma_prep(&vp, vma, adjust, remove, remove2); @@ -1182,20 +1172,20 @@ static struct vm_area_struct vma_set_range(vma, vma_start, vma_end, vma_pgoff); if (vma_expanded) - vma_iter_store(vmi, vma); + vma_iter_store(vmg->vmi, vma); if (adj_start) { adjust->vm_start += adj_start; adjust->vm_pgoff += adj_start >> PAGE_SHIFT; if (adj_start < 0) { WARN_ON(vma_expanded); - vma_iter_store(vmi, next); + vma_iter_store(vmg->vmi, next); } } - vma_complete(&vp, vmi, mm); + vma_complete(&vp, vmg->vmi, mm); validate_mm(mm); - khugepaged_enter_vma(res, vm_flags); + khugepaged_enter_vma(res, vmg->flags); return res; prealloc_fail: @@ -1203,8 +1193,8 @@ prealloc_fail: unlink_anon_vmas(anon_dup); anon_vma_fail: - vma_iter_set(vmi, addr); - vma_iter_load(vmi); + vma_iter_set(vmg->vmi, addr); + vma_iter_load(vmg->vmi); return NULL; } @@ -1221,32 +1211,27 @@ anon_vma_fail: * The function returns either the merged VMA, the original VMA if a split was * required instead, or an error if the split failed. */ -struct vm_area_struct *vma_modify(struct vma_iterator *vmi, - struct vm_area_struct *prev, - struct vm_area_struct *vma, - unsigned long start, unsigned long end, - unsigned long vm_flags, - struct mempolicy *policy, - struct vm_userfaultfd_ctx uffd_ctx, - struct anon_vma_name *anon_name) +static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg) { - pgoff_t pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); + struct vm_area_struct *vma = vmg->vma; struct vm_area_struct *merged; - merged = vma_merge(vmi, prev, vma, start, end, vm_flags, - pgoff, policy, uffd_ctx, anon_name); + /* First, try to merge. */ + merged = vma_merge(vmg); if (merged) return merged; - if (vma->vm_start < start) { - int err = split_vma(vmi, vma, start, 1); + /* Split any preceding portion of the VMA. */ + if (vma->vm_start < vmg->start) { + int err = split_vma(vmg->vmi, vma, vmg->start, 1); if (err) return ERR_PTR(err); } - if (vma->vm_end > end) { - int err = split_vma(vmi, vma, end, 0); + /* Split any trailing portion of the VMA. 
*/ + if (vma->vm_end > vmg->end) { + int err = split_vma(vmg->vmi, vma, vmg->end, 0); if (err) return ERR_PTR(err); @@ -1255,6 +1240,65 @@ struct vm_area_struct *vma_modify(struct vma_iterator *vmi, return vma; } +struct vm_area_struct *vma_modify_flags( + struct vma_iterator *vmi, struct vm_area_struct *prev, + struct vm_area_struct *vma, unsigned long start, unsigned long end, + unsigned long new_flags) +{ + VMG_VMA_STATE(vmg, vmi, prev, vma, start, end); + + vmg.flags = new_flags; + + return vma_modify(&vmg); +} + +struct vm_area_struct +*vma_modify_flags_name(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, + unsigned long end, + unsigned long new_flags, + struct anon_vma_name *new_name) +{ + VMG_VMA_STATE(vmg, vmi, prev, vma, start, end); + + vmg.flags = new_flags; + vmg.anon_name = new_name; + + return vma_modify(&vmg); +} + +struct vm_area_struct +*vma_modify_policy(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, unsigned long end, + struct mempolicy *new_pol) +{ + VMG_VMA_STATE(vmg, vmi, prev, vma, start, end); + + vmg.policy = new_pol; + + return vma_modify(&vmg); +} + +struct vm_area_struct +*vma_modify_flags_uffd(struct vma_iterator *vmi, + struct vm_area_struct *prev, + struct vm_area_struct *vma, + unsigned long start, unsigned long end, + unsigned long new_flags, + struct vm_userfaultfd_ctx new_ctx) +{ + VMG_VMA_STATE(vmg, vmi, prev, vma, start, end); + + vmg.flags = new_flags; + vmg.uffd_ctx = new_ctx; + + return vma_modify(&vmg); +} + /* * Attempt to merge a newly mapped VMA with those adjacent to it. The caller * must ensure that [start, end) does not overlap any existing VMA. @@ -1264,8 +1308,11 @@ struct vm_area_struct struct vm_area_struct *vma, unsigned long start, unsigned long end, pgoff_t pgoff) { - return vma_merge(vmi, prev, vma, start, end, vma->vm_flags, pgoff, - vma_policy(vma), vma->vm_userfaultfd_ctx, anon_vma_name(vma)); + VMG_VMA_STATE(vmg, vmi, prev, vma, start, end); + + vmg.pgoff = pgoff; + + return vma_merge(&vmg); } /* @@ -1276,12 +1323,10 @@ struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long delta) { - pgoff_t pgoff = vma->vm_pgoff + vma_pages(vma); + VMG_VMA_STATE(vmg, vmi, vma, vma, vma->vm_end, vma->vm_end + delta); /* vma is specified as prev, so case 1 or 2 will apply. */ - return vma_merge(vmi, vma, vma, vma->vm_end, vma->vm_end + delta, - vma->vm_flags, pgoff, vma_policy(vma), - vma->vm_userfaultfd_ctx, anon_vma_name(vma)); + return vma_merge(&vmg); } void unlink_file_vma_batch_init(struct unlink_vma_file_batch *vb) diff --git a/mm/vma.h b/mm/vma.h index 2a3a5c89a33b..e761713f855f 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -52,6 +52,59 @@ struct vma_munmap_struct { unsigned long data_vm; }; +/* Represents a VMA merge operation. */ +struct vma_merge_struct { + struct mm_struct *mm; + struct vma_iterator *vmi; + pgoff_t pgoff; + struct vm_area_struct *prev; + struct vm_area_struct *next; /* Modified by vma_merge(). */ + struct vm_area_struct *vma; /* Either a new VMA or the one being modified. */ + unsigned long start; + unsigned long end; + unsigned long flags; + struct file *file; + struct anon_vma *anon_vma; + struct mempolicy *policy; + struct vm_userfaultfd_ctx uffd_ctx; + struct anon_vma_name *anon_name; +}; + +/* Assumes addr >= vma->vm_start. 
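It returns the page offset within the mapping corresponding to addr; for example, a VMA with vm_pgoff == 4 yields 6 for addr == vma->vm_start + 2 * PAGE_SIZE.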
*/ +static inline pgoff_t vma_pgoff_offset(struct vm_area_struct *vma, + unsigned long addr) +{ + return vma->vm_pgoff + PHYS_PFN(addr - vma->vm_start); +} + +#define VMG_STATE(name, mm_, vmi_, start_, end_, flags_, pgoff_) \ + struct vma_merge_struct name = { \ + .mm = mm_, \ + .vmi = vmi_, \ + .start = start_, \ + .end = end_, \ + .flags = flags_, \ + .pgoff = pgoff_, \ + } + +#define VMG_VMA_STATE(name, vmi_, prev_, vma_, start_, end_) \ + struct vma_merge_struct name = { \ + .mm = vma_->vm_mm, \ + .vmi = vmi_, \ + .prev = prev_, \ + .next = NULL, \ + .vma = vma_, \ + .start = start_, \ + .end = end_, \ + .flags = vma_->vm_flags, \ + .pgoff = vma_pgoff_offset(vma_, start_), \ + .file = vma_->vm_file, \ + .anon_vma = vma_->anon_vma, \ + .policy = vma_policy(vma_), \ + .uffd_ctx = vma_->vm_userfaultfd_ctx, \ + .anon_name = anon_vma_name(vma_), \ + } + #ifdef CONFIG_DEBUG_VM_MAPLE_TREE void validate_mm(struct mm_struct *mm); #else @@ -212,80 +265,52 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed); void unmap_region(struct ma_state *mas, struct vm_area_struct *vma, struct vm_area_struct *prev, struct vm_area_struct *next); -/* Required by mmap_region(). */ -bool -can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags, - struct anon_vma *anon_vma, struct file *file, - pgoff_t vm_pgoff, struct vm_userfaultfd_ctx vm_userfaultfd_ctx, - struct anon_vma_name *anon_name); +/* + * Can we merge the VMA described by vmg into the following VMA vmg->next? + * + * Required by mmap_region(). + */ +bool can_vma_merge_before(struct vma_merge_struct *vmg); -/* Required by mmap_region() and do_brk_flags(). */ -bool -can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags, - struct anon_vma *anon_vma, struct file *file, - pgoff_t vm_pgoff, struct vm_userfaultfd_ctx vm_userfaultfd_ctx, - struct anon_vma_name *anon_name); - -struct vm_area_struct *vma_modify(struct vma_iterator *vmi, - struct vm_area_struct *prev, - struct vm_area_struct *vma, - unsigned long start, unsigned long end, - unsigned long vm_flags, - struct mempolicy *policy, - struct vm_userfaultfd_ctx uffd_ctx, - struct anon_vma_name *anon_name); +/* + * Can we merge the VMA described by vmg into the preceding VMA vmg->prev? + * + * Required by mmap_region() and do_brk_flags(). + */ +bool can_vma_merge_after(struct vma_merge_struct *vmg); /* We are about to modify the VMA's flags. */ -static inline struct vm_area_struct -*vma_modify_flags(struct vma_iterator *vmi, - struct vm_area_struct *prev, - struct vm_area_struct *vma, - unsigned long start, unsigned long end, - unsigned long new_flags) -{ - return vma_modify(vmi, prev, vma, start, end, new_flags, - vma_policy(vma), vma->vm_userfaultfd_ctx, - anon_vma_name(vma)); -} +struct vm_area_struct *vma_modify_flags(struct vma_iterator *vmi, + struct vm_area_struct *prev, struct vm_area_struct *vma, + unsigned long start, unsigned long end, + unsigned long new_flags); /* We are about to modify the VMA's flags and/or anon_name. */ -static inline struct vm_area_struct +struct vm_area_struct *vma_modify_flags_name(struct vma_iterator *vmi, struct vm_area_struct *prev, struct vm_area_struct *vma, unsigned long start, unsigned long end, unsigned long new_flags, - struct anon_vma_name *new_name) -{ - return vma_modify(vmi, prev, vma, start, end, new_flags, - vma_policy(vma), vma->vm_userfaultfd_ctx, new_name); -} + struct anon_vma_name *new_name); /* We are about to modify the VMA's memory policy. 
*/ -static inline struct vm_area_struct +struct vm_area_struct *vma_modify_policy(struct vma_iterator *vmi, struct vm_area_struct *prev, struct vm_area_struct *vma, unsigned long start, unsigned long end, - struct mempolicy *new_pol) -{ - return vma_modify(vmi, prev, vma, start, end, vma->vm_flags, - new_pol, vma->vm_userfaultfd_ctx, anon_vma_name(vma)); -} + struct mempolicy *new_pol); /* We are about to modify the VMA's flags and/or uffd context. */ -static inline struct vm_area_struct +struct vm_area_struct *vma_modify_flags_uffd(struct vma_iterator *vmi, struct vm_area_struct *prev, struct vm_area_struct *vma, unsigned long start, unsigned long end, unsigned long new_flags, - struct vm_userfaultfd_ctx new_ctx) -{ - return vma_modify(vmi, prev, vma, start, end, new_flags, - vma_policy(vma), new_ctx, anon_vma_name(vma)); -} + struct vm_userfaultfd_ctx new_ctx); struct vm_area_struct *vma_merge_new_vma(struct vma_iterator *vmi, struct vm_area_struct *prev, diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 71bd30d5da81..7a3f59186464 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -22,26 +22,6 @@ static bool fail_prealloc; */ #include "../../../mm/vma.c" -/* - * Temporarily forward-ported from a future in which vmg's are used for merging. - */ -struct vma_merge_struct { - struct mm_struct *mm; - struct vma_iterator *vmi; - pgoff_t pgoff; - struct vm_area_struct *prev; - struct vm_area_struct *next; /* Modified by vma_merge(). */ - struct vm_area_struct *vma; /* Either a new VMA or the one being modified. */ - unsigned long start; - unsigned long end; - unsigned long flags; - struct file *file; - struct anon_vma *anon_vma; - struct mempolicy *policy; - struct vm_userfaultfd_ctx uffd_ctx; - struct anon_vma_name *anon_name; -}; - const struct vm_operations_struct vma_dummy_vm_ops; static struct anon_vma dummy_anon_vma; @@ -115,14 +95,6 @@ static struct vm_area_struct *alloc_and_link_vma(struct mm_struct *mm, /* Helper function which provides a wrapper around a merge new VMA operation. */ static struct vm_area_struct *merge_new(struct vma_merge_struct *vmg) { - /* vma_merge() needs a VMA to determine mm, anon_vma, and file. */ - struct vm_area_struct dummy = { - .vm_mm = vmg->mm, - .vm_flags = vmg->flags, - .anon_vma = vmg->anon_vma, - .vm_file = vmg->file, - }; - /* * For convenience, get prev and next VMAs. Which the new VMA operation * requires. @@ -131,8 +103,7 @@ static struct vm_area_struct *merge_new(struct vma_merge_struct *vmg) vmg->prev = vma_prev(vmg->vmi); vma_iter_set(vmg->vmi, vmg->start); - return vma_merge_new_vma(vmg->vmi, vmg->prev, &dummy, vmg->start, - vmg->end, vmg->pgoff); + return vma_merge(vmg); } /* @@ -141,17 +112,7 @@ static struct vm_area_struct *merge_new(struct vma_merge_struct *vmg) */ static struct vm_area_struct *merge_existing(struct vma_merge_struct *vmg) { - /* vma_merge() needs a VMA to determine mm, anon_vma, and file. 
*/ - struct vm_area_struct dummy = { - .vm_mm = vmg->mm, - .vm_flags = vmg->flags, - .anon_vma = vmg->anon_vma, - .vm_file = vmg->file, - }; - - return vma_merge(vmg->vmi, vmg->prev, &dummy, vmg->start, vmg->end, - vmg->flags, vmg->pgoff, vmg->policy, vmg->uffd_ctx, - vmg->anon_name); + return vma_merge(vmg); } /* From 3e01310d29a700b99067cb0d6ba21284f2389308 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 30 Aug 2024 19:10:16 +0100 Subject: [PATCH 270/416] mm: remove duplicated open-coded VMA policy check Both can_vma_merge_before() and can_vma_merge_after() are invoked after checking for compatible VMA NUMA policy, we can simply move this to is_mergeable_vma() and abstract this altogether. In mmap_region() we set vmg->policy to NULL, so the policy comparisons checked in can_vma_merge_before() and can_vma_merge_after() are exactly equivalent to !vma_policy(vmg.next) and !vma_policy(vmg.prev). Equally, in do_brk_flags(), vmg->policy is NULL, so the can_vma_merge_after() is checking !vma_policy(vma), as we set vmg.prev to vma. In vma_merge(), we compare prev and next policies with vmg->policy before checking can_vma_merge_after() and can_vma_merge_before() respectively, which this patch causes to be checked in precisely the same way. This therefore maintains precisely the same logic as before, only now abstracted into is_mergeable_vma(). Link: https://lkml.kernel.org/r/0dbff286d9c4988333bc6f4ff3734cb95dd5410a.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Vlastimil Babka Reviewed-by: Liam R. Howlett Cc: Mark Brown Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/mmap.c | 8 +++----- mm/vma.c | 9 ++++----- 2 files changed, 7 insertions(+), 10 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index ca9c6939638b..3af8459e4e88 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1423,8 +1423,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, /* Attempt to expand an old mapping */ /* Check next */ - if (next && next->vm_start == end && !vma_policy(next) && - can_vma_merge_before(&vmg)) { + if (next && next->vm_start == end && can_vma_merge_before(&vmg)) { merge_end = next->vm_end; vma = next; vmg.pgoff = next->vm_pgoff - pglen; @@ -1438,8 +1437,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, } /* Check prev */ - if (prev && prev->vm_end == addr && !vma_policy(prev) && - can_vma_merge_after(&vmg)) { + if (prev && prev->vm_end == addr && can_vma_merge_after(&vmg)) { merge_start = prev->vm_start; vma = prev; vmg.pgoff = prev->vm_pgoff; @@ -1779,7 +1777,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, * Expand the existing vma if possible; Note that singular lists do not * occur after forking, so the expand will only happen on new VMAs. 
*/ - if (vma && vma->vm_end == addr && !vma_policy(vma)) { + if (vma && vma->vm_end == addr) { VMG_STATE(vmg, mm, vmi, addr, addr + len, flags, PHYS_PFN(addr)); vmg.prev = vma; diff --git a/mm/vma.c b/mm/vma.c index 6be645240f07..3284bb778c3d 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -19,6 +19,8 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_nex */ bool may_remove_vma = merge_next; + if (!mpol_equal(vmg->policy, vma_policy(vma))) + return false; /* * VM_SOFTDIRTY should not prevent from VMA merging, if we * match the flags but dirty bit -- the caller should mark @@ -1053,17 +1055,14 @@ static struct vm_area_struct *vma_merge(struct vma_merge_struct *vmg) vma_pgoff = prev->vm_pgoff; /* Can we merge the predecessor? */ - if (addr == prev->vm_end && mpol_equal(vma_policy(prev), vmg->policy) - && can_vma_merge_after(vmg)) { - + if (addr == prev->vm_end && can_vma_merge_after(vmg)) { merge_prev = true; vma_prev(vmg->vmi); } } /* Can we merge the successor? */ - if (next && mpol_equal(vmg->policy, vma_policy(next)) && - can_vma_merge_before(vmg)) { + if (next && can_vma_merge_before(vmg)) { merge_next = true; } From fc21959f74bc1138b28e90a02ec224ab8626111e Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 30 Aug 2024 19:10:17 +0100 Subject: [PATCH 271/416] mm: abstract vma_expand() to use vma_merge_struct The purpose of the vmg is to thread merge state through functions and avoid egregious parameter lists. We expand this to vma_expand(), which is used for a number of merge cases. Accordingly, adjust its callers, mmap_region() and relocate_vma_down(), to use a vmg. An added purpose of this change is the ability in a future commit to perform all new VMA range merging using vma_expand(). Link: https://lkml.kernel.org/r/4bc8c9dbc9ca52452ef8e587b28fe555854ceb38.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Liam R. Howlett Cc: Mark Brown Cc: Vlastimil Babka Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Matthew Wilcox Cc: "Paul E. 
McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/mmap.c | 15 ++++++++------- mm/vma.c | 39 +++++++++++++++++---------------------- mm/vma.h | 5 +---- tools/testing/vma/vma.c | 3 +-- 4 files changed, 27 insertions(+), 35 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 3af8459e4e88..2b3006efd3fb 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1371,7 +1371,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr, struct ma_state mas_detach; struct maple_tree mt_detach; unsigned long end = addr + len; - unsigned long merge_start = addr, merge_end = end; bool writable_file_mapping = false; int error = -ENOMEM; VMA_ITERATOR(vmi, mm, addr); @@ -1424,8 +1423,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr, /* Attempt to expand an old mapping */ /* Check next */ if (next && next->vm_start == end && can_vma_merge_before(&vmg)) { - merge_end = next->vm_end; - vma = next; + vmg.end = next->vm_end; + vma = vmg.vma = next; vmg.pgoff = next->vm_pgoff - pglen; /* * We set this here so if we will merge with the previous VMA in @@ -1438,15 +1437,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr, /* Check prev */ if (prev && prev->vm_end == addr && can_vma_merge_after(&vmg)) { - merge_start = prev->vm_start; - vma = prev; + vmg.start = prev->vm_start; + vma = vmg.vma = prev; vmg.pgoff = prev->vm_pgoff; vma_prev(&vmi); /* Equivalent to going to the previous range */ } if (vma) { /* Actually expand, if possible */ - if (!vma_expand(&vmi, vma, merge_start, merge_end, vmg.pgoff, next)) { + if (!vma_expand(&vmg)) { khugepaged_enter_vma(vma, vm_flags); goto expanded; } @@ -2320,6 +2319,7 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) unsigned long new_start = old_start - shift; unsigned long new_end = old_end - shift; VMA_ITERATOR(vmi, mm, new_start); + VMG_STATE(vmg, mm, &vmi, new_start, old_end, 0, vma->vm_pgoff); struct vm_area_struct *next; struct mmu_gather tlb; @@ -2336,7 +2336,8 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift) /* * cover the whole range: [new_start, old_end) */ - if (vma_expand(&vmi, vma, new_start, old_end, vma->vm_pgoff, NULL)) + vmg.vma = vma; + if (vma_expand(&vmg)) return -ENOMEM; /* diff --git a/mm/vma.c b/mm/vma.c index 3284bb778c3d..d1033dade70e 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -467,30 +467,25 @@ void validate_mm(struct mm_struct *mm) /* * vma_expand - Expand an existing VMA * - * @vmi: The vma iterator - * @vma: The vma to expand - * @start: The start of the vma - * @end: The exclusive end of the vma - * @pgoff: The page offset of vma - * @next: The current of next vma. + * @vmg: Describes a VMA expansion operation. * - * Expand @vma to @start and @end. Can expand off the start and end. Will - * expand over @next if it's different from @vma and @end == @next->vm_end. - * Checking if the @vma can expand and merge with @next needs to be handled by - * the caller. + * Expand @vma to vmg->start and vmg->end. Can expand off the start and end. + * Will expand over vmg->next if it's different from vmg->vma and vmg->end == + * vmg->next->vm_end. Checking if the vmg->vma can expand and merge with + * vmg->next needs to be handled by the caller. 
* * Returns: 0 on success */ -int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, - unsigned long start, unsigned long end, pgoff_t pgoff, - struct vm_area_struct *next) +int vma_expand(struct vma_merge_struct *vmg) { struct vm_area_struct *anon_dup = NULL; bool remove_next = false; + struct vm_area_struct *vma = vmg->vma; + struct vm_area_struct *next = vmg->next; struct vma_prepare vp; vma_start_write(vma); - if (next && (vma != next) && (end == next->vm_end)) { + if (next && (vma != next) && (vmg->end == next->vm_end)) { int ret; remove_next = true; @@ -503,21 +498,21 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL); /* Not merging but overwriting any part of next is not handled. */ VM_WARN_ON(next && !vp.remove && - next != vma && end > next->vm_start); + next != vma && vmg->end > next->vm_start); /* Only handles expanding */ - VM_WARN_ON(vma->vm_start < start || vma->vm_end > end); + VM_WARN_ON(vma->vm_start < vmg->start || vma->vm_end > vmg->end); /* Note: vma iterator must be pointing to 'start' */ - vma_iter_config(vmi, start, end); - if (vma_iter_prealloc(vmi, vma)) + vma_iter_config(vmg->vmi, vmg->start, vmg->end); + if (vma_iter_prealloc(vmg->vmi, vma)) goto nomem; vma_prepare(&vp); - vma_adjust_trans_huge(vma, start, end, 0); - vma_set_range(vma, start, end, pgoff); - vma_iter_store(vmi, vma); + vma_adjust_trans_huge(vma, vmg->start, vmg->end, 0); + vma_set_range(vma, vmg->start, vmg->end, vmg->pgoff); + vma_iter_store(vmg->vmi, vma); - vma_complete(&vp, vmi, vma->vm_mm); + vma_complete(&vp, vmg->vmi, vma->vm_mm); return 0; nomem: diff --git a/mm/vma.h b/mm/vma.h index e761713f855f..218d884ff5ff 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -128,10 +128,7 @@ void init_vma_prep(struct vma_prepare *vp, void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, struct mm_struct *mm); -int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, - unsigned long start, unsigned long end, pgoff_t pgoff, - struct vm_area_struct *next); - +int vma_expand(struct vma_merge_struct *vmg); int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long start, unsigned long end, pgoff_t pgoff); diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 7a3f59186464..f6c4706a861f 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -121,8 +121,7 @@ static struct vm_area_struct *merge_existing(struct vma_merge_struct *vmg) */ static int expand_existing(struct vma_merge_struct *vmg) { - return vma_expand(vmg->vmi, vmg->vma, vmg->start, vmg->end, vmg->pgoff, - vmg->next); + return vma_expand(vmg); } /* From cacded5e42b9609b07b22d80c10f0076d439f7d1 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 30 Aug 2024 19:10:18 +0100 Subject: [PATCH 272/416] mm: avoid using vma_merge() for new VMAs Abstract vma_merge_new_vma() to use vma_merge_struct and rename the resultant function vma_merge_new_range() to be clear what the purpose of this function is - a new VMA is desired in the specified range, and we wish to see if it is possible to 'merge' surrounding VMAs into this range rather than having to allocate a new VMA. Note that this function uses vma_extend() exclusively, so adopts its requirement that the iterator point at or before the gap. We add an assert to this effect. 
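As an aside, the calling convention this establishes for new-range merges can be sketched roughly as follows (illustrative only, not part of the patch; mm, addr, len and vm_flags stand in for whatever the caller already has, the write mmap_lock is assumed held, and the vm_area_alloc() fallback is merely a placeholder for the usual allocation path):

	VMA_ITERATOR(vmi, mm, addr);
	VMG_STATE(vmg, mm, &vmi, addr, addr + len, vm_flags, PHYS_PFN(addr));
	struct vm_area_struct *vma;

	/* Locate next, record prev, and leave the iterator at or before the gap. */
	vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev);

	/* Try to expand an adjacent VMA over [addr, addr + len) before allocating. */
	vma = vma_merge_new_range(&vmg);
	if (!vma) {
		/* vmg.state distinguishes a failed merge from an allocation failure. */
		if (vmg_nomem(&vmg))
			return -ENOMEM;
		/* VMA_MERGE_NOMERGE: fall back to allocating a new VMA. */
		vma = vm_area_alloc(mm);
	}

A NULL return therefore no longer means only "no merge was possible"; callers such as do_brk_flags() consult vmg.state (via vmg_nomem()) to tell an out-of-memory condition apart from a legitimate non-merge. The sketch above covers the new-range case only.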
This is as opposed to vma_merge_existing_range(), which will be introduced in a subsequent commit, and provide the same functionality for cases in which we are modifying an existing VMA. In mmap_region() and do_brk_flags() we open code scenarios where we prefer to use vma_expand() rather than invoke a full vma_merge() operation. Abstract this logic and eliminate all of the open-coding, and also use the same logic for all cases where we add new VMAs to, rather than ultimately use vma_merge(), rather use vma_expand(). Doing so removes duplication and simplifies VMA merging in all such cases, laying the ground for us to eliminate the merging of new VMAs in vma_merge() altogether. Also add the ability for the vmg to track state, and able to report errors, allowing for us to differentiate a failed merge from an inability to allocate memory in callers. This makes it far easier to understand what is happening in these cases avoiding confusion, bugs and allowing for future optimisation. Also introduce vma_iter_next_rewind() to allow for retrieval of the next, and (optionally) the prev VMA, rewinding to the start of the previous gap. Introduce are_anon_vmas_compatible() to abstract individual VMA anon_vma comparison for the case of merging on both sides where the anon_vma of the VMA being merged maybe compatible with prev and next, but prev and next's anon_vma's may not be compatible with each other. Finally also introduce can_vma_merge_left() / can_vma_merge_right() to check adjacent VMA compatibility and that they are indeed adjacent. Link: https://lkml.kernel.org/r/49d37c0769b6b9dc03b27fe4d059173832556392.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Tested-by: Mark Brown Cc: Liam R. Howlett Cc: Vlastimil Babka Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/mmap.c | 90 +++----------- mm/vma.c | 200 +++++++++++++++++++++++++++---- mm/vma.h | 48 +++++++- tools/testing/vma/vma.c | 33 ++++- tools/testing/vma/vma_internal.h | 6 + 5 files changed, 278 insertions(+), 99 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 2b3006efd3fb..02f7b45c3076 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1364,8 +1364,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr, { struct mm_struct *mm = current->mm; struct vm_area_struct *vma = NULL; - struct vm_area_struct *next, *prev, *merge; pgoff_t pglen = PHYS_PFN(len); + struct vm_area_struct *merge; unsigned long charged = 0; struct vma_munmap_struct vms; struct ma_state mas_detach; @@ -1389,14 +1389,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr, if (error) goto gather_failed; - next = vmg.next = vms.next; - prev = vmg.prev = vms.prev; + vmg.next = vms.next; + vmg.prev = vms.prev; vma = NULL; } else { - next = vmg.next = vma_next(&vmi); - prev = vmg.prev = vma_prev(&vmi); - if (prev) - vma_iter_next_range(&vmi); + vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev); } /* Check against address space limit. 
*/ @@ -1417,46 +1414,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vmg.flags = vm_flags; } - if (vm_flags & VM_SPECIAL) - goto cannot_expand; - - /* Attempt to expand an old mapping */ - /* Check next */ - if (next && next->vm_start == end && can_vma_merge_before(&vmg)) { - vmg.end = next->vm_end; - vma = vmg.vma = next; - vmg.pgoff = next->vm_pgoff - pglen; - /* - * We set this here so if we will merge with the previous VMA in - * the code below, can_vma_merge_after() ensures anon_vma - * compatibility between prev and next. - */ - vmg.anon_vma = vma->anon_vma; - vmg.uffd_ctx = vma->vm_userfaultfd_ctx; - } - - /* Check prev */ - if (prev && prev->vm_end == addr && can_vma_merge_after(&vmg)) { - vmg.start = prev->vm_start; - vma = vmg.vma = prev; - vmg.pgoff = prev->vm_pgoff; - vma_prev(&vmi); /* Equivalent to going to the previous range */ - } - - if (vma) { - /* Actually expand, if possible */ - if (!vma_expand(&vmg)) { - khugepaged_enter_vma(vma, vm_flags); - goto expanded; - } - - /* If the expand fails, then reposition the vma iterator */ - if (unlikely(vma == prev)) - vma_iter_set(&vmi, addr); - } - -cannot_expand: - + vma = vma_merge_new_range(&vmg); + if (vma) + goto expanded; /* * Determine the object being mapped and call the appropriate * specific mapper. the address has already been validated, but @@ -1503,10 +1463,11 @@ cannot_expand: * If vm_flags changed after call_mmap(), we should try merge * vma again as we may succeed this time. */ - if (unlikely(vm_flags != vma->vm_flags && prev)) { - merge = vma_merge_new_vma(&vmi, prev, vma, - vma->vm_start, vma->vm_end, - vma->vm_pgoff); + if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) { + vmg.flags = vma->vm_flags; + /* If this fails, state is reset ready for a reattempt. */ + merge = vma_merge_new_range(&vmg); + if (merge) { /* * ->mmap() can change vma->vm_file and fput @@ -1522,6 +1483,7 @@ cannot_expand: vm_flags = vma->vm_flags; goto unmap_writable; } + vma_iter_config(&vmi, addr, end); } vm_flags = vma->vm_flags; @@ -1554,7 +1516,7 @@ cannot_expand: vma_link_file(vma); /* - * vma_merge() calls khugepaged_enter_vma() either, the below + * vma_merge_new_range() calls khugepaged_enter_vma() too, the below * call covers the non-merge case. */ khugepaged_enter_vma(vma, vma->vm_flags); @@ -1609,7 +1571,7 @@ unmap_and_free_vma: vma_iter_set(&vmi, vma->vm_end); /* Undo any partial mapping done by a device driver. 
*/ - unmap_region(&vmi.mas, vma, prev, next); + unmap_region(&vmi.mas, vma, vmg.prev, vmg.next); } if (writable_file_mapping) mapping_unmap_writable(file->f_mapping); @@ -1756,7 +1718,6 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long addr, unsigned long len, unsigned long flags) { struct mm_struct *mm = current->mm; - struct vma_prepare vp; /* * Check against address space limits by the changed size @@ -1780,25 +1741,12 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, VMG_STATE(vmg, mm, vmi, addr, addr + len, flags, PHYS_PFN(addr)); vmg.prev = vma; - if (can_vma_merge_after(&vmg)) { - vma_iter_config(vmi, vma->vm_start, addr + len); - if (vma_iter_prealloc(vmi, vma)) - goto unacct_fail; + vma_iter_next_range(vmi); - vma_start_write(vma); - - init_vma_prep(&vp, vma); - vma_prepare(&vp); - vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0); - vma->vm_end = addr + len; - vm_flags_set(vma, VM_SOFTDIRTY); - vma_iter_store(vmi, vma); - - vma_complete(&vp, vmi, mm); - validate_mm(mm); - khugepaged_enter_vma(vma, flags); + if (vma_merge_new_range(&vmg)) goto out; - } + else if (vmg_nomem(&vmg)) + goto unacct_fail; } if (vma) diff --git a/mm/vma.c b/mm/vma.c index d1033dade70e..749c4881fd60 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -55,6 +55,13 @@ static inline bool is_mergeable_anon_vma(struct anon_vma *anon_vma1, return anon_vma1 == anon_vma2; } +/* Are the anon_vma's belonging to each VMA compatible with one another? */ +static inline bool are_anon_vmas_compatible(struct vm_area_struct *vma1, + struct vm_area_struct *vma2) +{ + return is_mergeable_anon_vma(vma1->anon_vma, vma2->anon_vma, NULL); +} + /* * init_multi_vma_prep() - Initializer for struct vma_prepare * @vp: The vma_prepare struct @@ -130,6 +137,44 @@ bool can_vma_merge_after(struct vma_merge_struct *vmg) return false; } +/* + * Can the proposed VMA be merged with the left (previous) VMA taking into + * account the start position of the proposed range. + */ +static bool can_vma_merge_left(struct vma_merge_struct *vmg) + +{ + return vmg->prev && vmg->prev->vm_end == vmg->start && + can_vma_merge_after(vmg); +} + +/* + * Can the proposed VMA be merged with the right (next) VMA taking into + * account the end position of the proposed range. + * + * In addition, if we can merge with the left VMA, ensure that left and right + * anon_vma's are also compatible. + */ +static bool can_vma_merge_right(struct vma_merge_struct *vmg, + bool can_merge_left) +{ + if (!vmg->next || vmg->end != vmg->next->vm_start || + !can_vma_merge_before(vmg)) + return false; + + if (!can_merge_left) + return true; + + /* + * If we can merge with prev (left) and next (right), indicating that + * each VMA's anon_vma is compatible with the proposed anon_vma, this + * does not mean prev and next are compatible with EACH OTHER. + * + * We therefore check this in addition to mergeability to either side. + */ + return are_anon_vmas_compatible(vmg->prev, vmg->next); +} + /* * Close a vm structure and free it. */ @@ -464,6 +509,111 @@ void validate_mm(struct mm_struct *mm) } #endif /* CONFIG_DEBUG_VM_MAPLE_TREE */ +/* + * vma_merge_new_range - Attempt to merge a new VMA into address space + * + * @vmg: Describes the VMA we are adding, in the range @vmg->start to @vmg->end + * (exclusive), which we try to merge with any adjacent VMAs if possible. + * + * We are about to add a VMA to the address space starting at @vmg->start and + * ending at @vmg->end. 
There are three different possible scenarios: + * + * 1. There is a VMA with identical properties immediately adjacent to the + * proposed new VMA [@vmg->start, @vmg->end) either before or after it - + * EXPAND that VMA: + * + * Proposed: |-----| or |-----| + * Existing: |----| |----| + * + * 2. There are VMAs with identical properties immediately adjacent to the + * proposed new VMA [@vmg->start, @vmg->end) both before AND after it - + * EXPAND the former and REMOVE the latter: + * + * Proposed: |-----| + * Existing: |----| |----| + * + * 3. There are no VMAs immediately adjacent to the proposed new VMA or those + * VMAs do not have identical attributes - NO MERGE POSSIBLE. + * + * In instances where we can merge, this function returns the expanded VMA which + * will have its range adjusted accordingly and the underlying maple tree also + * adjusted. + * + * Returns: In instances where no merge was possible, NULL. Otherwise, a pointer + * to the VMA we expanded. + * + * This function adjusts @vmg to provide @vmg->next if not already specified, + * and adjusts [@vmg->start, @vmg->end) to span the expanded range. + * + * ASSUMPTIONS: + * - The caller must hold a WRITE lock on the mm_struct->mmap_lock. + * - The caller must have determined that [@vmg->start, @vmg->end) is empty, + other than VMAs that will be unmapped should the operation succeed. + * - The caller must have specified the previous vma in @vmg->prev. + * - The caller must have specified the next vma in @vmg->next. + * - The caller must have positioned the vmi at or before the gap. + */ +struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg) +{ + struct vm_area_struct *prev = vmg->prev; + struct vm_area_struct *next = vmg->next; + unsigned long start = vmg->start; + unsigned long end = vmg->end; + pgoff_t pgoff = vmg->pgoff; + pgoff_t pglen = PHYS_PFN(end - start); + bool can_merge_left, can_merge_right; + + mmap_assert_write_locked(vmg->mm); + VM_WARN_ON(vmg->vma); + /* vmi must point at or before the gap. */ + VM_WARN_ON(vma_iter_addr(vmg->vmi) > end); + + vmg->state = VMA_MERGE_NOMERGE; + + /* Special VMAs are unmergeable, also if no prev/next. */ + if ((vmg->flags & VM_SPECIAL) || (!prev && !next)) + return NULL; + + can_merge_left = can_vma_merge_left(vmg); + can_merge_right = can_vma_merge_right(vmg, can_merge_left); + + /* If we can merge with the next VMA, adjust vmg accordingly. */ + if (can_merge_right) { + vmg->end = next->vm_end; + vmg->vma = next; + vmg->pgoff = next->vm_pgoff - pglen; + } + + /* If we can merge with the previous VMA, adjust vmg accordingly. */ + if (can_merge_left) { + vmg->start = prev->vm_start; + vmg->vma = prev; + vmg->pgoff = prev->vm_pgoff; + + vma_prev(vmg->vmi); /* Equivalent to going to the previous range */ + } + + /* + * Now try to expand adjacent VMA(s). This takes care of removing the + * following VMA if we have VMAs on both sides. + */ + if (vmg->vma && !vma_expand(vmg)) { + khugepaged_enter_vma(vmg->vma, vmg->flags); + vmg->state = VMA_MERGE_SUCCESS; + return vmg->vma; + } + + /* If expansion failed, reset state. Allows us to retry merge later. */ + vmg->vma = NULL; + vmg->start = start; + vmg->end = end; + vmg->pgoff = pgoff; + if (vmg->vma == prev) + vma_iter_set(vmg->vmi, start); + + return NULL; +} + /* * vma_expand - Expand an existing VMA * @@ -474,7 +624,11 @@ void validate_mm(struct mm_struct *mm) * vmg->next->vm_end. Checking if the vmg->vma can expand and merge with * vmg->next needs to be handled by the caller. 
* - * Returns: 0 on success + * Returns: 0 on success. + * + * ASSUMPTIONS: + * - The caller must hold a WRITE lock on vmg->vma->mm->mmap_lock. + * - The caller must have set @vmg->vma and @vmg->next. */ int vma_expand(struct vma_merge_struct *vmg) { @@ -484,6 +638,8 @@ int vma_expand(struct vma_merge_struct *vmg) struct vm_area_struct *next = vmg->next; struct vma_prepare vp; + mmap_assert_write_locked(vmg->mm); + vma_start_write(vma); if (next && (vma != next) && (vmg->end == next->vm_end)) { int ret; @@ -516,6 +672,7 @@ int vma_expand(struct vma_merge_struct *vmg) return 0; nomem: + vmg->state = VMA_MERGE_ERROR_NOMEM; if (anon_dup) unlink_anon_vmas(anon_dup); return -ENOMEM; @@ -1029,6 +1186,8 @@ static struct vm_area_struct *vma_merge(struct vma_merge_struct *vmg) pgoff_t pglen = PHYS_PFN(end - addr); long adj_start = 0; + vmg->state = VMA_MERGE_NOMERGE; + /* * We later require that vma->vm_flags == vm_flags, * so this tests vma->vm_flags & VM_SPECIAL, too. @@ -1180,13 +1339,19 @@ static struct vm_area_struct *vma_merge(struct vma_merge_struct *vmg) vma_complete(&vp, vmg->vmi, mm); validate_mm(mm); khugepaged_enter_vma(res, vmg->flags); + + vmg->state = VMA_MERGE_SUCCESS; return res; prealloc_fail: + vmg->state = VMA_MERGE_ERROR_NOMEM; if (anon_dup) unlink_anon_vmas(anon_dup); anon_vma_fail: + if (err == -ENOMEM) + vmg->state = VMA_MERGE_ERROR_NOMEM; + vma_iter_set(vmg->vmi, addr); vma_iter_load(vmg->vmi); return NULL; @@ -1293,22 +1458,6 @@ struct vm_area_struct return vma_modify(&vmg); } -/* - * Attempt to merge a newly mapped VMA with those adjacent to it. The caller - * must ensure that [start, end) does not overlap any existing VMA. - */ -struct vm_area_struct -*vma_merge_new_vma(struct vma_iterator *vmi, struct vm_area_struct *prev, - struct vm_area_struct *vma, unsigned long start, - unsigned long end, pgoff_t pgoff) -{ - VMG_VMA_STATE(vmg, vmi, prev, vma, start, end); - - vmg.pgoff = pgoff; - - return vma_merge(&vmg); -} - /* * Expand vma by delta bytes, potentially merging with an immediately adjacent * VMA with identical properties. @@ -1319,8 +1468,10 @@ struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi, { VMG_VMA_STATE(vmg, vmi, vma, vma, vma->vm_end, vma->vm_end + delta); - /* vma is specified as prev, so case 1 or 2 will apply. */ - return vma_merge(&vmg); + vmg.next = vma_iter_next_rewind(vmi, NULL); + vmg.vma = NULL; /* We use the VMA to populate VMG fields only. */ + + return vma_merge_new_range(&vmg); } void unlink_file_vma_batch_init(struct unlink_vma_file_batch *vb) @@ -1421,9 +1572,10 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, struct vm_area_struct *vma = *vmap; unsigned long vma_start = vma->vm_start; struct mm_struct *mm = vma->vm_mm; - struct vm_area_struct *new_vma, *prev; + struct vm_area_struct *new_vma; bool faulted_in_anon_vma = true; VMA_ITERATOR(vmi, mm, addr); + VMG_VMA_STATE(vmg, &vmi, NULL, vma, addr, addr + len); /* * If anonymous vma has not yet been faulted, update new pgoff @@ -1434,11 +1586,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, faulted_in_anon_vma = false; } - new_vma = find_vma_prev(mm, addr, &prev); + new_vma = find_vma_prev(mm, addr, &vmg.prev); if (new_vma && new_vma->vm_start < addr + len) return NULL; /* should never get here */ - new_vma = vma_merge_new_vma(&vmi, prev, vma, addr, addr + len, pgoff); + vmg.vma = NULL; /* New VMA range. 
*/ + vmg.pgoff = pgoff; + vmg.next = vma_iter_next_rewind(&vmi, NULL); + new_vma = vma_merge_new_range(&vmg); + if (new_vma) { /* * Source vma may have been merged into new_vma diff --git a/mm/vma.h b/mm/vma.h index 218d884ff5ff..82354fe5edd0 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -52,6 +52,13 @@ struct vma_munmap_struct { unsigned long data_vm; }; +enum vma_merge_state { + VMA_MERGE_START, + VMA_MERGE_ERROR_NOMEM, + VMA_MERGE_NOMERGE, + VMA_MERGE_SUCCESS, +}; + /* Represents a VMA merge operation. */ struct vma_merge_struct { struct mm_struct *mm; @@ -68,8 +75,14 @@ struct vma_merge_struct { struct mempolicy *policy; struct vm_userfaultfd_ctx uffd_ctx; struct anon_vma_name *anon_name; + enum vma_merge_state state; }; +static inline bool vmg_nomem(struct vma_merge_struct *vmg) +{ + return vmg->state == VMA_MERGE_ERROR_NOMEM; +} + /* Assumes addr >= vma->vm_start. */ static inline pgoff_t vma_pgoff_offset(struct vm_area_struct *vma, unsigned long addr) @@ -85,6 +98,7 @@ static inline pgoff_t vma_pgoff_offset(struct vm_area_struct *vma, .end = end_, \ .flags = flags_, \ .pgoff = pgoff_, \ + .state = VMA_MERGE_START, \ } #define VMG_VMA_STATE(name, vmi_, prev_, vma_, start_, end_) \ @@ -103,6 +117,7 @@ static inline pgoff_t vma_pgoff_offset(struct vm_area_struct *vma, .policy = vma_policy(vma_), \ .uffd_ctx = vma_->vm_userfaultfd_ctx, \ .anon_name = anon_vma_name(vma_), \ + .state = VMA_MERGE_START, \ } #ifdef CONFIG_DEBUG_VM_MAPLE_TREE @@ -309,10 +324,7 @@ struct vm_area_struct unsigned long new_flags, struct vm_userfaultfd_ctx new_ctx); -struct vm_area_struct -*vma_merge_new_vma(struct vma_iterator *vmi, struct vm_area_struct *prev, - struct vm_area_struct *vma, unsigned long start, - unsigned long end, pgoff_t pgoff); +struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg); struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi, struct vm_area_struct *vma, @@ -505,6 +517,34 @@ struct vm_area_struct *vma_iter_prev_range(struct vma_iterator *vmi) return mas_prev_range(&vmi->mas, 0); } +/* + * Retrieve the next VMA and rewind the iterator to end of the previous VMA, or + * if no previous VMA, to index 0. + */ +static inline +struct vm_area_struct *vma_iter_next_rewind(struct vma_iterator *vmi, + struct vm_area_struct **pprev) +{ + struct vm_area_struct *next = vma_next(vmi); + struct vm_area_struct *prev = vma_prev(vmi); + + /* + * Consider the case where no previous VMA exists. We advance to the + * next VMA, skipping any gap, then rewind to the start of the range. + * + * If we were to unconditionally advance to the next range we'd wind up + * at the next VMA again, so we check to ensure there is a previous VMA + * to skip over. 
+ */ + if (prev) + vma_iter_next_range(vmi); + + if (pprev) + *pprev = prev; + + return next; +} + #ifdef CONFIG_64BIT static inline bool vma_is_sealed(struct vm_area_struct *vma) diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index f6c4706a861f..b7cdafec09af 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -101,9 +101,9 @@ static struct vm_area_struct *merge_new(struct vma_merge_struct *vmg) */ vmg->next = vma_next(vmg->vmi); vmg->prev = vma_prev(vmg->vmi); + vma_iter_next_range(vmg->vmi); - vma_iter_set(vmg->vmi, vmg->start); - return vma_merge(vmg); + return vma_merge_new_range(vmg); } /* @@ -162,10 +162,14 @@ static struct vm_area_struct *try_merge_new_vma(struct mm_struct *mm, merged = merge_new(vmg); if (merged) { *was_merged = true; + ASSERT_EQ(vmg->state, VMA_MERGE_SUCCESS); return merged; } *was_merged = false; + + ASSERT_EQ(vmg->state, VMA_MERGE_NOMERGE); + return alloc_and_link_vma(mm, start, end, pgoff, flags); } @@ -595,6 +599,7 @@ static bool test_vma_merge_special_flags(void) vmg.flags = flags | special_flag; vma = merge_new(&vmg); ASSERT_EQ(vma, NULL); + ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); } /* 2. Modify VMA with special flag that would otherwise merge. */ @@ -616,6 +621,7 @@ static bool test_vma_merge_special_flags(void) vmg.flags = flags | special_flag; vma = merge_existing(&vmg); ASSERT_EQ(vma, NULL); + ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); } cleanup_mm(&mm, &vmi); @@ -708,6 +714,7 @@ static bool test_vma_merge_with_close(void) /* The next VMA having a close() operator should cause the merge to fail.*/ ASSERT_EQ(merge_new(&vmg), NULL); + ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); /* Now create the VMA so we can merge via modified flags */ vmg_set_range(&vmg, 0x1000, 0x2000, 1, flags); @@ -719,6 +726,7 @@ static bool test_vma_merge_with_close(void) * also fail. */ ASSERT_EQ(merge_existing(&vmg), NULL); + ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); /* SCENARIO B * @@ -744,6 +752,7 @@ static bool test_vma_merge_with_close(void) vmg.vma = vma; /* Make sure merge does not occur. 
*/ ASSERT_EQ(merge_existing(&vmg), NULL); + ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); cleanup_mm(&mm, &vmi); return true; @@ -792,6 +801,7 @@ static bool test_vma_merge_new_with_close(void) vmg_set_range(&vmg, 0x2000, 0x5000, 2, flags); vma = merge_new(&vmg); ASSERT_NE(vma, NULL); + ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma->vm_start, 0); ASSERT_EQ(vma->vm_end, 0x5000); ASSERT_EQ(vma->vm_pgoff, 0); @@ -831,6 +841,7 @@ static bool test_merge_existing(void) vmg.prev = vma; vma->anon_vma = &dummy_anon_vma; ASSERT_EQ(merge_existing(&vmg), vma_next); + ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_next->vm_start, 0x3000); ASSERT_EQ(vma_next->vm_end, 0x9000); ASSERT_EQ(vma_next->vm_pgoff, 3); @@ -861,6 +872,7 @@ static bool test_merge_existing(void) vmg.vma = vma; vma->anon_vma = &dummy_anon_vma; ASSERT_EQ(merge_existing(&vmg), vma_next); + ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_next->vm_start, 0x2000); ASSERT_EQ(vma_next->vm_end, 0x9000); ASSERT_EQ(vma_next->vm_pgoff, 2); @@ -889,6 +901,7 @@ static bool test_merge_existing(void) vma->anon_vma = &dummy_anon_vma; ASSERT_EQ(merge_existing(&vmg), vma_prev); + ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_prev->vm_start, 0); ASSERT_EQ(vma_prev->vm_end, 0x6000); ASSERT_EQ(vma_prev->vm_pgoff, 0); @@ -920,6 +933,7 @@ static bool test_merge_existing(void) vmg.vma = vma; vma->anon_vma = &dummy_anon_vma; ASSERT_EQ(merge_existing(&vmg), vma_prev); + ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_prev->vm_start, 0); ASSERT_EQ(vma_prev->vm_end, 0x7000); ASSERT_EQ(vma_prev->vm_pgoff, 0); @@ -948,6 +962,7 @@ static bool test_merge_existing(void) vmg.vma = vma; vma->anon_vma = &dummy_anon_vma; ASSERT_EQ(merge_existing(&vmg), vma_prev); + ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_prev->vm_start, 0); ASSERT_EQ(vma_prev->vm_end, 0x9000); ASSERT_EQ(vma_prev->vm_pgoff, 0); @@ -981,31 +996,37 @@ static bool test_merge_existing(void) vmg.prev = vma; vmg.vma = vma; ASSERT_EQ(merge_existing(&vmg), NULL); + ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); vmg_set_range(&vmg, 0x5000, 0x6000, 5, flags); vmg.prev = vma; vmg.vma = vma; ASSERT_EQ(merge_existing(&vmg), NULL); + ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); vmg_set_range(&vmg, 0x6000, 0x7000, 6, flags); vmg.prev = vma; vmg.vma = vma; ASSERT_EQ(merge_existing(&vmg), NULL); + ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); vmg_set_range(&vmg, 0x4000, 0x7000, 4, flags); vmg.prev = vma; vmg.vma = vma; ASSERT_EQ(merge_existing(&vmg), NULL); + ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); vmg_set_range(&vmg, 0x4000, 0x6000, 4, flags); vmg.prev = vma; vmg.vma = vma; ASSERT_EQ(merge_existing(&vmg), NULL); + ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); vmg_set_range(&vmg, 0x5000, 0x6000, 5, flags); vmg.prev = vma; vmg.vma = vma; ASSERT_EQ(merge_existing(&vmg), NULL); + ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); ASSERT_EQ(cleanup_mm(&mm, &vmi), 3); @@ -1071,6 +1092,7 @@ static bool test_anon_vma_non_mergeable(void) vmg.vma = vma; ASSERT_EQ(merge_existing(&vmg), vma_prev); + ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_prev->vm_start, 0); ASSERT_EQ(vma_prev->vm_end, 0x7000); ASSERT_EQ(vma_prev->vm_pgoff, 0); @@ -1106,6 +1128,7 @@ static bool test_anon_vma_non_mergeable(void) vmg.prev = vma_prev; ASSERT_EQ(merge_new(&vmg), vma_prev); + ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_prev->vm_start, 0); ASSERT_EQ(vma_prev->vm_end, 0x7000); ASSERT_EQ(vma_prev->vm_pgoff, 0); @@ -1181,6 +1204,7 @@ static bool test_dup_anon_vma(void) vmg.vma = vma; 
ASSERT_EQ(merge_existing(&vmg), vma_prev); + ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_prev->vm_start, 0); ASSERT_EQ(vma_prev->vm_end, 0x8000); @@ -1209,6 +1233,7 @@ static bool test_dup_anon_vma(void) vmg.vma = vma; ASSERT_EQ(merge_existing(&vmg), vma_prev); + ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_prev->vm_start, 0); ASSERT_EQ(vma_prev->vm_end, 0x8000); @@ -1236,6 +1261,7 @@ static bool test_dup_anon_vma(void) vmg.vma = vma; ASSERT_EQ(merge_existing(&vmg), vma_prev); + ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_prev->vm_start, 0); ASSERT_EQ(vma_prev->vm_end, 0x5000); @@ -1263,6 +1289,7 @@ static bool test_dup_anon_vma(void) vmg.vma = vma; ASSERT_EQ(merge_existing(&vmg), vma_next); + ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); ASSERT_EQ(vma_next->vm_start, 0x3000); ASSERT_EQ(vma_next->vm_end, 0x8000); @@ -1303,6 +1330,7 @@ static bool test_vmi_prealloc_fail(void) /* This will cause the merge to fail. */ ASSERT_EQ(merge_existing(&vmg), NULL); + ASSERT_EQ(vmg.state, VMA_MERGE_ERROR_NOMEM); /* We will already have assigned the anon_vma. */ ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); /* And it was both cloned and unlinked. */ @@ -1327,6 +1355,7 @@ static bool test_vmi_prealloc_fail(void) fail_prealloc = true; ASSERT_EQ(expand_existing(&vmg), -ENOMEM); + ASSERT_EQ(vmg.state, VMA_MERGE_ERROR_NOMEM); ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(dummy_anon_vma.was_cloned); diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index a3c262c6eb73..c5b9da034511 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -740,6 +740,12 @@ static inline void vma_iter_free(struct vma_iterator *vmi) mas_destroy(&vmi->mas); } +static inline +struct vm_area_struct *vma_iter_next_range(struct vma_iterator *vmi) +{ + return mas_next_range(&vmi->mas, ULONG_MAX); +} + static inline void vm_acct_memory(long pages) { } From 25d3925fa51ddeec9d5a2c9b89140f5218ec3ef4 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 30 Aug 2024 19:10:19 +0100 Subject: [PATCH 273/416] mm: make vma_prepare() and friends static and internal to vma.c Now we have abstracted merge behaviour for new VMA ranges, we are able to render vma_prepare(), init_vma_prep(), vma_complete(), can_vma_merge_before() and can_vma_merge_after() static and internal to vma.c. These are internal implementation details of kernel VMA manipulation and merging mechanisms and thus should not be exposed. This also renders the functions userland testable. Link: https://lkml.kernel.org/r/7f7f1c34ce10405a6aab2714c505af3cf41b7851.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Cc: Liam R. Howlett Cc: Mark Brown Cc: Vlastimil Babka Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/vma.c | 318 +++++++++++++++++++++++++++---------------------------- mm/vma.h | 25 ----- 2 files changed, 158 insertions(+), 185 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index 749c4881fd60..eb4f32705a41 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -104,8 +104,7 @@ static void init_multi_vma_prep(struct vma_prepare *vp, * * We assume the vma may be removed as part of the merge. 
*/ -bool -can_vma_merge_before(struct vma_merge_struct *vmg) +static bool can_vma_merge_before(struct vma_merge_struct *vmg) { pgoff_t pglen = PHYS_PFN(vmg->end - vmg->start); @@ -127,7 +126,7 @@ can_vma_merge_before(struct vma_merge_struct *vmg) * * We assume that vma is not removed as part of the merge. */ -bool can_vma_merge_after(struct vma_merge_struct *vmg) +static bool can_vma_merge_after(struct vma_merge_struct *vmg) { if (is_mergeable_vma(vmg, /* merge_next = */ false) && is_mergeable_anon_vma(vmg->anon_vma, vmg->prev->anon_vma, vmg->prev)) { @@ -137,6 +136,162 @@ bool can_vma_merge_after(struct vma_merge_struct *vmg) return false; } +static void __vma_link_file(struct vm_area_struct *vma, + struct address_space *mapping) +{ + if (vma_is_shared_maywrite(vma)) + mapping_allow_writable(mapping); + + flush_dcache_mmap_lock(mapping); + vma_interval_tree_insert(vma, &mapping->i_mmap); + flush_dcache_mmap_unlock(mapping); +} + +/* + * Requires inode->i_mapping->i_mmap_rwsem + */ +static void __remove_shared_vm_struct(struct vm_area_struct *vma, + struct address_space *mapping) +{ + if (vma_is_shared_maywrite(vma)) + mapping_unmap_writable(mapping); + + flush_dcache_mmap_lock(mapping); + vma_interval_tree_remove(vma, &mapping->i_mmap); + flush_dcache_mmap_unlock(mapping); +} + +/* + * vma_prepare() - Helper function for handling locking VMAs prior to altering + * @vp: The initialized vma_prepare struct + */ +static void vma_prepare(struct vma_prepare *vp) +{ + if (vp->file) { + uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end); + + if (vp->adj_next) + uprobe_munmap(vp->adj_next, vp->adj_next->vm_start, + vp->adj_next->vm_end); + + i_mmap_lock_write(vp->mapping); + if (vp->insert && vp->insert->vm_file) { + /* + * Put into interval tree now, so instantiated pages + * are visible to arm/parisc __flush_dcache_page + * throughout; but we cannot insert into address + * space until vma start or end is updated. + */ + __vma_link_file(vp->insert, + vp->insert->vm_file->f_mapping); + } + } + + if (vp->anon_vma) { + anon_vma_lock_write(vp->anon_vma); + anon_vma_interval_tree_pre_update_vma(vp->vma); + if (vp->adj_next) + anon_vma_interval_tree_pre_update_vma(vp->adj_next); + } + + if (vp->file) { + flush_dcache_mmap_lock(vp->mapping); + vma_interval_tree_remove(vp->vma, &vp->mapping->i_mmap); + if (vp->adj_next) + vma_interval_tree_remove(vp->adj_next, + &vp->mapping->i_mmap); + } + +} + +/* + * vma_complete- Helper function for handling the unlocking after altering VMAs, + * or for inserting a VMA. + * + * @vp: The vma_prepare struct + * @vmi: The vma iterator + * @mm: The mm_struct + */ +static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, + struct mm_struct *mm) +{ + if (vp->file) { + if (vp->adj_next) + vma_interval_tree_insert(vp->adj_next, + &vp->mapping->i_mmap); + vma_interval_tree_insert(vp->vma, &vp->mapping->i_mmap); + flush_dcache_mmap_unlock(vp->mapping); + } + + if (vp->remove && vp->file) { + __remove_shared_vm_struct(vp->remove, vp->mapping); + if (vp->remove2) + __remove_shared_vm_struct(vp->remove2, vp->mapping); + } else if (vp->insert) { + /* + * split_vma has split insert from vma, and needs + * us to insert it before dropping the locks + * (it may either follow vma or precede it). 
+ */ + vma_iter_store(vmi, vp->insert); + mm->map_count++; + } + + if (vp->anon_vma) { + anon_vma_interval_tree_post_update_vma(vp->vma); + if (vp->adj_next) + anon_vma_interval_tree_post_update_vma(vp->adj_next); + anon_vma_unlock_write(vp->anon_vma); + } + + if (vp->file) { + i_mmap_unlock_write(vp->mapping); + uprobe_mmap(vp->vma); + + if (vp->adj_next) + uprobe_mmap(vp->adj_next); + } + + if (vp->remove) { +again: + vma_mark_detached(vp->remove, true); + if (vp->file) { + uprobe_munmap(vp->remove, vp->remove->vm_start, + vp->remove->vm_end); + fput(vp->file); + } + if (vp->remove->anon_vma) + anon_vma_merge(vp->vma, vp->remove); + mm->map_count--; + mpol_put(vma_policy(vp->remove)); + if (!vp->remove2) + WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end); + vm_area_free(vp->remove); + + /* + * In mprotect's case 6 (see comments on vma_merge), + * we are removing both mid and next vmas + */ + if (vp->remove2) { + vp->remove = vp->remove2; + vp->remove2 = NULL; + goto again; + } + } + if (vp->insert && vp->file) + uprobe_mmap(vp->insert); +} + +/* + * init_vma_prep() - Initializer wrapper for vma_prepare struct + * @vp: The vma_prepare struct + * @vma: The vma that will be altered once locked + */ +static void init_vma_prep(struct vma_prepare *vp, struct vm_area_struct *vma) +{ + init_multi_vma_prep(vp, vma, NULL, NULL, NULL); +} + /* * Can the proposed VMA be merged with the left (previous) VMA taking into * account the start position of the proposed range. @@ -315,31 +470,6 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, return __split_vma(vmi, vma, addr, new_below); } -/* - * init_vma_prep() - Initializer wrapper for vma_prepare struct - * @vp: The vma_prepare struct - * @vma: The vma that will be altered once locked - */ -void init_vma_prep(struct vma_prepare *vp, - struct vm_area_struct *vma) -{ - init_multi_vma_prep(vp, vma, NULL, NULL, NULL); -} - -/* - * Requires inode->i_mapping->i_mmap_rwsem - */ -static void __remove_shared_vm_struct(struct vm_area_struct *vma, - struct address_space *mapping) -{ - if (vma_is_shared_maywrite(vma)) - mapping_unmap_writable(mapping); - - flush_dcache_mmap_lock(mapping); - vma_interval_tree_remove(vma, &mapping->i_mmap); - flush_dcache_mmap_unlock(mapping); -} - /* * vma has some anon_vma assigned, and is already inserted on that * anon_vma's interval trees. @@ -372,60 +502,6 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma) anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root); } -static void __vma_link_file(struct vm_area_struct *vma, - struct address_space *mapping) -{ - if (vma_is_shared_maywrite(vma)) - mapping_allow_writable(mapping); - - flush_dcache_mmap_lock(mapping); - vma_interval_tree_insert(vma, &mapping->i_mmap); - flush_dcache_mmap_unlock(mapping); -} - -/* - * vma_prepare() - Helper function for handling locking VMAs prior to altering - * @vp: The initialized vma_prepare struct - */ -void vma_prepare(struct vma_prepare *vp) -{ - if (vp->file) { - uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end); - - if (vp->adj_next) - uprobe_munmap(vp->adj_next, vp->adj_next->vm_start, - vp->adj_next->vm_end); - - i_mmap_lock_write(vp->mapping); - if (vp->insert && vp->insert->vm_file) { - /* - * Put into interval tree now, so instantiated pages - * are visible to arm/parisc __flush_dcache_page - * throughout; but we cannot insert into address - * space until vma start or end is updated. 
- */ - __vma_link_file(vp->insert, - vp->insert->vm_file->f_mapping); - } - } - - if (vp->anon_vma) { - anon_vma_lock_write(vp->anon_vma); - anon_vma_interval_tree_pre_update_vma(vp->vma); - if (vp->adj_next) - anon_vma_interval_tree_pre_update_vma(vp->adj_next); - } - - if (vp->file) { - flush_dcache_mmap_lock(vp->mapping); - vma_interval_tree_remove(vp->vma, &vp->mapping->i_mmap); - if (vp->adj_next) - vma_interval_tree_remove(vp->adj_next, - &vp->mapping->i_mmap); - } - -} - /* * dup_anon_vma() - Helper function to duplicate anon_vma * @dst: The destination VMA @@ -715,84 +791,6 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, return 0; } -/* - * vma_complete- Helper function for handling the unlocking after altering VMAs, - * or for inserting a VMA. - * - * @vp: The vma_prepare struct - * @vmi: The vma iterator - * @mm: The mm_struct - */ -void vma_complete(struct vma_prepare *vp, - struct vma_iterator *vmi, struct mm_struct *mm) -{ - if (vp->file) { - if (vp->adj_next) - vma_interval_tree_insert(vp->adj_next, - &vp->mapping->i_mmap); - vma_interval_tree_insert(vp->vma, &vp->mapping->i_mmap); - flush_dcache_mmap_unlock(vp->mapping); - } - - if (vp->remove && vp->file) { - __remove_shared_vm_struct(vp->remove, vp->mapping); - if (vp->remove2) - __remove_shared_vm_struct(vp->remove2, vp->mapping); - } else if (vp->insert) { - /* - * split_vma has split insert from vma, and needs - * us to insert it before dropping the locks - * (it may either follow vma or precede it). - */ - vma_iter_store(vmi, vp->insert); - mm->map_count++; - } - - if (vp->anon_vma) { - anon_vma_interval_tree_post_update_vma(vp->vma); - if (vp->adj_next) - anon_vma_interval_tree_post_update_vma(vp->adj_next); - anon_vma_unlock_write(vp->anon_vma); - } - - if (vp->file) { - i_mmap_unlock_write(vp->mapping); - uprobe_mmap(vp->vma); - - if (vp->adj_next) - uprobe_mmap(vp->adj_next); - } - - if (vp->remove) { -again: - vma_mark_detached(vp->remove, true); - if (vp->file) { - uprobe_munmap(vp->remove, vp->remove->vm_start, - vp->remove->vm_end); - fput(vp->file); - } - if (vp->remove->anon_vma) - anon_vma_merge(vp->vma, vp->remove); - mm->map_count--; - mpol_put(vma_policy(vp->remove)); - if (!vp->remove2) - WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end); - vm_area_free(vp->remove); - - /* - * In mprotect's case 6 (see comments on vma_merge), - * we are removing both mid and next vmas - */ - if (vp->remove2) { - vp->remove = vp->remove2; - vp->remove2 = NULL; - goto again; - } - } - if (vp->insert && vp->file) - uprobe_mmap(vp->insert); -} - static inline void vms_clear_ptes(struct vma_munmap_struct *vms, struct ma_state *mas_detach, bool mm_wr_locked) { diff --git a/mm/vma.h b/mm/vma.h index 82354fe5edd0..819f994cf727 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -132,17 +132,6 @@ void anon_vma_interval_tree_pre_update_vma(struct vm_area_struct *vma); /* Required for expand_downwards(). */ void anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma); -/* Required for do_brk_flags(). */ -void vma_prepare(struct vma_prepare *vp); - -/* Required for do_brk_flags(). */ -void init_vma_prep(struct vma_prepare *vp, - struct vm_area_struct *vma); - -/* Required for do_brk_flags(). 
*/ -void vma_complete(struct vma_prepare *vp, - struct vma_iterator *vmi, struct mm_struct *mm); - int vma_expand(struct vma_merge_struct *vmg); int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long start, unsigned long end, pgoff_t pgoff); @@ -277,20 +266,6 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed); void unmap_region(struct ma_state *mas, struct vm_area_struct *vma, struct vm_area_struct *prev, struct vm_area_struct *next); -/* - * Can we merge the VMA described by vmg into the following VMA vmg->next? - * - * Required by mmap_region(). - */ -bool can_vma_merge_before(struct vma_merge_struct *vmg); - -/* - * Can we merge the VMA described by vmg into the preceding VMA vmg->prev? - * - * Required by mmap_region() and do_brk_flags(). - */ -bool can_vma_merge_after(struct vma_merge_struct *vmg); - /* We are about to modify the VMA's flags. */ struct vm_area_struct *vma_modify_flags(struct vma_iterator *vmi, struct vm_area_struct *prev, struct vm_area_struct *vma, From 65e0aa64df916861ad8579e23f885e56e5ec8647 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 30 Aug 2024 19:10:20 +0100 Subject: [PATCH 274/416] mm: introduce commit_merge(), abstracting final commit of merge Pull the part of vma_expand() which actually commits the merge operation, that is inserts it into the maple tree and sets the VMA's vma->vm_start and vma->vm_end parameters, into its own function. We implement only the parts needed for vma_expand() which now as a result of previous work is also the means by which new VMA ranges are merged. The next commit in the series will implement merging of existing ranges which will extend commit_merge() to accommodate this case and result in all merges using this common code. Link: https://lkml.kernel.org/r/7b985a20dfa549e3c370cd274d732b64c44f6dbd.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Cc: Liam R. Howlett Cc: Mark Brown Cc: Vlastimil Babka Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/vma.c | 39 +++++++++++++++++++++++++++------------ 1 file changed, 27 insertions(+), 12 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index eb4f32705a41..566cad2338dd 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -585,6 +585,31 @@ void validate_mm(struct mm_struct *mm) } #endif /* CONFIG_DEBUG_VM_MAPLE_TREE */ +/* Actually perform the VMA merge operation. */ +static int commit_merge(struct vma_merge_struct *vmg, + struct vm_area_struct *remove) +{ + struct vma_prepare vp; + + init_multi_vma_prep(&vp, vmg->vma, NULL, remove, NULL); + + /* Note: vma iterator must be pointing to 'start'. 
*/ + vma_iter_config(vmg->vmi, vmg->start, vmg->end); + + if (vma_iter_prealloc(vmg->vmi, vmg->vma)) + return -ENOMEM; + + vma_prepare(&vp); + vma_adjust_trans_huge(vmg->vma, vmg->start, vmg->end, 0); + vma_set_range(vmg->vma, vmg->start, vmg->end, vmg->pgoff); + + vma_iter_store(vmg->vmi, vmg->vma); + + vma_complete(&vp, vmg->vmi, vmg->vma->vm_mm); + + return 0; +} + /* * vma_merge_new_range - Attempt to merge a new VMA into address space * @@ -712,7 +737,6 @@ int vma_expand(struct vma_merge_struct *vmg) bool remove_next = false; struct vm_area_struct *vma = vmg->vma; struct vm_area_struct *next = vmg->next; - struct vma_prepare vp; mmap_assert_write_locked(vmg->mm); @@ -727,24 +751,15 @@ int vma_expand(struct vma_merge_struct *vmg) return ret; } - init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL); /* Not merging but overwriting any part of next is not handled. */ - VM_WARN_ON(next && !vp.remove && + VM_WARN_ON(next && !remove_next && next != vma && vmg->end > next->vm_start); /* Only handles expanding */ VM_WARN_ON(vma->vm_start < vmg->start || vma->vm_end > vmg->end); - /* Note: vma iterator must be pointing to 'start' */ - vma_iter_config(vmg->vmi, vmg->start, vmg->end); - if (vma_iter_prealloc(vmg->vmi, vma)) + if (commit_merge(vmg, remove_next ? next : NULL)) goto nomem; - vma_prepare(&vp); - vma_adjust_trans_huge(vma, vmg->start, vmg->end, 0); - vma_set_range(vma, vmg->start, vmg->end, vmg->pgoff); - vma_iter_store(vmg->vmi, vma); - - vma_complete(&vp, vmg->vmi, vma->vm_mm); return 0; nomem: From cc8cb3697a8d8eabe1fb9acb8768b11c1ab607d8 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 30 Aug 2024 19:10:21 +0100 Subject: [PATCH 275/416] mm: refactor vma_merge() into modify-only vma_merge_existing_range() The existing vma_merge() function is no longer required to handle what were previously referred to as cases 1-3 (i.e. the merging of a new VMA), as this is now handled by vma_merge_new_vma(). Additionally, simplify the convoluted control flow of the original, maintaining identical logic only expressed more clearly and doing away with a complicated set of cases, rather logically examining each possible outcome - merging of both the previous and subsequent VMA, merging of the previous VMA and merging of the subsequent VMA alone. We now utilise the previously implemented commit_merge() function to share logic with vma_expand() de-duplicating code and providing less surface area for bugs and confusion. In order to do so, we adjust this function to accept parameters specific to merging existing ranges. Link: https://lkml.kernel.org/r/2cf6016b7bfcc4965fc3cde10827560c42e4f12c.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Cc: Liam R. Howlett Cc: Mark Brown Cc: Vlastimil Babka Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Matthew Wilcox Cc: "Paul E. McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/vma.c | 508 ++++++++++++++++++++-------------------- tools/testing/vma/vma.c | 9 +- 2 files changed, 264 insertions(+), 253 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index 566cad2338dd..393bef832604 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -587,29 +587,278 @@ void validate_mm(struct mm_struct *mm) /* Actually perform the VMA merge operation. 
*/ static int commit_merge(struct vma_merge_struct *vmg, - struct vm_area_struct *remove) + struct vm_area_struct *adjust, + struct vm_area_struct *remove, + struct vm_area_struct *remove2, + long adj_start, + bool expanded) { struct vma_prepare vp; - init_multi_vma_prep(&vp, vmg->vma, NULL, remove, NULL); + init_multi_vma_prep(&vp, vmg->vma, adjust, remove, remove2); - /* Note: vma iterator must be pointing to 'start'. */ - vma_iter_config(vmg->vmi, vmg->start, vmg->end); + VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma && + vp.anon_vma != adjust->anon_vma); + + if (expanded) { + /* Note: vma iterator must be pointing to 'start'. */ + vma_iter_config(vmg->vmi, vmg->start, vmg->end); + } else { + vma_iter_config(vmg->vmi, adjust->vm_start + adj_start, + adjust->vm_end); + } if (vma_iter_prealloc(vmg->vmi, vmg->vma)) return -ENOMEM; vma_prepare(&vp); - vma_adjust_trans_huge(vmg->vma, vmg->start, vmg->end, 0); + vma_adjust_trans_huge(vmg->vma, vmg->start, vmg->end, adj_start); vma_set_range(vmg->vma, vmg->start, vmg->end, vmg->pgoff); - vma_iter_store(vmg->vmi, vmg->vma); + if (expanded) + vma_iter_store(vmg->vmi, vmg->vma); + + if (adj_start) { + adjust->vm_start += adj_start; + adjust->vm_pgoff += PHYS_PFN(adj_start); + if (adj_start < 0) { + WARN_ON(expanded); + vma_iter_store(vmg->vmi, adjust); + } + } vma_complete(&vp, vmg->vmi, vmg->vma->vm_mm); return 0; } +/* + * vma_merge_existing_range - Attempt to merge VMAs based on a VMA having its + * attributes modified. + * + * @vmg: Describes the modifications being made to a VMA and associated + * metadata. + * + * When the attributes of a range within a VMA change, then it might be possible + * for immediately adjacent VMAs to be merged into that VMA due to having + * identical properties. + * + * This function checks for the existence of any such mergeable VMAs and updates + * the maple tree describing the @vmg->vma->vm_mm address space to account for + * this, as well as any VMAs shrunk/expanded/deleted as a result of this merge. + * + * As part of this operation, if a merge occurs, the @vmg object will have its + * vma, start, end, and pgoff fields modified to execute the merge. Subsequent + * calls to this function should reset these fields. + * + * Returns: The merged VMA if merge succeeds, or NULL otherwise. + * + * ASSUMPTIONS: + * - The caller must assign the VMA to be modifed to @vmg->vma. + * - The caller must have set @vmg->prev to the previous VMA, if there is one. + * - The caller must not set @vmg->next, as we determine this. + * - The caller must hold a WRITE lock on the mm_struct->mmap_lock. + * - vmi must be positioned within [@vmg->vma->vm_start, @vmg->vma->vm_end). + */ +static struct vm_area_struct *vma_merge_existing_range(struct vma_merge_struct *vmg) +{ + struct vm_area_struct *vma = vmg->vma; + struct vm_area_struct *prev = vmg->prev; + struct vm_area_struct *next, *res; + struct vm_area_struct *anon_dup = NULL; + struct vm_area_struct *adjust = NULL; + unsigned long start = vmg->start; + unsigned long end = vmg->end; + bool left_side = vma && start == vma->vm_start; + bool right_side = vma && end == vma->vm_end; + int err = 0; + long adj_start = 0; + bool merge_will_delete_vma, merge_will_delete_next; + bool merge_left, merge_right, merge_both; + bool expanded; + + mmap_assert_write_locked(vmg->mm); + VM_WARN_ON(!vma); /* We are modifying a VMA, so caller must specify. */ + VM_WARN_ON(vmg->next); /* We set this. 
*/ + VM_WARN_ON(prev && start <= prev->vm_start); + VM_WARN_ON(start >= end); + /* + * If vma == prev, then we are offset into a VMA. Otherwise, if we are + * not, we must span a portion of the VMA. + */ + VM_WARN_ON(vma && ((vma != prev && vmg->start != vma->vm_start) || + vmg->end > vma->vm_end)); + /* The vmi must be positioned within vmg->vma. */ + VM_WARN_ON(vma && !(vma_iter_addr(vmg->vmi) >= vma->vm_start && + vma_iter_addr(vmg->vmi) < vma->vm_end)); + + vmg->state = VMA_MERGE_NOMERGE; + + /* + * If a special mapping or if the range being modified is neither at the + * furthermost left or right side of the VMA, then we have no chance of + * merging and should abort. + */ + if (vmg->flags & VM_SPECIAL || (!left_side && !right_side)) + return NULL; + + if (left_side) + merge_left = can_vma_merge_left(vmg); + else + merge_left = false; + + if (right_side) { + next = vmg->next = vma_iter_next_range(vmg->vmi); + vma_iter_prev_range(vmg->vmi); + + merge_right = can_vma_merge_right(vmg, merge_left); + } else { + merge_right = false; + next = NULL; + } + + if (merge_left) /* If merging prev, position iterator there. */ + vma_prev(vmg->vmi); + else if (!merge_right) /* If we have nothing to merge, abort. */ + return NULL; + + merge_both = merge_left && merge_right; + /* If we span the entire VMA, a merge implies it will be deleted. */ + merge_will_delete_vma = left_side && right_side; + /* + * If we merge both VMAs, then next is also deleted. This implies + * merge_will_delete_vma also. + */ + merge_will_delete_next = merge_both; + + /* No matter what happens, we will be adjusting vma. */ + vma_start_write(vma); + + if (merge_left) + vma_start_write(prev); + + if (merge_right) + vma_start_write(next); + + if (merge_both) { + /* + * |<----->| + * |-------*********-------| + * prev vma next + * extend delete delete + */ + + vmg->vma = prev; + vmg->start = prev->vm_start; + vmg->end = next->vm_end; + vmg->pgoff = prev->vm_pgoff; + + /* + * We already ensured anon_vma compatibility above, so now it's + * simply a case of, if prev has no anon_vma object, which of + * next or vma contains the anon_vma we must duplicate. + */ + err = dup_anon_vma(prev, next->anon_vma ? next : vma, &anon_dup); + } else if (merge_left) { + /* + * |<----->| OR + * |<--------->| + * |-------************* + * prev vma + * extend shrink/delete + */ + + vmg->vma = prev; + vmg->start = prev->vm_start; + vmg->pgoff = prev->vm_pgoff; + + if (merge_will_delete_vma) { + /* + * can_vma_merge_after() assumed we would not be + * removing vma, so it skipped the check for + * vm_ops->close, but we are removing vma. + */ + if (vma->vm_ops && vma->vm_ops->close) + err = -EINVAL; + } else { + adjust = vma; + adj_start = vmg->end - vma->vm_start; + } + + if (!err) + err = dup_anon_vma(prev, vma, &anon_dup); + } else { /* merge_right */ + /* + * |<----->| OR + * |<--------->| + * *************-------| + * vma next + * shrink/delete extend + */ + + pgoff_t pglen = PHYS_PFN(vmg->end - vmg->start); + + VM_WARN_ON(!merge_right); + /* If we are offset into a VMA, then prev must be vma. */ + VM_WARN_ON(vmg->start > vma->vm_start && prev && vma != prev); + + if (merge_will_delete_vma) { + vmg->vma = next; + vmg->end = next->vm_end; + vmg->pgoff = next->vm_pgoff - pglen; + } else { + /* + * We shrink vma and expand next. + * + * IMPORTANT: This is the ONLY case where the final + * merged VMA is NOT vmg->vma, but rather vmg->next. 
+ */ + + vmg->start = vma->vm_start; + vmg->end = start; + vmg->pgoff = vma->vm_pgoff; + + adjust = next; + adj_start = -(vma->vm_end - start); + } + + err = dup_anon_vma(next, vma, &anon_dup); + } + + if (err) + goto abort; + + /* + * In nearly all cases, we expand vmg->vma. There is one exception - + * merge_right where we partially span the VMA. In this case we shrink + * the end of vmg->vma and adjust the start of vmg->next accordingly. + */ + expanded = !merge_right || merge_will_delete_vma; + + if (commit_merge(vmg, adjust, + merge_will_delete_vma ? vma : NULL, + merge_will_delete_next ? next : NULL, + adj_start, expanded)) { + if (anon_dup) + unlink_anon_vmas(anon_dup); + + vmg->state = VMA_MERGE_ERROR_NOMEM; + return NULL; + } + + res = merge_left ? prev : next; + khugepaged_enter_vma(res, vmg->flags); + + vmg->state = VMA_MERGE_SUCCESS; + return res; + +abort: + vma_iter_set(vmg->vmi, start); + vma_iter_load(vmg->vmi); + vmg->state = VMA_MERGE_ERROR_NOMEM; + return NULL; +} + /* * vma_merge_new_range - Attempt to merge a new VMA into address space * @@ -757,7 +1006,7 @@ int vma_expand(struct vma_merge_struct *vmg) /* Only handles expanding */ VM_WARN_ON(vma->vm_start < vmg->start || vma->vm_end > vmg->end); - if (commit_merge(vmg, remove_next ? next : NULL)) + if (commit_merge(vmg, NULL, remove_next ? next : NULL, NULL, 0, true)) goto nomem; return 0; @@ -1127,249 +1376,6 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock); } -/* - * Given a mapping request (addr,end,vm_flags,file,pgoff,anon_name), - * figure out whether that can be merged with its predecessor or its - * successor. Or both (it neatly fills a hole). - * - * In most cases - when called for mmap, brk or mremap - [addr,end) is - * certain not to be mapped by the time vma_merge is called; but when - * called for mprotect, it is certain to be already mapped (either at - * an offset within prev, or at the start of next), and the flags of - * this area are about to be changed to vm_flags - and the no-change - * case has already been eliminated. - * - * The following mprotect cases have to be considered, where **** is - * the area passed down from mprotect_fixup, never extending beyond one - * vma, PPPP is the previous vma, CCCC is a concurrent vma that starts - * at the same address as **** and is of the same or larger span, and - * NNNN the next vma after ****: - * - * **** **** **** - * PPPPPPNNNNNN PPPPPPNNNNNN PPPPPPCCCCCC - * cannot merge might become might become - * PPNNNNNNNNNN PPPPPPPPPPCC - * mmap, brk or case 4 below case 5 below - * mremap move: - * **** **** - * PPPP NNNN PPPPCCCCNNNN - * might become might become - * PPPPPPPPPPPP 1 or PPPPPPPPPPPP 6 or - * PPPPPPPPNNNN 2 or PPPPPPPPNNNN 7 or - * PPPPNNNNNNNN 3 PPPPNNNNNNNN 8 - * - * It is important for case 8 that the vma CCCC overlapping the - * region **** is never going to extended over NNNN. Instead NNNN must - * be extended in region **** and CCCC must be removed. This way in - * all cases where vma_merge succeeds, the moment vma_merge drops the - * rmap_locks, the properties of the merged vma will be already - * correct for the whole merged range. Some of those properties like - * vm_page_prot/vm_flags may be accessed by rmap_walks and they must - * be correct for the whole merged range immediately after the - * rmap_locks are released. 
Otherwise if NNNN would be removed and - * CCCC would be extended over the NNNN range, remove_migration_ptes - * or other rmap walkers (if working on addresses beyond the "end" - * parameter) may establish ptes with the wrong permissions of CCCC - * instead of the right permissions of NNNN. - * - * In the code below: - * PPPP is represented by *prev - * CCCC is represented by *curr or not represented at all (NULL) - * NNNN is represented by *next or not represented at all (NULL) - * **** is not represented - it will be merged and the vma containing the - * area is returned, or the function will return NULL - */ -static struct vm_area_struct *vma_merge(struct vma_merge_struct *vmg) -{ - struct mm_struct *mm = vmg->mm; - struct vm_area_struct *prev = vmg->prev; - struct vm_area_struct *curr, *next, *res; - struct vm_area_struct *vma, *adjust, *remove, *remove2; - struct vm_area_struct *anon_dup = NULL; - struct vma_prepare vp; - pgoff_t vma_pgoff; - int err = 0; - bool merge_prev = false; - bool merge_next = false; - bool vma_expanded = false; - unsigned long addr = vmg->start; - unsigned long end = vmg->end; - unsigned long vma_start = addr; - unsigned long vma_end = end; - pgoff_t pglen = PHYS_PFN(end - addr); - long adj_start = 0; - - vmg->state = VMA_MERGE_NOMERGE; - - /* - * We later require that vma->vm_flags == vm_flags, - * so this tests vma->vm_flags & VM_SPECIAL, too. - */ - if (vmg->flags & VM_SPECIAL) - return NULL; - - /* Does the input range span an existing VMA? (cases 5 - 8) */ - curr = find_vma_intersection(mm, prev ? prev->vm_end : 0, end); - - if (!curr || /* cases 1 - 4 */ - end == curr->vm_end) /* cases 6 - 8, adjacent VMA */ - next = vmg->next = vma_lookup(mm, end); - else - next = vmg->next = NULL; /* case 5 */ - - if (prev) { - vma_start = prev->vm_start; - vma_pgoff = prev->vm_pgoff; - - /* Can we merge the predecessor? */ - if (addr == prev->vm_end && can_vma_merge_after(vmg)) { - merge_prev = true; - vma_prev(vmg->vmi); - } - } - - /* Can we merge the successor? */ - if (next && can_vma_merge_before(vmg)) { - merge_next = true; - } - - /* Verify some invariant that must be enforced by the caller. */ - VM_WARN_ON(prev && addr <= prev->vm_start); - VM_WARN_ON(curr && (addr != curr->vm_start || end > curr->vm_end)); - VM_WARN_ON(addr >= end); - - if (!merge_prev && !merge_next) - return NULL; /* Not mergeable. */ - - if (merge_prev) - vma_start_write(prev); - - res = vma = prev; - remove = remove2 = adjust = NULL; - - /* Can we merge both the predecessor and the successor? */ - if (merge_prev && merge_next && - is_mergeable_anon_vma(prev->anon_vma, next->anon_vma, NULL)) { - vma_start_write(next); - remove = next; /* case 1 */ - vma_end = next->vm_end; - err = dup_anon_vma(prev, next, &anon_dup); - if (curr) { /* case 6 */ - vma_start_write(curr); - remove = curr; - remove2 = next; - /* - * Note that the dup_anon_vma below cannot overwrite err - * since the first caller would do nothing unless next - * has an anon_vma. 
- */ - if (!next->anon_vma) - err = dup_anon_vma(prev, curr, &anon_dup); - } - } else if (merge_prev) { /* case 2 */ - if (curr) { - vma_start_write(curr); - if (end == curr->vm_end) { /* case 7 */ - /* - * can_vma_merge_after() assumed we would not be - * removing prev vma, so it skipped the check - * for vm_ops->close, but we are removing curr - */ - if (curr->vm_ops && curr->vm_ops->close) - err = -EINVAL; - remove = curr; - } else { /* case 5 */ - adjust = curr; - adj_start = (end - curr->vm_start); - } - if (!err) - err = dup_anon_vma(prev, curr, &anon_dup); - } - } else { /* merge_next */ - vma_start_write(next); - res = next; - if (prev && addr < prev->vm_end) { /* case 4 */ - vma_start_write(prev); - vma_end = addr; - adjust = next; - adj_start = -(prev->vm_end - addr); - err = dup_anon_vma(next, prev, &anon_dup); - } else { - /* - * Note that cases 3 and 8 are the ONLY ones where prev - * is permitted to be (but is not necessarily) NULL. - */ - vma = next; /* case 3 */ - vma_start = addr; - vma_end = next->vm_end; - vma_pgoff = next->vm_pgoff - pglen; - if (curr) { /* case 8 */ - vma_pgoff = curr->vm_pgoff; - vma_start_write(curr); - remove = curr; - err = dup_anon_vma(next, curr, &anon_dup); - } - } - } - - /* Error in anon_vma clone. */ - if (err) - goto anon_vma_fail; - - if (vma_start < vma->vm_start || vma_end > vma->vm_end) - vma_expanded = true; - - if (vma_expanded) { - vma_iter_config(vmg->vmi, vma_start, vma_end); - } else { - vma_iter_config(vmg->vmi, adjust->vm_start + adj_start, - adjust->vm_end); - } - - if (vma_iter_prealloc(vmg->vmi, vma)) - goto prealloc_fail; - - init_multi_vma_prep(&vp, vma, adjust, remove, remove2); - VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma && - vp.anon_vma != adjust->anon_vma); - - vma_prepare(&vp); - vma_adjust_trans_huge(vma, vma_start, vma_end, adj_start); - vma_set_range(vma, vma_start, vma_end, vma_pgoff); - - if (vma_expanded) - vma_iter_store(vmg->vmi, vma); - - if (adj_start) { - adjust->vm_start += adj_start; - adjust->vm_pgoff += adj_start >> PAGE_SHIFT; - if (adj_start < 0) { - WARN_ON(vma_expanded); - vma_iter_store(vmg->vmi, next); - } - } - - vma_complete(&vp, vmg->vmi, mm); - validate_mm(mm); - khugepaged_enter_vma(res, vmg->flags); - - vmg->state = VMA_MERGE_SUCCESS; - return res; - -prealloc_fail: - vmg->state = VMA_MERGE_ERROR_NOMEM; - if (anon_dup) - unlink_anon_vmas(anon_dup); - -anon_vma_fail: - if (err == -ENOMEM) - vmg->state = VMA_MERGE_ERROR_NOMEM; - - vma_iter_set(vmg->vmi, addr); - vma_iter_load(vmg->vmi); - return NULL; -} - /* * We are about to modify one or multiple of a VMA's flags, policy, userfaultfd * context and anonymous VMA name within the range [start, end). @@ -1389,7 +1395,7 @@ static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg) struct vm_area_struct *merged; /* First, try to merge. */ - merged = vma_merge(vmg); + merged = vma_merge_existing_range(vmg); if (merged) return merged; diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index b7cdafec09af..25a95d9901ea 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -112,7 +112,7 @@ static struct vm_area_struct *merge_new(struct vma_merge_struct *vmg) */ static struct vm_area_struct *merge_existing(struct vma_merge_struct *vmg) { - return vma_merge(vmg); + return vma_merge_existing_range(vmg); } /* @@ -752,7 +752,12 @@ static bool test_vma_merge_with_close(void) vmg.vma = vma; /* Make sure merge does not occur. 
*/ ASSERT_EQ(merge_existing(&vmg), NULL); - ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); + /* + * Initially this is misapprehended as an out of memory report, as the + * close() check is handled in the same way as anon_vma duplication + * failures, however a subsequent patch resolves this. + */ + ASSERT_EQ(vmg.state, VMA_MERGE_ERROR_NOMEM); cleanup_mm(&mm, &vmi); return true; From 01c373e9a5ce2273812eaf83036c5357829fb3f7 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 30 Aug 2024 19:10:22 +0100 Subject: [PATCH 276/416] mm: rework vm_ops->close() handling on VMA merge In commit 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test") we relaxed the VMA merge rules for VMAs possessing a vm_ops->close() hook, permitting this operation in instances where we wouldn't delete the VMA as part of the merge operation. This was later corrected in commit fc0c8f9089c2 ("mm, mmap: fix vma_merge() case 7 with vma_ops->close") to account for a subtle case that the previous commit had not taken into account. In both instances, we first rely on is_mergeable_vma() to determine whether we might be dealing with a VMA that might be removed, taking advantage of the fact that a 'previous' VMA will never be deleted, only VMAs that follow it. The second patch corrects the instance where a merge of the previous VMA into a subsequent one did not correctly check whether the subsequent VMA had a vm_ops->close() handler. Both changes prevent merge cases that are actually permissible (for instance a merge of a VMA into a following VMA with a vm_ops->close(), but with no previous VMA, which would result in the next VMA being extended, not deleted). In addition, both changes fail to consider the case where a VMA that would otherwise be merged with the previous and next VMA might have vm_ops->close(), on the assumption that for this to be the case, all three would have to have the same vma->vm_file to be mergeable and thus the same vm_ops. And in addition both changes operate at 50,000 feet, trying to guess whether a VMA will be deleted. As we have majorly refactored the VMA merge operation and de-duplicated code to the point where we know precisely where deletions will occur, this patch removes the aforementioned checks altogether and instead explicitly checks whether a VMA will be deleted. In cases where a reduced merge is still possible (where we merge both previous and next VMA but the next VMA has a vm_ops->close hook, meaning we could just merge the previous and current VMA), we do so, otherwise the merge is not permitted. We take advantage of our userland testing to assert that this functions correctly - replacing the previous limited vm_ops->close() tests with tests for every single case where we delete a VMA. We also update all testing for both new and modified VMAs to set vma->vm_ops->close() in every single instance where this would not prevent the merge, to assert that we never do so. Link: https://lkml.kernel.org/r/9f96b8cfeef3d14afabddac3d6144afdfbef2e22.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Acked-by: Vlastimil Babka Cc: Liam R. Howlett Cc: Mark Brown Cc: Bert Karwatzki Cc: Jeff Xu Cc: Jiri Olsa Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Matthew Wilcox Cc: "Paul E. 
McKenney" Cc: Paul Moore Cc: Sidhartha Kumar Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/vma.c | 57 +++++++++----- tools/testing/vma/vma.c | 166 +++++++++++++++++++++++++++++++--------- 2 files changed, 164 insertions(+), 59 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index 393bef832604..8d1686fc8d5a 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -10,14 +10,6 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_next) { struct vm_area_struct *vma = merge_next ? vmg->next : vmg->prev; - /* - * If the vma has a ->close operation then the driver probably needs to - * release per-vma resources, so we don't attempt to merge those if the - * caller indicates the current vma may be removed as part of the merge, - * which is the case if we are attempting to merge the next VMA into - * this one. - */ - bool may_remove_vma = merge_next; if (!mpol_equal(vmg->policy, vma_policy(vma))) return false; @@ -33,8 +25,6 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_nex return false; if (vma->vm_file != vmg->file) return false; - if (may_remove_vma && vma->vm_ops && vma->vm_ops->close) - return false; if (!is_mergeable_vm_userfaultfd_ctx(vma, vmg->uffd_ctx)) return false; if (!anon_vma_name_eq(anon_vma_name(vma), vmg->anon_name)) @@ -632,6 +622,12 @@ static int commit_merge(struct vma_merge_struct *vmg, return 0; } +/* We can only remove VMAs when merging if they do not have a close hook. */ +static bool can_merge_remove_vma(struct vm_area_struct *vma) +{ + return !vma->vm_ops || !vma->vm_ops->close; +} + /* * vma_merge_existing_range - Attempt to merge VMAs based on a VMA having its * attributes modified. @@ -725,12 +721,30 @@ static struct vm_area_struct *vma_merge_existing_range(struct vma_merge_struct * merge_both = merge_left && merge_right; /* If we span the entire VMA, a merge implies it will be deleted. */ merge_will_delete_vma = left_side && right_side; + + /* + * If we need to remove vma in its entirety but are unable to do so, + * we have no sensible recourse but to abort the merge. + */ + if (merge_will_delete_vma && !can_merge_remove_vma(vma)) + return NULL; + /* * If we merge both VMAs, then next is also deleted. This implies * merge_will_delete_vma also. */ merge_will_delete_next = merge_both; + /* + * If we cannot delete next, then we can reduce the operation to merging + * prev and vma (thereby deleting vma). + */ + if (merge_will_delete_next && !can_merge_remove_vma(next)) { + merge_will_delete_next = false; + merge_right = false; + merge_both = false; + } + /* No matter what happens, we will be adjusting vma. */ vma_start_write(vma); @@ -772,21 +786,12 @@ static struct vm_area_struct *vma_merge_existing_range(struct vma_merge_struct * vmg->start = prev->vm_start; vmg->pgoff = prev->vm_pgoff; - if (merge_will_delete_vma) { - /* - * can_vma_merge_after() assumed we would not be - * removing vma, so it skipped the check for - * vm_ops->close, but we are removing vma. 
- */ - if (vma->vm_ops && vma->vm_ops->close) - err = -EINVAL; - } else { + if (!merge_will_delete_vma) { adjust = vma; adj_start = vmg->end - vma->vm_start; } - if (!err) - err = dup_anon_vma(prev, vma, &anon_dup); + err = dup_anon_vma(prev, vma, &anon_dup); } else { /* merge_right */ /* * |<----->| OR @@ -940,6 +945,14 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg) vmg->vma = prev; vmg->pgoff = prev->vm_pgoff; + /* + * If this merge would result in removal of the next VMA but we + * are not permitted to do so, reduce the operation to merging + * prev and vma. + */ + if (can_merge_right && !can_merge_remove_vma(next)) + vmg->end = end; + vma_prev(vmg->vmi); /* Equivalent to going to the previous range */ } @@ -994,6 +1007,8 @@ int vma_expand(struct vma_merge_struct *vmg) int ret; remove_next = true; + /* This should already have been checked by this point. */ + VM_WARN_ON(!can_merge_remove_vma(next)); vma_start_write(next); ret = dup_anon_vma(vma, next, &anon_dup); if (ret) diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 25a95d9901ea..c53f220eb6cc 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -387,6 +387,9 @@ static bool test_merge_new(void) struct anon_vma_chain dummy_anon_vma_chain_d = { .anon_vma = &dummy_anon_vma, }; + const struct vm_operations_struct vm_ops = { + .close = dummy_close, + }; int count; struct vm_area_struct *vma, *vma_a, *vma_b, *vma_c, *vma_d; bool merged; @@ -430,6 +433,7 @@ static bool test_merge_new(void) * 0123456789abc * AA*B DD CC */ + vma_a->vm_ops = &vm_ops; /* This should have no impact. */ vma_b->anon_vma = &dummy_anon_vma; vma = try_merge_new_vma(&mm, &vmg, 0x2000, 0x3000, 2, flags, &merged); ASSERT_EQ(vma, vma_a); @@ -466,6 +470,7 @@ static bool test_merge_new(void) * AAAAA *DD CC */ vma_d->anon_vma = &dummy_anon_vma; + vma_d->vm_ops = &vm_ops; /* This should have no impact. */ vma = try_merge_new_vma(&mm, &vmg, 0x6000, 0x7000, 6, flags, &merged); ASSERT_EQ(vma, vma_d); /* Prepend. */ @@ -483,6 +488,7 @@ static bool test_merge_new(void) * 0123456789abc * AAAAA*DDD CC */ + vma_d->vm_ops = NULL; /* This would otherwise degrade the merge. */ vma = try_merge_new_vma(&mm, &vmg, 0x5000, 0x6000, 5, flags, &merged); ASSERT_EQ(vma, vma_a); /* Merge with A, delete D. */ @@ -640,13 +646,11 @@ static bool test_vma_merge_with_close(void) const struct vm_operations_struct vm_ops = { .close = dummy_close, }; - struct vm_area_struct *vma_next = - alloc_and_link_vma(&mm, 0x2000, 0x3000, 2, flags); - struct vm_area_struct *vma; + struct vm_area_struct *vma_prev, *vma_next, *vma; /* - * When we merge VMAs we sometimes have to delete others as part of the - * operation. + * When merging VMAs we are not permitted to remove any VMA that has a + * vm_ops->close() hook. * * Considering the two possible adjacent VMAs to which a VMA can be * merged: @@ -697,28 +701,52 @@ static bool test_vma_merge_with_close(void) * would be set too, and thus scenario A would pick this up. */ - ASSERT_NE(vma_next, NULL); - /* - * SCENARIO A + * The only case of a new VMA merge that results in a VMA being deleted + * is one where both the previous and next VMAs are merged - in this + * instance the next VMA is deleted, and the previous VMA is extended. * - * 0123 - * *N + * If we are unable to do so, we reduce the operation to simply + * extending the prev VMA and not merging next. + * + * 0123456789 + * PPP**NNNN + * -> + * 0123456789 + * PPPPPPNNN */ - /* Make the next VMA have a close() callback. 
*/ + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma_next = alloc_and_link_vma(&mm, 0x5000, 0x9000, 5, flags); vma_next->vm_ops = &vm_ops; - /* Our proposed VMA has characteristics that would otherwise be merged. */ - vmg_set_range(&vmg, 0x1000, 0x2000, 1, flags); + vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + ASSERT_EQ(merge_new(&vmg), vma_prev); + ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); + ASSERT_EQ(vma_prev->vm_start, 0); + ASSERT_EQ(vma_prev->vm_end, 0x5000); + ASSERT_EQ(vma_prev->vm_pgoff, 0); - /* The next VMA having a close() operator should cause the merge to fail.*/ - ASSERT_EQ(merge_new(&vmg), NULL); - ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); + ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); - /* Now create the VMA so we can merge via modified flags */ - vmg_set_range(&vmg, 0x1000, 0x2000, 1, flags); - vma = alloc_and_link_vma(&mm, 0x1000, 0x2000, 1, flags); + /* + * When modifying an existing VMA there are further cases where we + * delete VMAs. + * + * <> + * 0123456789 + * PPPVV + * + * In this instance, if vma has a close hook, the merge simply cannot + * proceed. + */ + + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma->vm_ops = &vm_ops; + + vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg.prev = vma_prev; vmg.vma = vma; /* @@ -728,38 +756,90 @@ static bool test_vma_merge_with_close(void) ASSERT_EQ(merge_existing(&vmg), NULL); ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); - /* SCENARIO B + ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); + + /* + * This case is mirrored if merging with next. * - * 0123 - * P* + * <> + * 0123456789 + * VVNNNN * - * In order for this scenario to trigger, the VMA currently being - * modified must also have a .close(). + * In this instance, if vma has a close hook, the merge simply cannot + * proceed. */ - /* Reset VMG state. */ - vmg_set_range(&vmg, 0x1000, 0x2000, 1, flags); - /* - * Make next unmergeable, and don't let the scenario A check pick this - * up, we want to reproduce scenario B only. - */ - vma_next->vm_ops = NULL; - vma_next->__vm_flags &= ~VM_MAYWRITE; - /* Allocate prev. */ - vmg.prev = alloc_and_link_vma(&mm, 0, 0x1000, 0, flags); - /* Assign a vm_ops->close() function to VMA explicitly. */ + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma_next = alloc_and_link_vma(&mm, 0x5000, 0x9000, 5, flags); vma->vm_ops = &vm_ops; + + vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); vmg.vma = vma; - /* Make sure merge does not occur. */ ASSERT_EQ(merge_existing(&vmg), NULL); /* * Initially this is misapprehended as an out of memory report, as the * close() check is handled in the same way as anon_vma duplication * failures, however a subsequent patch resolves this. */ - ASSERT_EQ(vmg.state, VMA_MERGE_ERROR_NOMEM); + ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); + + ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); + + /* + * Finally, we consider two variants of the case where we modify a VMA + * to merge with both the previous and next VMAs. + * + * The first variant is where vma has a close hook. In this instance, no + * merge can proceed. 
+ * + * <> + * 0123456789 + * PPPVVNNNN + */ + + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma_next = alloc_and_link_vma(&mm, 0x5000, 0x9000, 5, flags); + vma->vm_ops = &vm_ops; + + vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg.prev = vma_prev; + vmg.vma = vma; + + ASSERT_EQ(merge_existing(&vmg), NULL); + ASSERT_EQ(vmg.state, VMA_MERGE_NOMERGE); + + ASSERT_EQ(cleanup_mm(&mm, &vmi), 3); + + /* + * The second variant is where next has a close hook. In this instance, + * we reduce the operation to a merge between prev and vma. + * + * <> + * 0123456789 + * PPPVVNNNN + * -> + * 0123456789 + * PPPPPNNNN + */ + + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vma_next = alloc_and_link_vma(&mm, 0x5000, 0x9000, 5, flags); + vma_next->vm_ops = &vm_ops; + + vmg_set_range(&vmg, 0x3000, 0x5000, 3, flags); + vmg.prev = vma_prev; + vmg.vma = vma; + + ASSERT_EQ(merge_existing(&vmg), vma_prev); + ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); + ASSERT_EQ(vma_prev->vm_start, 0); + ASSERT_EQ(vma_prev->vm_end, 0x5000); + ASSERT_EQ(vma_prev->vm_pgoff, 0); + + ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); - cleanup_mm(&mm, &vmi); return true; } @@ -828,6 +908,9 @@ static bool test_merge_existing(void) .mm = &mm, .vmi = &vmi, }; + const struct vm_operations_struct vm_ops = { + .close = dummy_close, + }; /* * Merge right case - partial span. @@ -840,7 +923,9 @@ static bool test_merge_existing(void) * VNNNNNN */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, flags); + vma->vm_ops = &vm_ops; /* This should have no impact. */ vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, flags); + vma_next->vm_ops = &vm_ops; /* This should have no impact. */ vmg_set_range(&vmg, 0x3000, 0x6000, 3, flags); vmg.vma = vma; vmg.prev = vma; @@ -873,6 +958,7 @@ static bool test_merge_existing(void) */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, flags); vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, flags); + vma_next->vm_ops = &vm_ops; /* This should have no impact. */ vmg_set_range(&vmg, 0x2000, 0x6000, 2, flags); vmg.vma = vma; vma->anon_vma = &dummy_anon_vma; @@ -899,7 +985,9 @@ static bool test_merge_existing(void) * PPPPPPV */ vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, flags); + vma->vm_ops = &vm_ops; /* This should have no impact. */ vmg_set_range(&vmg, 0x3000, 0x6000, 3, flags); vmg.prev = vma_prev; vmg.vma = vma; @@ -932,6 +1020,7 @@ static bool test_merge_existing(void) * PPPPPPP */ vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, flags); vmg_set_range(&vmg, 0x3000, 0x7000, 3, flags); vmg.prev = vma_prev; @@ -960,6 +1049,7 @@ static bool test_merge_existing(void) * PPPPPPPPPP */ vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, flags); + vma_prev->vm_ops = &vm_ops; /* This should have no impact. 
*/ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, flags); vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, flags); vmg_set_range(&vmg, 0x3000, 0x7000, 3, flags);
From 7de8728f55ffd6497d4ef590218fad8da7f1af92 Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Tue, 27 Aug 2024 21:09:16 +0200 Subject: [PATCH 277/416] mm: vmalloc: refactor vm_area_alloc_pages() function
The aim is to simplify the vm_area_alloc_pages() function and make it less confusing, as it has become rather cluttered: - eliminate a "bulk_gfp" variable and do not overwrite a gfp flag for the bulk allocator; - drop the __GFP_NOFAIL flag for high-order-page requests on the upper layer. It becomes less spread between levels when it comes to __GFP_NOFAIL allocations; - add a comment about a fallback path if the high-order attempt is unsuccessful, because for such cases __GFP_NOFAIL is dropped; - fix a typo in a comment.
Link: https://lkml.kernel.org/r/20240827190916.34242-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) Acked-by: Michal Hocko Reviewed-by: Baoquan He Cc: Christoph Hellwig Cc: Oleksiy Avramchenko Signed-off-by: Andrew Morton --- mm/vmalloc.c | 37 +++++++++++++++++-------------------- 1 file changed, 17 insertions(+), 20 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index dfa8f09e7e9d..264a2627bea4 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3515,8 +3515,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid, unsigned int order, unsigned int nr_pages, struct page **pages) { unsigned int nr_allocated = 0; - gfp_t alloc_gfp = gfp; - bool nofail = gfp & __GFP_NOFAIL; struct page *page; int i; @@ -3527,9 +3525,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid, * more permissive. */ if (!order) { - /* bulk allocator doesn't support nofail req. officially */ - gfp_t bulk_gfp = gfp & ~__GFP_NOFAIL; - while (nr_allocated < nr_pages) { unsigned int nr, nr_pages_request; @@ -3547,12 +3542,11 @@ vm_area_alloc_pages(gfp_t gfp, int nid, * but mempolicy wants to alloc memory by interleaving. */ if (IS_ENABLED(CONFIG_NUMA) && nid == NUMA_NO_NODE) - nr = alloc_pages_bulk_array_mempolicy_noprof(bulk_gfp, + nr = alloc_pages_bulk_array_mempolicy_noprof(gfp, nr_pages_request, pages + nr_allocated); - else - nr = alloc_pages_bulk_array_node_noprof(bulk_gfp, nid, + nr = alloc_pages_bulk_array_node_noprof(gfp, nid, nr_pages_request, pages + nr_allocated); @@ -3566,30 +3560,24 @@ vm_area_alloc_pages(gfp_t gfp, int nid, if (nr != nr_pages_request) break; } - } else if (gfp & __GFP_NOFAIL) { - /* - * Higher order nofail allocations are really expensive and - * potentially dangerous (pre-mature OOM, disruptive reclaim - * and compaction etc. - */ - alloc_gfp &= ~__GFP_NOFAIL; } /* High-order pages or fallback path if "bulk" fails. */ while (nr_allocated < nr_pages) { - if (!nofail && fatal_signal_pending(current)) + if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current)) break; if (nid == NUMA_NO_NODE) - page = alloc_pages_noprof(alloc_gfp, order); + page = alloc_pages_noprof(gfp, order); else - page = alloc_pages_node_noprof(nid, alloc_gfp, order); + page = alloc_pages_node_noprof(nid, gfp, order); + if (unlikely(!page)) break; /* * Higher order allocations must be able to be treated as - * indepdenent small pages by callers (as they can with + * independent small pages by callers (as they can with * small-page vmallocs). Some drivers do their own refcounting * on vmalloc_to_page() pages, some use page->mapping, * page->lru, etc.
@@ -3650,7 +3638,16 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, set_vm_area_page_order(area, page_shift - PAGE_SHIFT); page_order = vm_area_page_order(area); - area->nr_pages = vm_area_alloc_pages(gfp_mask | __GFP_NOWARN, + /* + * Higher order nofail allocations are really expensive and + * potentially dangerous (pre-mature OOM, disruptive reclaim + * and compaction etc. + * + * Please note, the __vmalloc_node_range_noprof() falls-back + * to order-0 pages if high-order attempt is unsuccessful. + */ + area->nr_pages = vm_area_alloc_pages((page_order ? + gfp_mask & ~__GFP_NOFAIL : gfp_mask) | __GFP_NOWARN, node, page_order, nr_small_pages, area->pages); atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
From 073c78edf5bbd32046a117fc7f71312dc3d2c607 Mon Sep 17 00:00:00 2001 From: Yanfei Xu Date: Tue, 27 Aug 2024 19:36:14 +0800 Subject: [PATCH 278/416] memory tier: fix deadlock warning while onlining pages
commit 823430c8e9d9 ("memory tier: consolidate the initialization of memory tiers") introduces a locking change that uses guard(mutex) instead of mutex_lock/unlock() for memory_tier_lock. It unexpectedly expanded the locked region to include the hotplug_memory_notifier(); as a result, lockdep reports a possible ABBA deadlock. Exclude hotplug_memory_notifier() from the locked region to fix it.
The deadlock scenario is that when a memory online event occurs, the execution of the memory notifier takes the read lock of the memory_chain.rwsem, whereas the registration of the memory notifier in memory_tier_init() acquires the write lock of the memory_chain.rwsem while holding memory_tier_lock. Then the memory online event continues to invoke the memory hotplug callback registered by memory_tier_init(). Since this callback tries to acquire the memory_tier_lock, a deadlock occurs.
In fact, this deadlock can't happen because memory_tier_init() always executes before memory online events happen, since subsys_initcall() has a higher priority than module_init().
[ 133.491106] WARNING: possible circular locking dependency detected [ 133.493656] 6.11.0-rc2+ #146 Tainted: G O N [ 133.504290] ------------------------------------------------------ [ 133.515194] (udev-worker)/1133 is trying to acquire lock: [ 133.525715] ffffffff87044e28 (memory_tier_lock){+.+.}-{3:3}, at: memtier_hotplug_callback+0x383/0x4b0 [ 133.536449] [ 133.536449] but task is already holding lock: [ 133.549847] ffffffff875d3310 ((memory_chain).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x60/0xb0 [ 133.556781] [ 133.556781] which lock already depends on the new lock.
[ 133.556781] [ 133.569957] [ 133.569957] the existing dependency chain (in reverse order) is: [ 133.577618] [ 133.577618] -> #1 ((memory_chain).rwsem){++++}-{3:3}: [ 133.584997] down_write+0x97/0x210 [ 133.588647] blocking_notifier_chain_register+0x71/0xd0 [ 133.592537] register_memory_notifier+0x26/0x30 [ 133.596314] memory_tier_init+0x187/0x300 [ 133.599864] do_one_initcall+0x117/0x5d0 [ 133.603399] kernel_init_freeable+0xab0/0xeb0 [ 133.606986] kernel_init+0x28/0x2f0 [ 133.610312] ret_from_fork+0x59/0x90 [ 133.613652] ret_from_fork_asm+0x1a/0x30 [ 133.617012] [ 133.617012] -> #0 (memory_tier_lock){+.+.}-{3:3}: [ 133.623390] __lock_acquire+0x2efd/0x5c60 [ 133.626730] lock_acquire+0x1ce/0x580 [ 133.629757] __mutex_lock+0x15c/0x1490 [ 133.632731] mutex_lock_nested+0x1f/0x30 [ 133.635717] memtier_hotplug_callback+0x383/0x4b0 [ 133.638748] notifier_call_chain+0xbf/0x370 [ 133.641647] blocking_notifier_call_chain+0x76/0xb0 [ 133.644636] memory_notify+0x2e/0x40 [ 133.647427] online_pages+0x597/0x720 [ 133.650246] memory_subsys_online+0x4f6/0x7f0 [ 133.653107] device_online+0x141/0x1d0 [ 133.655831] online_memory_block+0x4d/0x60 [ 133.658616] walk_memory_blocks+0xc0/0x120 [ 133.661419] add_memory_resource+0x51d/0x6c0 [ 133.664202] add_memory_driver_managed+0xf5/0x180 [ 133.667060] dev_dax_kmem_probe+0x7f7/0xb40 [kmem] [ 133.669949] dax_bus_probe+0x147/0x230 [ 133.672687] really_probe+0x27f/0xac0 [ 133.675463] __driver_probe_device+0x1f3/0x460 [ 133.678493] driver_probe_device+0x56/0x1b0 [ 133.681366] __driver_attach+0x277/0x570 [ 133.684149] bus_for_each_dev+0x145/0x1e0 [ 133.686937] driver_attach+0x49/0x60 [ 133.689673] bus_add_driver+0x2f3/0x6b0 [ 133.692421] driver_register+0x170/0x4b0 [ 133.695118] __dax_driver_register+0x141/0x1b0 [ 133.697910] dax_kmem_init+0x54/0xff0 [kmem] [ 133.700794] do_one_initcall+0x117/0x5d0 [ 133.703455] do_init_module+0x277/0x750 [ 133.706054] load_module+0x5d1d/0x74f0 [ 133.708602] init_module_from_file+0x12c/0x1a0 [ 133.711234] idempotent_init_module+0x3f1/0x690 [ 133.713937] __x64_sys_finit_module+0x10e/0x1a0 [ 133.716492] x64_sys_call+0x184d/0x20d0 [ 133.719053] do_syscall_64+0x6d/0x140 [ 133.721537] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 133.724239] [ 133.724239] other info that might help us debug this: [ 133.724239] [ 133.730832] Possible unsafe locking scenario: [ 133.730832] [ 133.735298] CPU0 CPU1 [ 133.737759] ---- ---- [ 133.740165] rlock((memory_chain).rwsem); [ 133.742623] lock(memory_tier_lock); [ 133.745357] lock((memory_chain).rwsem); [ 133.748141] lock(memory_tier_lock); [ 133.750489] [ 133.750489] *** DEADLOCK *** [ 133.750489] [ 133.756742] 6 locks held by (udev-worker)/1133: [ 133.759179] #0: ffff888207be6158 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x26c/0x570 [ 133.762299] #1: ffffffff875b5868 (device_hotplug_lock){+.+.}-{3:3}, at: lock_device_hotplug+0x20/0x30 [ 133.765565] #2: ffff88820cf6a108 (&dev->mutex){....}-{3:3}, at: device_online+0x2f/0x1d0 [ 133.768978] #3: ffffffff86d08ff0 (cpu_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x17/0x30 [ 133.772312] #4: ffffffff8702dfb0 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x23/0x30 [ 133.775544] #5: ffffffff875d3310 ((memory_chain).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x60/0xb0 [ 133.779113] [ 133.779113] stack backtrace: [ 133.783728] CPU: 5 UID: 0 PID: 1133 Comm: (udev-worker) Tainted: G O N 6.11.0-rc2+ #146 [ 133.787220] Tainted: [O]=OOT_MODULE, [N]=TEST [ 133.789948] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 
133.793291] Call Trace: [ 133.795826] [ 133.798284] dump_stack_lvl+0xea/0x150 [ 133.801025] dump_stack+0x19/0x20 [ 133.803609] print_circular_bug+0x477/0x740 [ 133.806341] check_noncircular+0x2f4/0x3e0 [ 133.809056] ? __pfx_check_noncircular+0x10/0x10 [ 133.811866] ? __pfx_lockdep_lock+0x10/0x10 [ 133.814670] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 [ 133.817610] __lock_acquire+0x2efd/0x5c60 [ 133.820339] ? __pfx___lock_acquire+0x10/0x10 [ 133.823128] ? __dax_driver_register+0x141/0x1b0 [ 133.825926] ? do_one_initcall+0x117/0x5d0 [ 133.828648] lock_acquire+0x1ce/0x580 [ 133.831349] ? memtier_hotplug_callback+0x383/0x4b0 [ 133.834293] ? __pfx_lock_acquire+0x10/0x10 [ 133.837134] __mutex_lock+0x15c/0x1490 [ 133.839829] ? memtier_hotplug_callback+0x383/0x4b0 [ 133.842753] ? memtier_hotplug_callback+0x383/0x4b0 [ 133.845602] ? __this_cpu_preempt_check+0x21/0x30 [ 133.848438] ? __pfx___mutex_lock+0x10/0x10 [ 133.851200] ? __pfx_lock_acquire+0x10/0x10 [ 133.853935] ? global_dirty_limits+0xc0/0x160 [ 133.856699] ? __sanitizer_cov_trace_switch+0x58/0xa0 [ 133.859564] mutex_lock_nested+0x1f/0x30 [ 133.862251] ? mutex_lock_nested+0x1f/0x30 [ 133.864964] memtier_hotplug_callback+0x383/0x4b0 [ 133.867752] notifier_call_chain+0xbf/0x370 [ 133.870550] ? writeback_set_ratelimit+0xe8/0x160 [ 133.873372] blocking_notifier_call_chain+0x76/0xb0 [ 133.876311] memory_notify+0x2e/0x40 [ 133.879013] online_pages+0x597/0x720 [ 133.881686] ? irqentry_exit+0x3e/0xa0 [ 133.884397] ? __pfx_online_pages+0x10/0x10 [ 133.887244] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 [ 133.890299] ? mhp_init_memmap_on_memory+0x7a/0x1c0 [ 133.893203] memory_subsys_online+0x4f6/0x7f0 [ 133.896099] ? __pfx_memory_subsys_online+0x10/0x10 [ 133.899039] ? xa_load+0x16d/0x2e0 [ 133.901667] ? __pfx_xa_load+0x10/0x10 [ 133.904366] ? __pfx_memory_subsys_online+0x10/0x10 [ 133.907218] device_online+0x141/0x1d0 [ 133.909845] online_memory_block+0x4d/0x60 [ 133.912494] walk_memory_blocks+0xc0/0x120 [ 133.915104] ? __pfx_online_memory_block+0x10/0x10 [ 133.917776] add_memory_resource+0x51d/0x6c0 [ 133.920404] ? __pfx_add_memory_resource+0x10/0x10 [ 133.923104] ? _raw_write_unlock+0x31/0x60 [ 133.925781] ? register_memory_resource+0x119/0x180 [ 133.928450] add_memory_driver_managed+0xf5/0x180 [ 133.931036] dev_dax_kmem_probe+0x7f7/0xb40 [kmem] [ 133.933665] ? __pfx_dev_dax_kmem_probe+0x10/0x10 [kmem] [ 133.936332] ? __pfx___up_read+0x10/0x10 [ 133.938878] dax_bus_probe+0x147/0x230 [ 133.941332] ? __pfx_dax_bus_probe+0x10/0x10 [ 133.943954] really_probe+0x27f/0xac0 [ 133.946387] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30 [ 133.949106] __driver_probe_device+0x1f3/0x460 [ 133.951704] ? parse_option_str+0x149/0x190 [ 133.954241] driver_probe_device+0x56/0x1b0 [ 133.956749] __driver_attach+0x277/0x570 [ 133.959228] ? __pfx___driver_attach+0x10/0x10 [ 133.961776] bus_for_each_dev+0x145/0x1e0 [ 133.964367] ? __pfx_bus_for_each_dev+0x10/0x10 [ 133.967019] ? __kasan_check_read+0x15/0x20 [ 133.969543] ? _raw_spin_unlock+0x31/0x60 [ 133.972132] driver_attach+0x49/0x60 [ 133.974536] bus_add_driver+0x2f3/0x6b0 [ 133.977044] driver_register+0x170/0x4b0 [ 133.979480] __dax_driver_register+0x141/0x1b0 [ 133.982126] ? __pfx_dax_kmem_init+0x10/0x10 [kmem] [ 133.984724] dax_kmem_init+0x54/0xff0 [kmem] [ 133.987284] ? __pfx_dax_kmem_init+0x10/0x10 [kmem] [ 133.989965] do_one_initcall+0x117/0x5d0 [ 133.992506] ? __pfx_do_one_initcall+0x10/0x10 [ 133.995185] ? __kasan_kmalloc+0x88/0xa0 [ 133.997748] ? kasan_poison+0x3e/0x60 [ 134.000288] ? 
kasan_unpoison+0x2c/0x60 [ 134.002762] ? kasan_poison+0x3e/0x60 [ 134.005202] ? __asan_register_globals+0x62/0x80 [ 134.007753] ? __pfx_dax_kmem_init+0x10/0x10 [kmem] [ 134.010439] do_init_module+0x277/0x750 [ 134.012953] load_module+0x5d1d/0x74f0 [ 134.015406] ? __pfx_load_module+0x10/0x10 [ 134.017887] ? __pfx_ima_post_read_file+0x10/0x10 [ 134.020470] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 [ 134.023127] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 [ 134.025767] ? security_kernel_post_read_file+0xa2/0xd0 [ 134.028429] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 [ 134.031162] ? kernel_read_file+0x503/0x820 [ 134.033645] ? __pfx_kernel_read_file+0x10/0x10 [ 134.036232] ? __pfx___lock_acquire+0x10/0x10 [ 134.038766] init_module_from_file+0x12c/0x1a0 [ 134.041291] ? init_module_from_file+0x12c/0x1a0 [ 134.043936] ? __pfx_init_module_from_file+0x10/0x10 [ 134.046516] ? __this_cpu_preempt_check+0x21/0x30 [ 134.049091] ? __kasan_check_read+0x15/0x20 [ 134.051551] ? do_raw_spin_unlock+0x60/0x210 [ 134.054077] idempotent_init_module+0x3f1/0x690 [ 134.056643] ? __pfx_idempotent_init_module+0x10/0x10 [ 134.059318] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 [ 134.061995] ? __fget_light+0x17d/0x210 [ 134.064428] __x64_sys_finit_module+0x10e/0x1a0 [ 134.066976] x64_sys_call+0x184d/0x20d0 [ 134.069405] do_syscall_64+0x6d/0x140 [ 134.071926] entry_SYSCALL_64_after_hwframe+0x76/0x7e [yanfei.xu@intel.com: add mutex_lock/unlock() pair back] Link: https://lkml.kernel.org/r/20240830102447.1445296-1-yanfei.xu@intel.com Link: https://lkml.kernel.org/r/20240827113614.1343049-1-yanfei.xu@intel.com Fixes: 823430c8e9d9 ("memory tier: consolidate the initialization of memory tiers") Signed-off-by: Yanfei Xu Reviewed-by: "Huang, Ying" Cc: Ho-Ren (Jack) Chuang Cc: Jonathan Cameron Signed-off-by: Andrew Morton --- mm/memory-tiers.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 2a642ea86cb2..ed7607f692bd 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -914,13 +914,14 @@ static int __init memory_tier_init(void) WARN_ON(!node_demotion); #endif - guard(mutex)(&memory_tier_lock); + mutex_lock(&memory_tier_lock); /* * For now we can have 4 faster memory tiers with smaller adistance * than default DRAM tier. */ default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM, &default_memory_types); + mutex_unlock(&memory_tier_lock); if (IS_ERR(default_dram_type)) panic("%s() failed to allocate default DRAM tier\n", __func__); From f22cde4371f3c624e947a35b075c06c771442a43 Mon Sep 17 00:00:00 2001 From: Yujie Liu Date: Tue, 27 Aug 2024 19:29:58 +0800 Subject: [PATCH 279/416] sched/numa: Fix the vma scan starving issue Problem statement: Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the Numa vma scan overhead has been reduced a lot. Meanwhile, the reducing of the vma scan might create less Numa page fault information. The insufficient information makes it harder for the Numa balancer to make decision. Later, commit b7a5b537c55c08 ("sched/numa: Complete scanning of partial VMAs regardless of PID activity") and commit 84db47ca7146d7 ("sched/numa: Fix mm numa_scan_seq based unconditional scan") are found to bring back part of the performance. Recently when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system, a long duration of remote Numa node read was observed by PMU events: A few cores having ~500MB/s remote memory access for ~20 seconds. It causes high core-to-core variance and performance penalty. 
After the investigation, it is found that many vmas are skipped due to the active PID check. According to the trace events, in most cases, vma_is_accessed() returns false because the history access info stored in the pids_active array has been cleared.
Proposal: The main idea is to adjust vma_is_accessed() to let it return true more easily. Thus compare the diff between mm->numa_scan_seq and vma->numab_state->prev_scan_seq. If the diff has exceeded the threshold, scan the vma.
This patch especially helps the cases where there is a small number of threads, like the process-based SPECcpu. Without this patch, if the SPECcpu process accesses the vma at the beginning, then sleeps for a long time, the pids_active array will be cleared. As a result, if this process is woken up again, it never has a chance to set prot_none anymore, because only the first 2 accesses are granted a vma scan: (current->mm->numa_scan_seq - vma->numab_state->start_scan_seq) < 2. To make it worse, no other threads within the task can help set the prot_none. This causes information loss.
Raghavendra helped test the current patch and got positive results on the AMD platform:
autonumabench NUMA01
base patched
Amean syst-NUMA01 194.05 ( 0.00%) 165.11 * 14.92%*
Amean elsp-NUMA01 324.86 ( 0.00%) 315.58 * 2.86%*
Duration User 380345.36 368252.04
Duration System 1358.89 1156.23
Duration Elapsed 2277.45 2213.25
autonumabench NUMA02
Amean syst-NUMA02 1.12 ( 0.00%) 1.09 * 2.93%*
Amean elsp-NUMA02 3.50 ( 0.00%) 3.56 * -1.84%*
Duration User 1513.23 1575.48
Duration System 8.33 8.13
Duration Elapsed 28.59 29.71
kernbench
Amean user-256 22935.42 ( 0.00%) 22535.19 * 1.75%*
Amean syst-256 7284.16 ( 0.00%) 7608.72 * -4.46%*
Amean elsp-256 159.01 ( 0.00%) 158.17 * 0.53%*
Duration User 68816.41 67615.74
Duration System 21873.94 22848.08
Duration Elapsed 506.66 504.55
Intel 256 CPUs/2 Sockets: autonuma benchmark also shows improvements:
v6.10-rc5 v6.10-rc5 +patch
Amean syst-NUMA01 245.85 ( 0.00%) 230.84 * 6.11%*
Amean syst-NUMA01_THREADLOCAL 205.27 ( 0.00%) 191.86 * 6.53%*
Amean syst-NUMA02 18.57 ( 0.00%) 18.09 * 2.58%*
Amean syst-NUMA02_SMT 2.63 ( 0.00%) 2.54 * 3.47%*
Amean elsp-NUMA01 517.17 ( 0.00%) 526.34 * -1.77%*
Amean elsp-NUMA01_THREADLOCAL 99.92 ( 0.00%) 100.59 * -0.67%*
Amean elsp-NUMA02 15.81 ( 0.00%) 15.72 * 0.59%*
Amean elsp-NUMA02_SMT 13.23 ( 0.00%) 12.89 * 2.53%*
v6.10-rc5 v6.10-rc5 +patch
Duration User 1064010.16 1075416.23
Duration System 3307.64 3104.66
Duration Elapsed 4537.54 4604.73
The SPECcpu remote node access issue disappears with the patch applied.
Link: https://lkml.kernel.org/r/20240827112958.181388-1-yu.c.chen@intel.com Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic") Signed-off-by: Chen Yu Co-developed-by: Chen Yu Signed-off-by: Yujie Liu Reported-by: Xiaoping Zhou Reviewed-and-tested-by: Raghavendra K T Acked-by: Mel Gorman Cc: "Chen, Tim C" Cc: Ingo Molnar Cc: Juri Lelli Cc: Peter Zijlstra Cc: Raghavendra K T Cc: Vincent Guittot Signed-off-by: Andrew Morton --- kernel/sched/fair.c | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 20ea6d085305..a1b756f927b2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3187,6 +3187,15 @@ static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma) return true; } + /* + * This vma has not been accessed for a while, and if the number + * the threads in the same process is low, which means no other + * threads can help scan this vma, force a vma scan.
+ */ + if (READ_ONCE(mm->numa_scan_seq) > + (vma->numab_state->prev_scan_seq + get_nr_threads(current))) + return true; + return false; }
From 9cb75552f421c5e9667bc19fb430063cb023c219 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 26 Aug 2024 20:03:28 -0700 Subject: [PATCH 280/416] selftests/damon: add access_memory_even to .gitignore
Patch series "misc fixups for DAMON {self,kunit} tests". This patchset is for minor fixups of DAMON selftests and kunit tests. The first three patches make DAMON selftests more cleanly maintained (patches 1 and 2) without unnecessary warnings (patch 3). The following six patches remove an unnecessary test case (patch 4), handle config combinations that can make tests fail (patches 5-7), reorganize the test files following the new guideline (patch 8), and add a reference kunitconfig for DAMON kunit tests (patch 9).
This patch (of 9): DAMON selftests build access_memory_even, but it's not on the .gitignore list. Add it to make 'git status' output cleaner.
Link: https://lkml.kernel.org/r/20240827030336.7930-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240827030336.7930-2-sj@kernel.org Fixes: c94df805c774 ("selftests/damon: implement a program for even-numbered memory regions access") Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Gow Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/.gitignore | 1 + 1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/damon/.gitignore b/tools/testing/selftests/damon/.gitignore index e65ef9d9cedc..2ab675fecb6b 100644 --- a/tools/testing/selftests/damon/.gitignore +++ b/tools/testing/selftests/damon/.gitignore @@ -3,3 +3,4 @@ huge_count_read_write debugfs_target_ids_read_before_terminate_race debugfs_target_ids_pid_leak access_memory +access_memory_even
From 582c04b07fa91665b4f38682c290fb9f0b085c7e Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 26 Aug 2024 20:03:29 -0700 Subject: [PATCH 281/416] selftests/damon: cleanup __pycache__/ with 'make clean'
Python-based tests create a __pycache__/ directory. Remove it with 'make clean' by defining it as EXTRA_CLEAN.
Link: https://lkml.kernel.org/r/20240827030336.7930-3-sj@kernel.org Fixes: b5906f5f7359 ("selftests/damon: add a test for update_schemes_tried_regions sysfs command") Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Gow Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/Makefile | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/damon/Makefile b/tools/testing/selftests/damon/Makefile index 1e2e98cc809d..5b2a6a5dd1af 100644 --- a/tools/testing/selftests/damon/Makefile +++ b/tools/testing/selftests/damon/Makefile @@ -25,4 +25,6 @@ TEST_PROGS += debugfs_target_ids_pid_leak.sh TEST_PROGS += sysfs_update_removed_scheme_dir.sh TEST_PROGS += sysfs_update_schemes_tried_regions_hang.py +EXTRA_CLEAN = __pycache__ + include ../lib.mk
From 8c211412c5dffd090eaea5ee033cd729f8e5f873 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 26 Aug 2024 20:03:30 -0700 Subject: [PATCH 282/416] selftests/damon: add execute permissions to test scripts
Some test scripts are missing executable permissions. This causes warnings that make the test output unnecessarily verbose. Add executable permissions.
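For reference, the fix amounts to something like the following shell command (an illustrative sketch only; the exact files are the ones whose modes change in the diffstat below, not the whole wildcard):

	chmod +x tools/testing/selftests/damon/*.py tools/testing/selftests/damon/*.sh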
Link: https://lkml.kernel.org/r/20240827030336.7930-4-sj@kernel.org Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Gow Signed-off-by: Andrew Morton --- tools/testing/selftests/damon/damon_nr_regions.py | 0 tools/testing/selftests/damon/damos_apply_interval.py | 0 tools/testing/selftests/damon/damos_quota.py | 0 tools/testing/selftests/damon/damos_quota_goal.py | 0 tools/testing/selftests/damon/damos_tried_regions.py | 0 tools/testing/selftests/damon/debugfs_target_ids_pid_leak.sh | 0 .../damon/debugfs_target_ids_read_before_terminate_race.sh | 0 .../selftests/damon/sysfs_update_schemes_tried_regions_hang.py | 0 .../damon/sysfs_update_schemes_tried_regions_wss_estimation.py | 0 9 files changed, 0 insertions(+), 0 deletions(-) mode change 100644 => 100755 tools/testing/selftests/damon/damon_nr_regions.py mode change 100644 => 100755 tools/testing/selftests/damon/damos_apply_interval.py mode change 100644 => 100755 tools/testing/selftests/damon/damos_quota.py mode change 100644 => 100755 tools/testing/selftests/damon/damos_quota_goal.py mode change 100644 => 100755 tools/testing/selftests/damon/damos_tried_regions.py mode change 100644 => 100755 tools/testing/selftests/damon/debugfs_target_ids_pid_leak.sh mode change 100644 => 100755 tools/testing/selftests/damon/debugfs_target_ids_read_before_terminate_race.sh mode change 100644 => 100755 tools/testing/selftests/damon/sysfs_update_schemes_tried_regions_hang.py mode change 100644 => 100755 tools/testing/selftests/damon/sysfs_update_schemes_tried_regions_wss_estimation.py diff --git a/tools/testing/selftests/damon/damon_nr_regions.py b/tools/testing/selftests/damon/damon_nr_regions.py old mode 100644 new mode 100755 diff --git a/tools/testing/selftests/damon/damos_apply_interval.py b/tools/testing/selftests/damon/damos_apply_interval.py old mode 100644 new mode 100755 diff --git a/tools/testing/selftests/damon/damos_quota.py b/tools/testing/selftests/damon/damos_quota.py old mode 100644 new mode 100755 diff --git a/tools/testing/selftests/damon/damos_quota_goal.py b/tools/testing/selftests/damon/damos_quota_goal.py old mode 100644 new mode 100755 diff --git a/tools/testing/selftests/damon/damos_tried_regions.py b/tools/testing/selftests/damon/damos_tried_regions.py old mode 100644 new mode 100755 diff --git a/tools/testing/selftests/damon/debugfs_target_ids_pid_leak.sh b/tools/testing/selftests/damon/debugfs_target_ids_pid_leak.sh old mode 100644 new mode 100755 diff --git a/tools/testing/selftests/damon/debugfs_target_ids_read_before_terminate_race.sh b/tools/testing/selftests/damon/debugfs_target_ids_read_before_terminate_race.sh old mode 100644 new mode 100755 diff --git a/tools/testing/selftests/damon/sysfs_update_schemes_tried_regions_hang.py b/tools/testing/selftests/damon/sysfs_update_schemes_tried_regions_hang.py old mode 100644 new mode 100755 diff --git a/tools/testing/selftests/damon/sysfs_update_schemes_tried_regions_wss_estimation.py b/tools/testing/selftests/damon/sysfs_update_schemes_tried_regions_wss_estimation.py old mode 100644 new mode 100755 From 9fcce7e7be38fc1da7673d2d69d72ab5c66d154e Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 26 Aug 2024 20:03:31 -0700 Subject: [PATCH 283/416] mm/damon/core-test: test only vaddr case on ops registration test DAMON ops registration kunit test tests both vaddr and paddr use cases in parts of the whole test cases. Basically testing only one ops use case is enough. Do the test with only vaddr use case. 
Link: https://lkml.kernel.org/r/20240827030336.7930-5-sj@kernel.org Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Gow Signed-off-by: Andrew Morton --- mm/damon/core-test.h | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/mm/damon/core-test.h b/mm/damon/core-test.h index 0cee634f3544..ef36d586d6ee 100644 --- a/mm/damon/core-test.h +++ b/mm/damon/core-test.h @@ -246,16 +246,12 @@ static void damon_test_split_regions_of(struct kunit *test) static void damon_test_ops_registration(struct kunit *test) { struct damon_ctx *c = damon_new_ctx(); - struct damon_operations ops, bak; + struct damon_operations ops = {.id = DAMON_OPS_VADDR}, bak; - /* DAMON_OPS_{V,P}ADDR are registered on subsys_initcall */ + /* DAMON_OPS_VADDR is registered on subsys_initcall */ KUNIT_EXPECT_EQ(test, damon_select_ops(c, DAMON_OPS_VADDR), 0); - KUNIT_EXPECT_EQ(test, damon_select_ops(c, DAMON_OPS_PADDR), 0); /* Double-registration is prohibited */ - ops.id = DAMON_OPS_VADDR; - KUNIT_EXPECT_EQ(test, damon_register_ops(&ops), -EINVAL); - ops.id = DAMON_OPS_PADDR; KUNIT_EXPECT_EQ(test, damon_register_ops(&ops), -EINVAL); /* Unknown ops id cannot be registered */
From e43772dcdf21a8cdf3666ee35b2a7166ad39b255 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 26 Aug 2024 20:03:32 -0700 Subject: [PATCH 284/416] mm/damon/core-test: fix damon_test_ops_registration() for DAMON_VADDR unset case
The DAMON core kunit test can be executed without CONFIG_DAMON_VADDR. In that case, the vaddr DAMON ops is not registered. Meanwhile, the ops registration kunit test assumes the vaddr ops is registered. Check and handle the case by registering fake vaddr ops inside the test code.
Link: https://lkml.kernel.org/r/20240827030336.7930-6-sj@kernel.org Fixes: 4f540f5ab4f2 ("mm/damon/core-test: add a kunit test case for ops registration") Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Gow Signed-off-by: Andrew Morton --- mm/damon/core-test.h | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/mm/damon/core-test.h b/mm/damon/core-test.h index ef36d586d6ee..ae03df71737e 100644 --- a/mm/damon/core-test.h +++ b/mm/damon/core-test.h @@ -247,8 +247,16 @@ static void damon_test_ops_registration(struct kunit *test) { struct damon_ctx *c = damon_new_ctx(); struct damon_operations ops = {.id = DAMON_OPS_VADDR}, bak; + bool need_cleanup = false; - /* DAMON_OPS_VADDR is registered on subsys_initcall */ + /* DAMON_OPS_VADDR is registered only if CONFIG_DAMON_VADDR is set */ + if (!damon_is_registered_ops(DAMON_OPS_VADDR)) { + bak.id = DAMON_OPS_VADDR; + KUNIT_EXPECT_EQ(test, damon_register_ops(&bak), 0); + need_cleanup = true; + } + + /* DAMON_OPS_VADDR is ensured to be registered */ KUNIT_EXPECT_EQ(test, damon_select_ops(c, DAMON_OPS_VADDR), 0); /* Double-registration is prohibited */ @@ -274,6 +282,13 @@ static void damon_test_ops_registration(struct kunit *test) KUNIT_EXPECT_EQ(test, damon_register_ops(&ops), -EINVAL); damon_destroy_ctx(c); + + if (need_cleanup) { + mutex_lock(&damon_ops_lock); + damon_registered_ops[DAMON_OPS_VADDR] = + (struct damon_operations){}; + mutex_unlock(&damon_ops_lock); + } } static void damon_test_set_regions(struct kunit *test)
From 8e34bac5a268b2fb3a88231a9b94cada8adcc8c0 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 26 Aug 2024 20:03:33 -0700 Subject: [PATCH 285/416] mm/damon/dbgfs-test: skip dbgfs_set_targets() test if PADDR is not registered
The test depends on registration of DAMON_OPS_PADDR.
It would be registered only when CONFIG_DAMON_PADDR is set. DAMON core kunit tests do fake ops registration for such a case. However, the functions for such fake ops registration are not available to the DAMON debugfs interface. Just skip the test in that case.
Link: https://lkml.kernel.org/r/20240827030336.7930-7-sj@kernel.org Fixes: 999b9467974f ("mm/damon/dbgfs-test: fix is_target_id() change") Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Gow Signed-off-by: Andrew Morton --- mm/damon/dbgfs-test.h | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/mm/damon/dbgfs-test.h b/mm/damon/dbgfs-test.h index 2d85217f5ba4..9bd5dca5d4ad 100644 --- a/mm/damon/dbgfs-test.h +++ b/mm/damon/dbgfs-test.h @@ -73,6 +73,11 @@ static void damon_dbgfs_test_set_targets(struct kunit *test) struct damon_ctx *ctx = dbgfs_new_ctx(); char buf[64]; + if (!damon_is_registered_ops(DAMON_OPS_PADDR)) { + dbgfs_destroy_ctx(ctx); + kunit_skip(test, "PADDR not registered"); + } + /* Make DAMON consider target has no pid */ damon_select_ops(ctx, DAMON_OPS_PADDR);
From 61879eed1f180ff92c2324b24d06eeddcc9256bd Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 26 Aug 2024 20:03:34 -0700 Subject: [PATCH 286/416] mm/damon/dbgfs-test: skip dbgfs_set_init_regions() test if PADDR is not registered
The test depends on registration of DAMON_OPS_PADDR. It would be registered only when CONFIG_DAMON_PADDR is set. DAMON core kunit tests do fake ops registration for such a case. However, the functions for such fake ops registration are not available to the DAMON debugfs interface. Just skip the test in that case.
Link: https://lkml.kernel.org/r/20240827030336.7930-8-sj@kernel.org Fixes: 999b9467974f ("mm/damon/dbgfs-test: fix is_target_id() change") Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Gow Signed-off-by: Andrew Morton --- mm/damon/dbgfs-test.h | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/mm/damon/dbgfs-test.h b/mm/damon/dbgfs-test.h index 9bd5dca5d4ad..d2ecfcc8db86 100644 --- a/mm/damon/dbgfs-test.h +++ b/mm/damon/dbgfs-test.h @@ -116,6 +116,11 @@ static void damon_dbgfs_test_set_init_regions(struct kunit *test) int i, rc; char buf[256]; + if (!damon_is_registered_ops(DAMON_OPS_PADDR)) { + damon_destroy_ctx(ctx); + kunit_skip(test, "PADDR not registered"); + } + damon_select_ops(ctx, DAMON_OPS_PADDR); dbgfs_set_targets(ctx, 3, NULL);
From 9bfbaa5e44c52422a046ce291469c8ebeb6c475d Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 26 Aug 2024 20:03:35 -0700 Subject: [PATCH 287/416] mm/damon: move kunit tests to tests/ subdirectory with _kunit suffix
There was a discussion about better places for kunit test code[1] and test file name suffix[2]. Following the conclusion, move kunit tests for DAMON to mm/damon/tests/ subdirectory and rename those.
[1] https://lore.kernel.org/CABVgOS=pUdWb6NDHszuwb1HYws4a1-b1UmN=i8U_ED7HbDT0mg@mail.gmail.com [2] https://lore.kernel.org/CABVgOSmKwPq7JEpHfS6sbOwsR0B-DBDk_JP-ZD9s9ZizvpUjbQ@mail.gmail.com Link: https://lkml.kernel.org/r/20240827030336.7930-9-sj@kernel.org Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Gow Signed-off-by: Andrew Morton --- mm/damon/core.c | 2 +- mm/damon/dbgfs.c | 2 +- mm/damon/sysfs.c | 2 +- mm/damon/{core-test.h => tests/core-kunit.h} | 0 mm/damon/{dbgfs-test.h => tests/dbgfs-kunit.h} | 0 mm/damon/{sysfs-test.h => tests/sysfs-kunit.h} | 0 mm/damon/{vaddr-test.h => tests/vaddr-kunit.h} | 0 mm/damon/vaddr.c | 2 +- 8 files changed, 4 insertions(+), 4 deletions(-) rename mm/damon/{core-test.h => tests/core-kunit.h} (100%) rename mm/damon/{dbgfs-test.h => tests/dbgfs-kunit.h} (100%) rename mm/damon/{sysfs-test.h => tests/sysfs-kunit.h} (100%) rename mm/damon/{vaddr-test.h => tests/vaddr-kunit.h} (100%) diff --git a/mm/damon/core.c b/mm/damon/core.c index 7a87628b76ab..086294279b07 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -2205,4 +2205,4 @@ static int __init damon_init(void) subsys_initcall(damon_init); -#include "core-test.h" +#include "tests/core-kunit.h" diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index 51a6f1cac385..b4213bc47e44 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -1145,4 +1145,4 @@ out: module_init(damon_dbgfs_init); -#include "dbgfs-test.h" +#include "tests/dbgfs-kunit.h" diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c index cffc755e7775..58145d59881d 100644 --- a/mm/damon/sysfs.c +++ b/mm/damon/sysfs.c @@ -1882,4 +1882,4 @@ out: } subsys_initcall(damon_sysfs_init); -#include "sysfs-test.h" +#include "tests/sysfs-kunit.h" diff --git a/mm/damon/core-test.h b/mm/damon/tests/core-kunit.h similarity index 100% rename from mm/damon/core-test.h rename to mm/damon/tests/core-kunit.h diff --git a/mm/damon/dbgfs-test.h b/mm/damon/tests/dbgfs-kunit.h similarity index 100% rename from mm/damon/dbgfs-test.h rename to mm/damon/tests/dbgfs-kunit.h diff --git a/mm/damon/sysfs-test.h b/mm/damon/tests/sysfs-kunit.h similarity index 100% rename from mm/damon/sysfs-test.h rename to mm/damon/tests/sysfs-kunit.h diff --git a/mm/damon/vaddr-test.h b/mm/damon/tests/vaddr-kunit.h similarity index 100% rename from mm/damon/vaddr-test.h rename to mm/damon/tests/vaddr-kunit.h diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index 58829baf8b5d..b0e8b361891d 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -730,4 +730,4 @@ static int __init damon_va_initcall(void) subsys_initcall(damon_va_initcall); -#include "vaddr-test.h" +#include "tests/vaddr-kunit.h" From f66ac836d4b97df8615b0fd09ce0b470e66d1d59 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 26 Aug 2024 20:03:36 -0700 Subject: [PATCH 288/416] mm/damon/tests: add .kunitconfig file for DAMON kunit tests '--kunitconfig' option of 'kunit.py run' supports '.kunitconfig' file name convention. Add the file for DAMON kunit tests for more convenient kunit run. 
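For reference, a typical invocation then simply points 'kunit.py run' at the new file, e.g. './tools/testing/kunit/kunit.py run --kunitconfig=mm/damon/tests/.kunitconfig' from the top of the kernel tree. (Usage sketch only; the '--kunitconfig' option is the one referenced above, and the wrapper script path is the usual in-tree location rather than something introduced by this patch.)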
Link: https://lkml.kernel.org/r/20240827030336.7930-10-sj@kernel.org Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Gow Signed-off-by: Andrew Morton --- mm/damon/tests/.kunitconfig | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) create mode 100644 mm/damon/tests/.kunitconfig diff --git a/mm/damon/tests/.kunitconfig b/mm/damon/tests/.kunitconfig new file mode 100644 index 000000000000..a73be044fc9b --- /dev/null +++ b/mm/damon/tests/.kunitconfig @@ -0,0 +1,22 @@ +# for DAMON core +CONFIG_KUNIT=y +CONFIG_DAMON=y +CONFIG_DAMON_KUNIT_TEST=y + +# for DAMON vaddr ops +CONFIG_MMU=y +CONFIG_PAGE_IDLE_FLAG=y +CONFIG_DAMON_VADDR=y +CONFIG_DAMON_VADDR_KUNIT_TEST=y + +# for DAMON sysfs interface +CONFIG_SYSFS=y +CONFIG_DAMON_SYSFS=y +CONFIG_DAMON_SYSFS_KUNIT_TEST=y + +# for DAMON debugfs interface +CONFIG_DEBUG_FS=y +CONFIG_DAMON_PADDR=y +CONFIG_DAMON_DBGFS_DEPRECATED=y +CONFIG_DAMON_DBGFS=y +CONFIG_DAMON_DBGFS_KUNIT_TEST=y From b62b51d2d1593e6590669e781768c3c5a7394eb5 Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Tue, 27 Aug 2024 19:47:24 +0800 Subject: [PATCH 289/416] mm: memory_hotplug: remove head variable in do_migrate_range() Patch series "mm: memory_hotplug: improve do_migrate_range()", v3. Unify hwpoisoned page handling and isolation of HugeTLB/LRU/non-LRU movable page, also convert to use folios in do_migrate_range(). This patch (of 5): Directly use a folio for HugeTLB and THP when calculate the next pfn, then remove unused head variable. Link: https://lkml.kernel.org/r/20240827114728.3212578-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240827114728.3212578-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Reviewed-by: Miaohe Lin Acked-by: David Hildenbrand Cc: Dan Carpenter Cc: Jonathan Cameron Cc: Naoya Horiguchi Cc: Oscar Salvador Signed-off-by: Andrew Morton --- mm/memory_hotplug.c | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index df291f2e509d..3f8a109d5129 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1773,7 +1773,7 @@ found: static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) { unsigned long pfn; - struct page *page, *head; + struct page *page; LIST_HEAD(source); static DEFINE_RATELIMIT_STATE(migrate_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); @@ -1786,14 +1786,20 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) continue; page = pfn_to_page(pfn); folio = page_folio(page); - head = &folio->page; - if (PageHuge(page)) { - pfn = page_to_pfn(head) + compound_nr(head) - 1; - isolate_hugetlb(folio, &source); - continue; - } else if (PageTransHuge(page)) - pfn = page_to_pfn(head) + thp_nr_pages(page) - 1; + /* + * No reference or lock is held on the folio, so it might + * be modified concurrently (e.g. split). As such, + * folio_nr_pages() may read garbage. This is fine as the outer + * loop will revisit the split folio later. 
+ */ + if (folio_test_large(folio)) { + pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; + if (folio_test_hugetlb(folio)) { + isolate_hugetlb(folio, &source); + continue; + } + } /* * HWPoison pages have elevated reference counts so the migration would From 16038c4fffd802cf5bc7391fedfe8228dd0adf53 Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Tue, 27 Aug 2024 19:47:25 +0800 Subject: [PATCH 290/416] mm: memory-failure: add unmap_poisoned_folio() Add unmap_poisoned_folio() helper which will be reused by do_migrate_range() from memory hotplug soon. [akpm@linux-foundation.org: whitespace tweak, per Miaohe Lin] Link: https://lkml.kernel.org/r/1f80c7e3-c30d-1ac1-6a36-d1a5f5907f7c@huawei.com Link: https://lkml.kernel.org/r/20240827114728.3212578-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Acked-by: David Hildenbrand Acked-by: Miaohe Lin Cc: Dan Carpenter Cc: Jonathan Cameron Cc: Naoya Horiguchi Cc: Oscar Salvador Signed-off-by: Andrew Morton --- mm/internal.h | 8 ++++++++ mm/memory-failure.c | 44 +++++++++++++++++++++++++++----------------- 2 files changed, 35 insertions(+), 17 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 5e6f2abcea28..b00ea4595d18 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1048,6 +1048,8 @@ static inline int find_next_best_node(int node, nodemask_t *used_node_mask) /* * mm/memory-failure.c */ +#ifdef CONFIG_MEMORY_FAILURE +void unmap_poisoned_folio(struct folio *folio, enum ttu_flags ttu); void shake_folio(struct folio *folio); extern int hwpoison_filter(struct page *p); @@ -1068,6 +1070,12 @@ void add_to_kill_ksm(struct task_struct *tsk, struct page *p, unsigned long ksm_addr); unsigned long page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); +#else +static inline void unmap_poisoned_folio(struct folio *folio, enum ttu_flags ttu) +{ +} +#endif + extern unsigned long __must_check vm_mmap_pgoff(struct file *, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 7066fc84f351..1f4f31e6d91d 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1554,6 +1554,32 @@ static int get_hwpoison_page(struct page *p, unsigned long flags) return ret; } +void unmap_poisoned_folio(struct folio *folio, enum ttu_flags ttu) +{ + if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) { + struct address_space *mapping; + + /* + * For hugetlb folios in shared mappings, try_to_unmap + * could potentially call huge_pmd_unshare. Because of + * this, take semaphore in write mode here and set + * TTU_RMAP_LOCKED to indicate we have taken the lock + * at this higher level. + */ + mapping = hugetlb_folio_mapping_lock_write(folio); + if (!mapping) { + pr_info("%#lx: could not lock mapping for mapped hugetlb folio\n", + folio_pfn(folio)); + return; + } + + try_to_unmap(folio, ttu|TTU_RMAP_LOCKED); + i_mmap_unlock_write(mapping); + } else { + try_to_unmap(folio, ttu); + } +} + /* * Do all that is necessary to remove user space mappings. Unmap * the pages and send SIGBUS to the processes if the data was dirty. @@ -1615,23 +1641,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p, */ collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED); - if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) { - /* - * For hugetlb pages in shared mappings, try_to_unmap - * could potentially call huge_pmd_unshare. 
Because of - * this, take semaphore in write mode here and set - * TTU_RMAP_LOCKED to indicate we have taken the lock - * at this higher level. - */ - mapping = hugetlb_folio_mapping_lock_write(folio); - if (mapping) { - try_to_unmap(folio, ttu|TTU_RMAP_LOCKED); - i_mmap_unlock_write(mapping); - } else - pr_info("%#lx: could not lock mapping for mapped huge page\n", pfn); - } else { - try_to_unmap(folio, ttu); - } + unmap_poisoned_folio(folio, ttu); unmap_success = !folio_mapped(folio); if (!unmap_success) From e8a796fa1c16d4792c674dfb0c9dc86f48a00027 Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Tue, 27 Aug 2024 19:47:26 +0800 Subject: [PATCH 291/416] mm: memory_hotplug: check hwpoisoned page firstly in do_migrate_range() Commit b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined") don't handle the hugetlb pages, the endless loop still occur if offline a hwpoison hugetlb page, luckly, after the commit e591ef7d96d6 ("mm, hwpoison,hugetlb,memory_hotplug: hotremove memory section with hwpoisoned hugepage"), the HPageMigratable of hugetlb page will be cleared, and the hwpoison hugetlb page will be skipped in scan_movable_pages(), so the endless loop issue is fixed. However if the HPageMigratable() check passed(without reference and lock), the hugetlb page may be hwpoisoned, it won't cause issue since the hwpoisoned page will be handled correctly in the next movable pages scan loop, and it will be isolated in do_migrate_range() but fails to migrate. In order to avoid the unnecessary isolation and unify all hwpoisoned page handling, let's unconditionally check hwpoison firstly, and if it is a hwpoisoned hugetlb page, try to unmap it as the catch all safety net like normal page does. Link: https://lkml.kernel.org/r/20240827114728.3212578-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Acked-by: David Hildenbrand Reviewed-by: Miaohe Lin Cc: Dan Carpenter Cc: Jonathan Cameron Cc: Naoya Horiguchi Cc: Oscar Salvador Signed-off-by: Andrew Morton --- mm/memory_hotplug.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 3f8a109d5129..6bdac4cddd9f 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1793,26 +1793,26 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) * folio_nr_pages() may read garbage. This is fine as the outer * loop will revisit the split folio later. */ - if (folio_test_large(folio)) { + if (folio_test_large(folio)) pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1; - if (folio_test_hugetlb(folio)) { - isolate_hugetlb(folio, &source); - continue; - } - } /* * HWPoison pages have elevated reference counts so the migration would * fail on them. It also doesn't make any sense to migrate them in the * first place. Still try to unmap such a page in case it is still mapped - * (e.g. current hwpoison implementation doesn't unmap KSM pages but keep - * the unmap as the catch all safety net). + * (keep the unmap as the catch all safety net). 
*/ - if (PageHWPoison(page)) { + if (folio_test_hwpoison(folio) || + (folio_test_large(folio) && folio_test_has_hwpoisoned(folio))) { if (WARN_ON(folio_test_lru(folio))) folio_isolate_lru(folio); if (folio_mapped(folio)) - try_to_unmap(folio, TTU_IGNORE_MLOCK); + unmap_poisoned_folio(folio, TTU_IGNORE_MLOCK); + continue; + } + + if (folio_test_hugetlb(folio)) { + isolate_hugetlb(folio, &source); continue; } From f1264e9531b0fbadf88e0b9c8143c546343b481c Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Tue, 27 Aug 2024 19:47:27 +0800 Subject: [PATCH 292/416] mm: migrate: add isolate_folio_to_list() Add isolate_folio_to_list() helper to try to isolate HugeTLB, no-LRU movable and LRU folios to a list, which will be reused by do_migrate_range() from memory hotplug soon, also drop the mf_isolate_folio() since we could directly use new helper in the soft_offline_in_use_page(). Link: https://lkml.kernel.org/r/20240827114728.3212578-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Acked-by: David Hildenbrand Acked-by: Miaohe Lin Tested-by: Miaohe Lin Cc: Dan Carpenter Cc: Jonathan Cameron Cc: Naoya Horiguchi Cc: Oscar Salvador Signed-off-by: Andrew Morton --- include/linux/migrate.h | 3 +++ mm/memory-failure.c | 48 +++++++++++------------------------------ mm/migrate.c | 26 ++++++++++++++++++++++ 3 files changed, 42 insertions(+), 35 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 644be30b69c8..002e49b2ebd9 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -70,6 +70,7 @@ int migrate_pages(struct list_head *l, new_folio_t new, free_folio_t free, unsigned int *ret_succeeded); struct folio *alloc_migration_target(struct folio *src, unsigned long private); bool isolate_movable_page(struct page *page, isolate_mode_t mode); +bool isolate_folio_to_list(struct folio *folio, struct list_head *list); int migrate_huge_page_move_mapping(struct address_space *mapping, struct folio *dst, struct folio *src); @@ -91,6 +92,8 @@ static inline struct folio *alloc_migration_target(struct folio *src, { return NULL; } static inline bool isolate_movable_page(struct page *page, isolate_mode_t mode) { return false; } +static inline bool isolate_folio_to_list(struct folio *folio, struct list_head *list) + { return false; } static inline int migrate_huge_page_move_mapping(struct address_space *mapping, struct folio *dst, struct folio *src) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 1f4f31e6d91d..96ce31e5a203 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -2653,40 +2653,6 @@ EXPORT_SYMBOL(unpoison_memory); #undef pr_fmt #define pr_fmt(fmt) "Soft offline: " fmt -static bool mf_isolate_folio(struct folio *folio, struct list_head *pagelist) -{ - bool isolated = false; - - if (folio_test_hugetlb(folio)) { - isolated = isolate_hugetlb(folio, pagelist); - } else { - bool lru = !__folio_test_movable(folio); - - if (lru) - isolated = folio_isolate_lru(folio); - else - isolated = isolate_movable_page(&folio->page, - ISOLATE_UNEVICTABLE); - - if (isolated) { - list_add(&folio->lru, pagelist); - if (lru) - node_stat_add_folio(folio, NR_ISOLATED_ANON + - folio_is_file_lru(folio)); - } - } - - /* - * If we succeed to isolate the folio, we grabbed another refcount on - * the folio, so we can safely drop the one we got from get_any_page(). - * If we failed to isolate the folio, it means that we cannot go further - * and we will return an error, so drop the reference we got from - * get_any_page() as well. 
- */ - folio_put(folio); - return isolated; -} - /* * soft_offline_in_use_page handles hugetlb-pages and non-hugetlb pages. * If the page is a non-dirty unmapped page-cache page, it simply invalidates. @@ -2699,6 +2665,7 @@ static int soft_offline_in_use_page(struct page *page) struct folio *folio = page_folio(page); char const *msg_page[] = {"page", "hugepage"}; bool huge = folio_test_hugetlb(folio); + bool isolated; LIST_HEAD(pagelist); struct migration_target_control mtc = { .nid = NUMA_NO_NODE, @@ -2738,7 +2705,18 @@ static int soft_offline_in_use_page(struct page *page) return 0; } - if (mf_isolate_folio(folio, &pagelist)) { + isolated = isolate_folio_to_list(folio, &pagelist); + + /* + * If we succeed to isolate the folio, we grabbed another refcount on + * the folio, so we can safely drop the one we got from get_any_page(). + * If we failed to isolate the folio, it means that we cannot go further + * and we will return an error, so drop the reference we got from + * get_any_page() as well. + */ + folio_put(folio); + + if (isolated) { ret = migrate_pages(&pagelist, alloc_migration_target, NULL, (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE, NULL); if (!ret) { diff --git a/mm/migrate.c b/mm/migrate.c index 2250baa85ac2..ba4893d42618 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -178,6 +178,32 @@ void putback_movable_pages(struct list_head *l) } } +/* Must be called with an elevated refcount on the non-hugetlb folio */ +bool isolate_folio_to_list(struct folio *folio, struct list_head *list) +{ + bool isolated, lru; + + if (folio_test_hugetlb(folio)) + return isolate_hugetlb(folio, list); + + lru = !__folio_test_movable(folio); + if (lru) + isolated = folio_isolate_lru(folio); + else + isolated = isolate_movable_page(&folio->page, + ISOLATE_UNEVICTABLE); + + if (!isolated) + return false; + + list_add(&folio->lru, list); + if (lru) + node_stat_add_folio(folio, NR_ISOLATED_ANON + + folio_is_file_lru(folio)); + + return true; +} + /* * Restore a potential migration pte to a working pte entry */ From 6f1833b8208c3b9e59eff10792667b6639365146 Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Tue, 27 Aug 2024 19:47:28 +0800 Subject: [PATCH 293/416] mm: memory_hotplug: unify Huge/LRU/non-LRU movable folio isolation Use the isolate_folio_to_list() to unify hugetlb/LRU/non-LRU folio isolation, which cleanup code a bit and save a few calls to compound_head(). 
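As background for the pfn walk below, a minimal sketch of the reference-then-recheck pattern it now relies on (the helper name is made up for illustration and does not exist in the patch; it assumes the usual mm headers):

static struct folio *grab_folio_from_pfn(unsigned long pfn)
{
	struct page *page = pfn_to_page(pfn);
	struct folio *folio = page_folio(page);

	if (!folio_try_get(folio))
		return NULL;

	/*
	 * Without a reference the folio could have been split or freed
	 * and the page reassigned, so re-check after taking the ref.
	 */
	if (unlikely(page_folio(page) != folio)) {
		folio_put(folio);
		return NULL;
	}

	return folio;
}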
[wangkefeng.wang@huawei.com: various fixes] Link: https://lkml.kernel.org/r/20240829150500.2599549-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240827114728.3212578-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Reviewed-by: Miaohe Lin Cc: Dan Carpenter Cc: David Hildenbrand Cc: Jonathan Cameron Cc: Naoya Horiguchi Cc: Oscar Salvador Signed-off-by: Andrew Morton --- mm/memory_hotplug.c | 43 +++++++++++++++---------------------------- 1 file changed, 15 insertions(+), 28 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 6bdac4cddd9f..4c6e3b181204 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1772,15 +1772,14 @@ found: static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) { + struct folio *folio; unsigned long pfn; - struct page *page; LIST_HEAD(source); static DEFINE_RATELIMIT_STATE(migrate_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); for (pfn = start_pfn; pfn < end_pfn; pfn++) { - struct folio *folio; - bool isolated; + struct page *page; if (!pfn_valid(pfn)) continue; @@ -1811,34 +1810,21 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) continue; } - if (folio_test_hugetlb(folio)) { - isolate_hugetlb(folio, &source); + if (!folio_try_get(folio)) continue; - } - if (!get_page_unless_zero(page)) - continue; - /* - * We can skip free pages. And we can deal with pages on - * LRU and non-lru movable pages. - */ - if (PageLRU(page)) - isolated = isolate_lru_page(page); - else - isolated = isolate_movable_page(page, ISOLATE_UNEVICTABLE); - if (isolated) { - list_add_tail(&page->lru, &source); - if (!__PageMovable(page)) - inc_node_page_state(page, NR_ISOLATED_ANON + - page_is_file_lru(page)); + if (unlikely(page_folio(page) != folio)) + goto put_folio; - } else { + if (!isolate_folio_to_list(folio, &source)) { if (__ratelimit(&migrate_rs)) { - pr_warn("failed to isolate pfn %lx\n", pfn); + pr_warn("failed to isolate pfn %lx\n", + page_to_pfn(page)); dump_page(page, "isolation failed"); } } - put_page(page); +put_folio: + folio_put(folio); } if (!list_empty(&source)) { nodemask_t nmask = node_states[N_MEMORY]; @@ -1853,7 +1839,7 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) * We have checked that migration range is on a single zone so * we can use the nid of the first page to all the others. */ - mtc.nid = page_to_nid(list_first_entry(&source, struct page, lru)); + mtc.nid = folio_nid(list_first_entry(&source, struct folio, lru)); /* * try to allocate from a different node but reuse this node @@ -1866,11 +1852,12 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) ret = migrate_pages(&source, alloc_migration_target, NULL, (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG, NULL); if (ret) { - list_for_each_entry(page, &source, lru) { + list_for_each_entry(folio, &source, lru) { if (__ratelimit(&migrate_rs)) { pr_warn("migrating pfn %lx failed ret:%d\n", - page_to_pfn(page), ret); - dump_page(page, "migration failure"); + folio_pfn(folio), ret); + dump_page(&folio->page, + "migration failure"); } } putback_movable_pages(&source); From 246d3aa3e53151fa150f10257ddd8a4facd31a6a Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Thu, 8 Aug 2024 12:18:46 +0100 Subject: [PATCH 294/416] mm: cleanup count_mthp_stat() definition Patch series "Shmem mTHP controls and stats improvements", v3. This is a small series to tidy up the way the shmem controls and stats are exposed. 
These patches were previously part of the series at [2], but I decided to split them out since they can go in independently. This patch (of 2): Let's move count_mthp_stat() so that it's always defined, even when THP is disabled. Previously uses of the function in files such as shmem.c, which are compiled even when THP is disabled, required ugly THP ifdeferry. With this cleanup, we can remove those ifdefs and the function resolves to a nop when THP is disabled. I shortly plan to call count_mthp_stat() from more THP-invariant source files. Link: https://lkml.kernel.org/r/20240808111849.651867-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240808111849.651867-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts Acked-by: Barry Song Reviewed-by: Baolin Wang Reviewed-by: Lance Yang Acked-by: David Hildenbrand Cc: Gavin Shan Cc: Hugh Dickins Cc: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- include/linux/huge_mm.h | 70 ++++++++++++++++++++--------------------- mm/memory.c | 2 -- mm/shmem.c | 6 ---- 3 files changed, 35 insertions(+), 43 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 6370026689e0..4c32058cacfe 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -114,6 +114,41 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; #define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1)) #define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT) +enum mthp_stat_item { + MTHP_STAT_ANON_FAULT_ALLOC, + MTHP_STAT_ANON_FAULT_FALLBACK, + MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE, + MTHP_STAT_SWPOUT, + MTHP_STAT_SWPOUT_FALLBACK, + MTHP_STAT_SHMEM_ALLOC, + MTHP_STAT_SHMEM_FALLBACK, + MTHP_STAT_SHMEM_FALLBACK_CHARGE, + MTHP_STAT_SPLIT, + MTHP_STAT_SPLIT_FAILED, + MTHP_STAT_SPLIT_DEFERRED, + __MTHP_STAT_COUNT +}; + +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SYSFS) +struct mthp_stat { + unsigned long stats[ilog2(MAX_PTRS_PER_PTE) + 1][__MTHP_STAT_COUNT]; +}; + +DECLARE_PER_CPU(struct mthp_stat, mthp_stats); + +static inline void count_mthp_stat(int order, enum mthp_stat_item item) +{ + if (order <= 0 || order > PMD_ORDER) + return; + + this_cpu_inc(mthp_stats.stats[order][item]); +} +#else +static inline void count_mthp_stat(int order, enum mthp_stat_item item) +{ +} +#endif + #ifdef CONFIG_TRANSPARENT_HUGEPAGE extern unsigned long transparent_hugepage_flags; @@ -269,41 +304,6 @@ struct thpsize { #define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj) -enum mthp_stat_item { - MTHP_STAT_ANON_FAULT_ALLOC, - MTHP_STAT_ANON_FAULT_FALLBACK, - MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE, - MTHP_STAT_SWPOUT, - MTHP_STAT_SWPOUT_FALLBACK, - MTHP_STAT_SHMEM_ALLOC, - MTHP_STAT_SHMEM_FALLBACK, - MTHP_STAT_SHMEM_FALLBACK_CHARGE, - MTHP_STAT_SPLIT, - MTHP_STAT_SPLIT_FAILED, - MTHP_STAT_SPLIT_DEFERRED, - __MTHP_STAT_COUNT -}; - -struct mthp_stat { - unsigned long stats[ilog2(MAX_PTRS_PER_PTE) + 1][__MTHP_STAT_COUNT]; -}; - -#ifdef CONFIG_SYSFS -DECLARE_PER_CPU(struct mthp_stat, mthp_stats); - -static inline void count_mthp_stat(int order, enum mthp_stat_item item) -{ - if (order <= 0 || order > PMD_ORDER) - return; - - this_cpu_inc(mthp_stats.stats[order][item]); -} -#else -static inline void count_mthp_stat(int order, enum mthp_stat_item item) -{ -} -#endif - #define transparent_hugepage_use_zero_page() \ (transparent_hugepage_flags & \ (1<vm_mm, MM_ANONPAGES, nr_pages); -#ifdef CONFIG_TRANSPARENT_HUGEPAGE count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC); -#endif folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE); 
folio_add_lru_vma(folio, vma); setpte: diff --git a/mm/shmem.c b/mm/shmem.c index 866d46d0c43d..800cec9dc534 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1808,9 +1808,7 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf, if (pages == HPAGE_PMD_NR) count_vm_event(THP_FILE_FALLBACK); -#ifdef CONFIG_TRANSPARENT_HUGEPAGE count_mthp_stat(order, MTHP_STAT_SHMEM_FALLBACK); -#endif order = next_order(&suitable_orders, order); } } else { @@ -1835,10 +1833,8 @@ allocated: count_vm_event(THP_FILE_FALLBACK); count_vm_event(THP_FILE_FALLBACK_CHARGE); } -#ifdef CONFIG_TRANSPARENT_HUGEPAGE count_mthp_stat(folio_order(folio), MTHP_STAT_SHMEM_FALLBACK); count_mthp_stat(folio_order(folio), MTHP_STAT_SHMEM_FALLBACK_CHARGE); -#endif } goto unlock; } @@ -2332,9 +2328,7 @@ repeat: if (!IS_ERR(folio)) { if (folio_test_pmd_mappable(folio)) count_vm_event(THP_FILE_ALLOC); -#ifdef CONFIG_TRANSPARENT_HUGEPAGE count_mthp_stat(folio_order(folio), MTHP_STAT_SHMEM_ALLOC); -#endif goto alloced; } if (PTR_ERR(folio) == -EEXIST) From 70e59a75283bdcc49c8ee104c2d49a22e4912305 Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Thu, 8 Aug 2024 12:18:47 +0100 Subject: [PATCH 295/416] mm: tidy up shmem mTHP controls and stats Previously we had a situation where shmem mTHP controls and stats were not exposed for some supported sizes and were exposed for some unsupported sizes. So let's clean that up. Anon mTHP can support all large orders [2, PMD_ORDER]. But shmem can support all large orders [1, MAX_PAGECACHE_ORDER]. However, per-size shmem controls and stats were previously being exposed for all the anon mTHP orders, meaning order-1 was not present, and for arm64 64K base pages, orders 12 and 13 were exposed but were not supported internally. Tidy this all up by defining ctrl and stats attribute groups for anon and file separately. Anon ctrl and stats groups are populated for all orders in THP_ORDERS_ALL_ANON and file ctrl and stats groups are populated for all orders in THP_ORDERS_ALL_FILE_DEFAULT. Additionally, create "any" ctrl and stats attribute groups which are populated for all orders in (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE_DEFAULT). swpout stats use this since they apply to anon and shmem. The side-effect of all this is that different hugepage-*kB directories contain different sets of controls and stats, depending on which memory types support that size. This approach is preferred over the alternative, which is to populate dummy controls and stats for memory types that do not support a given size. 
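The selection rule can be summarised by two tiny predicates (illustrative only; the patch open-codes the same tests when creating each hugepages-<size>kB directory): anon groups are added only for orders in THP_ORDERS_ALL_ANON, file groups only for orders in THP_ORDERS_ALL_FILE_DEFAULT, and the "any" groups unconditionally.

static inline bool order_gets_anon_groups(int order)
{
	return BIT(order) & THP_ORDERS_ALL_ANON;
}

static inline bool order_gets_file_groups(int order)
{
	return BIT(order) & THP_ORDERS_ALL_FILE_DEFAULT;
}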
[ryan.roberts@arm.com: file pages and shmem can also be split] Link: https://lkml.kernel.org/r/f7ced14c-8bc5-405f-bee7-94f63980f525@arm.comLink: https://lkml.kernel.org/r/20240808111849.651867-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts Tested-by: Barry Song Reviewed-by: Baolin Wang Cc: David Hildenbrand Cc: Gavin Shan Cc: Hugh Dickins Cc: Lance Yang Cc: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- mm/huge_memory.c | 148 +++++++++++++++++++++++++++++++++++++---------- 1 file changed, 116 insertions(+), 32 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c4b45ad576f6..3e872ee724cc 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -460,8 +460,8 @@ static void thpsize_release(struct kobject *kobj); static DEFINE_SPINLOCK(huge_anon_orders_lock); static LIST_HEAD(thpsize_list); -static ssize_t thpsize_enabled_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) +static ssize_t anon_enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) { int order = to_thpsize(kobj)->order; const char *output; @@ -478,9 +478,9 @@ static ssize_t thpsize_enabled_show(struct kobject *kobj, return sysfs_emit(buf, "%s\n", output); } -static ssize_t thpsize_enabled_store(struct kobject *kobj, - struct kobj_attribute *attr, - const char *buf, size_t count) +static ssize_t anon_enabled_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) { int order = to_thpsize(kobj)->order; ssize_t ret = count; @@ -522,19 +522,35 @@ static ssize_t thpsize_enabled_store(struct kobject *kobj, return ret; } -static struct kobj_attribute thpsize_enabled_attr = - __ATTR(enabled, 0644, thpsize_enabled_show, thpsize_enabled_store); +static struct kobj_attribute anon_enabled_attr = + __ATTR(enabled, 0644, anon_enabled_show, anon_enabled_store); -static struct attribute *thpsize_attrs[] = { - &thpsize_enabled_attr.attr, +static struct attribute *anon_ctrl_attrs[] = { + &anon_enabled_attr.attr, + NULL, +}; + +static const struct attribute_group anon_ctrl_attr_grp = { + .attrs = anon_ctrl_attrs, +}; + +static struct attribute *file_ctrl_attrs[] = { #ifdef CONFIG_SHMEM &thpsize_shmem_enabled_attr.attr, #endif NULL, }; -static const struct attribute_group thpsize_attr_group = { - .attrs = thpsize_attrs, +static const struct attribute_group file_ctrl_attr_grp = { + .attrs = file_ctrl_attrs, +}; + +static struct attribute *any_ctrl_attrs[] = { + NULL, +}; + +static const struct attribute_group any_ctrl_attr_grp = { + .attrs = any_ctrl_attrs, }; static const struct kobj_type thpsize_ktype = { @@ -573,64 +589,132 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK); DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE); DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT); DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK); +#ifdef CONFIG_SHMEM DEFINE_MTHP_STAT_ATTR(shmem_alloc, MTHP_STAT_SHMEM_ALLOC); DEFINE_MTHP_STAT_ATTR(shmem_fallback, MTHP_STAT_SHMEM_FALLBACK); DEFINE_MTHP_STAT_ATTR(shmem_fallback_charge, MTHP_STAT_SHMEM_FALLBACK_CHARGE); +#endif DEFINE_MTHP_STAT_ATTR(split, MTHP_STAT_SPLIT); DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED); DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED); -static struct attribute *stats_attrs[] = { +static struct attribute *anon_stats_attrs[] = { &anon_fault_alloc_attr.attr, &anon_fault_fallback_attr.attr, &anon_fault_fallback_charge_attr.attr, +#ifndef CONFIG_SHMEM &swpout_attr.attr, 
&swpout_fallback_attr.attr, - &shmem_alloc_attr.attr, - &shmem_fallback_attr.attr, - &shmem_fallback_charge_attr.attr, - &split_attr.attr, - &split_failed_attr.attr, +#endif &split_deferred_attr.attr, NULL, }; -static struct attribute_group stats_attr_group = { +static struct attribute_group anon_stats_attr_grp = { .name = "stats", - .attrs = stats_attrs, + .attrs = anon_stats_attrs, }; +static struct attribute *file_stats_attrs[] = { +#ifdef CONFIG_SHMEM + &shmem_alloc_attr.attr, + &shmem_fallback_attr.attr, + &shmem_fallback_charge_attr.attr, +#endif + NULL, +}; + +static struct attribute_group file_stats_attr_grp = { + .name = "stats", + .attrs = file_stats_attrs, +}; + +static struct attribute *any_stats_attrs[] = { +#ifdef CONFIG_SHMEM + &swpout_attr.attr, + &swpout_fallback_attr.attr, +#endif + &split_attr.attr, + &split_failed_attr.attr, + NULL, +}; + +static struct attribute_group any_stats_attr_grp = { + .name = "stats", + .attrs = any_stats_attrs, +}; + +static int sysfs_add_group(struct kobject *kobj, + const struct attribute_group *grp) +{ + int ret = -ENOENT; + + /* + * If the group is named, try to merge first, assuming the subdirectory + * was already created. This avoids the warning emitted by + * sysfs_create_group() if the directory already exists. + */ + if (grp->name) + ret = sysfs_merge_group(kobj, grp); + if (ret) + ret = sysfs_create_group(kobj, grp); + + return ret; +} + static struct thpsize *thpsize_create(int order, struct kobject *parent) { unsigned long size = (PAGE_SIZE << order) / SZ_1K; struct thpsize *thpsize; - int ret; + int ret = -ENOMEM; thpsize = kzalloc(sizeof(*thpsize), GFP_KERNEL); if (!thpsize) - return ERR_PTR(-ENOMEM); + goto err; + + thpsize->order = order; ret = kobject_init_and_add(&thpsize->kobj, &thpsize_ktype, parent, "hugepages-%lukB", size); if (ret) { kfree(thpsize); - return ERR_PTR(ret); + goto err; } - ret = sysfs_create_group(&thpsize->kobj, &thpsize_attr_group); - if (ret) { - kobject_put(&thpsize->kobj); - return ERR_PTR(ret); + + ret = sysfs_add_group(&thpsize->kobj, &any_ctrl_attr_grp); + if (ret) + goto err_put; + + ret = sysfs_add_group(&thpsize->kobj, &any_stats_attr_grp); + if (ret) + goto err_put; + + if (BIT(order) & THP_ORDERS_ALL_ANON) { + ret = sysfs_add_group(&thpsize->kobj, &anon_ctrl_attr_grp); + if (ret) + goto err_put; + + ret = sysfs_add_group(&thpsize->kobj, &anon_stats_attr_grp); + if (ret) + goto err_put; } - ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group); - if (ret) { - kobject_put(&thpsize->kobj); - return ERR_PTR(ret); + if (BIT(order) & THP_ORDERS_ALL_FILE_DEFAULT) { + ret = sysfs_add_group(&thpsize->kobj, &file_ctrl_attr_grp); + if (ret) + goto err_put; + + ret = sysfs_add_group(&thpsize->kobj, &file_stats_attr_grp); + if (ret) + goto err_put; } - thpsize->order = order; return thpsize; +err_put: + kobject_put(&thpsize->kobj); +err: + return ERR_PTR(ret); } static void thpsize_release(struct kobject *kobj) @@ -671,7 +755,7 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj) goto remove_hp_group; } - orders = THP_ORDERS_ALL_ANON; + orders = THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE_DEFAULT; order = highest_order(orders); while (orders) { thpsize = thpsize_create(order, *hugepage_kobj); From 5d65c8d758f2596c008009e39bb2614deed2c730 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Sat, 24 Aug 2024 13:04:40 +1200 Subject: [PATCH 296/416] mm: count the number of anonymous THPs per size Patch series "mm: count the number of anonymous THPs per size", v4. 
Knowing the number of transparent anon THPs in the system is crucial for performance analysis. It helps in understanding the ratio and distribution of THPs versus small folios throughout the system. Additionally, partial unmapping by userspace can lead to significant waste of THPs over time and increase memory reclamation pressure. We need this information for comprehensive system tuning. This patch (of 2): Let's track for each anonymous THP size, how many of them are currently allocated. We'll track the complete lifespan of an anon THP, starting when it becomes an anon THP ("large anon folio") (->mapping gets set), until it gets freed (->mapping gets cleared). Introduce a new "nr_anon" counter per THP size and adjust the corresponding counter in the following cases: * We allocate a new THP and call folio_add_new_anon_rmap() to map it the first time and turn it into an anon THP. * We split an anon THP into multiple smaller ones. * We migrate an anon THP, when we prepare the destination. * We free an anon THP back to the buddy. Note that AnonPages in /proc/meminfo currently tracks the total number of *mapped* anonymous *pages*, and therefore has slightly different semantics. In the future, we might also want to track "nr_anon_mapped" for each THP size, which might be helpful when comparing it to the number of allocated anon THPs (long-term pinning, stuck in swapcache, memory leaks, ...). Further note that for now, we only track anon THPs after they got their ->mapping set, for example via folio_add_new_anon_rmap(). If we would allocate some in the swapcache, they will only show up in the statistics for now after they have been mapped to user space the first time, where we call folio_add_new_anon_rmap(). [akpm@linux-foundation.org: documentation fixups, per David] Link: https://lkml.kernel.org/r/3e8add35-e26b-443b-8a04-1078f4bc78f6@redhat.com Link: https://lkml.kernel.org/r/20240824010441.21308-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240824010441.21308-2-21cnbao@gmail.com Signed-off-by: Barry Song Acked-by: David Hildenbrand Cc: Baolin Wang Cc: Chris Li Cc: Chuanhua Han Cc: Kairui Song Cc: Kalesh Singh Cc: Lance Yang Cc: Ryan Roberts Cc: Shuai Yuan Cc: Usama Arif Cc: Zi Yan Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/transhuge.rst | 5 +++++ include/linux/huge_mm.h | 15 +++++++++++++-- mm/huge_memory.c | 13 ++++++++++--- mm/migrate.c | 4 ++++ mm/page_alloc.c | 5 ++++- mm/rmap.c | 1 + 6 files changed, 37 insertions(+), 6 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 79435c537e21..d6874234a09a 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -551,6 +551,11 @@ split_deferred it would free up some memory. Pages on split queue are going to be split under memory pressure, if splitting is possible. +nr_anon + the number of anonymous THP we have in the whole system. These THPs + might be currently entirely mapped or have partially unmapped/unused + subpages. + As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a huge page for use. 
There are some counters in ``/proc/vmstat`` to help diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 4c32058cacfe..2ee2971e4e10 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -126,6 +126,7 @@ enum mthp_stat_item { MTHP_STAT_SPLIT, MTHP_STAT_SPLIT_FAILED, MTHP_STAT_SPLIT_DEFERRED, + MTHP_STAT_NR_ANON, __MTHP_STAT_COUNT }; @@ -136,14 +137,24 @@ struct mthp_stat { DECLARE_PER_CPU(struct mthp_stat, mthp_stats); -static inline void count_mthp_stat(int order, enum mthp_stat_item item) +static inline void mod_mthp_stat(int order, enum mthp_stat_item item, int delta) { if (order <= 0 || order > PMD_ORDER) return; - this_cpu_inc(mthp_stats.stats[order][item]); + this_cpu_add(mthp_stats.stats[order][item], delta); } + +static inline void count_mthp_stat(int order, enum mthp_stat_item item) +{ + mod_mthp_stat(order, item, 1); +} + #else +static inline void mod_mthp_stat(int order, enum mthp_stat_item item, int delta) +{ +} + static inline void count_mthp_stat(int order, enum mthp_stat_item item) { } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 3e872ee724cc..ea57c51d478d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -597,6 +597,7 @@ DEFINE_MTHP_STAT_ATTR(shmem_fallback_charge, MTHP_STAT_SHMEM_FALLBACK_CHARGE); DEFINE_MTHP_STAT_ATTR(split, MTHP_STAT_SPLIT); DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED); DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED); +DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON); static struct attribute *anon_stats_attrs[] = { &anon_fault_alloc_attr.attr, @@ -607,6 +608,7 @@ static struct attribute *anon_stats_attrs[] = { &swpout_fallback_attr.attr, #endif &split_deferred_attr.attr, + &nr_anon_attr.attr, NULL, }; @@ -3314,8 +3316,9 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, struct deferred_split *ds_queue = get_deferred_split_queue(folio); /* reset xarray order to new order after split */ XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order); - struct anon_vma *anon_vma = NULL; + bool is_anon = folio_test_anon(folio); struct address_space *mapping = NULL; + struct anon_vma *anon_vma = NULL; int order = folio_order(folio); int extra_pins, ret; pgoff_t end; @@ -3327,7 +3330,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, if (new_order >= folio_order(folio)) return -EINVAL; - if (folio_test_anon(folio)) { + if (is_anon) { /* order-1 is not supported for anonymous THP. 
*/ if (new_order == 1) { VM_WARN_ONCE(1, "Cannot split to order-1 folio"); @@ -3367,7 +3370,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, if (folio_test_writeback(folio)) return -EBUSY; - if (folio_test_anon(folio)) { + if (is_anon) { /* * The caller does not necessarily hold an mmap_lock that would * prevent the anon_vma disappearing so we first we take a @@ -3480,6 +3483,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, } } + if (is_anon) { + mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1); + mod_mthp_stat(new_order, MTHP_STAT_NR_ANON, 1 << (order - new_order)); + } __split_huge_page(page, list, end, new_order); ret = 0; } else { diff --git a/mm/migrate.c b/mm/migrate.c index ba4893d42618..6f9c62c746be 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -449,6 +449,8 @@ static int __folio_migrate_mapping(struct address_space *mapping, /* No turning back from here */ newfolio->index = folio->index; newfolio->mapping = folio->mapping; + if (folio_test_anon(folio) && folio_test_large(folio)) + mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON, 1); if (folio_test_swapbacked(folio)) __folio_set_swapbacked(newfolio); @@ -473,6 +475,8 @@ static int __folio_migrate_mapping(struct address_space *mapping, */ newfolio->index = folio->index; newfolio->mapping = folio->mapping; + if (folio_test_anon(folio) && folio_test_large(folio)) + mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON, 1); folio_ref_add(newfolio, nr); /* add cache reference */ if (folio_test_swapbacked(folio)) { __folio_set_swapbacked(newfolio); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6dd0b409e3c2..3469ca78bf7a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1084,8 +1084,11 @@ __always_inline bool free_pages_prepare(struct page *page, (page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; } } - if (PageMappingFlags(page)) + if (PageMappingFlags(page)) { + if (PageAnon(page)) + mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1); page->mapping = NULL; + } if (is_check_pages_enabled()) { if (free_page_is_bad(page)) bad++; diff --git a/mm/rmap.c b/mm/rmap.c index 1103a536e474..78529cf0fd66 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1467,6 +1467,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, } __folio_mod_stat(folio, nr, nr_pmdmapped); + mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON, 1); } static __always_inline void __folio_add_file_rmap(struct folio *folio, From 8175ebfd302abe6fbdca9037f763ecbfdb8db572 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Sat, 24 Aug 2024 13:04:41 +1200 Subject: [PATCH 297/416] mm: count the number of partially mapped anonymous THPs per size When a THP is added to the deferred_list due to partially mapped, its partial pages are unused, leading to wasted memory and potentially increasing memory reclamation pressure. Detailing the specifics of how unmapping occurs is quite difficult and not that useful, so we adopt a simple approach: each time a THP enters the deferred_list, we increment the count by 1; whenever it leaves for any reason, we decrement the count by 1. 
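The rule amounts to a single per-order counter update at every point where a folio is queued on or removed from the deferred split list; a minimal sketch (the helper below is illustrative only, the patch open-codes the mod_mthp_stat() calls at each site):

static void account_partially_mapped(struct folio *folio, bool queued)
{
	mod_mthp_stat(folio_order(folio),
		      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, queued ? 1 : -1);
}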
Link: https://lkml.kernel.org/r/20240824010441.21308-3-21cnbao@gmail.com Signed-off-by: Barry Song Acked-by: David Hildenbrand Cc: Baolin Wang Cc: Chris Li Cc: Chuanhua Han Cc: Kairui Song Cc: Kalesh Singh Cc: Lance Yang Cc: Ryan Roberts Cc: Shuai Yuan Cc: Usama Arif Cc: Zi Yan Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/transhuge.rst | 7 +++++++ include/linux/huge_mm.h | 1 + mm/huge_memory.c | 6 ++++++ 3 files changed, 14 insertions(+) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index d6874234a09a..56a086900651 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -556,6 +556,13 @@ nr_anon might be currently entirely mapped or have partially unmapped/unused subpages. +nr_anon_partially_mapped + the number of anonymous THP which are likely partially mapped, possibly + wasting memory, and have been queued for deferred memory reclamation. + Note that in corner some cases (e.g., failed migration), we might detect + an anonymous THP as "partially mapped" and count it here, even though it + is not actually partially mapped anymore. + As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a huge page for use. There are some counters in ``/proc/vmstat`` to help diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 2ee2971e4e10..4902e2f7e896 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -127,6 +127,7 @@ enum mthp_stat_item { MTHP_STAT_SPLIT_FAILED, MTHP_STAT_SPLIT_DEFERRED, MTHP_STAT_NR_ANON, + MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, __MTHP_STAT_COUNT }; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ea57c51d478d..efc625086dc7 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -598,6 +598,7 @@ DEFINE_MTHP_STAT_ATTR(split, MTHP_STAT_SPLIT); DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED); DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED); DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON); +DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED); static struct attribute *anon_stats_attrs[] = { &anon_fault_alloc_attr.attr, @@ -609,6 +610,7 @@ static struct attribute *anon_stats_attrs[] = { #endif &split_deferred_attr.attr, &nr_anon_attr.attr, + &nr_anon_partially_mapped_attr.attr, NULL, }; @@ -3457,6 +3459,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, if (folio_order(folio) > 1 && !list_empty(&folio->_deferred_list)) { ds_queue->split_queue_len--; + mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); /* * Reinitialize page_deferred_list after removing the * page from the split_queue, otherwise a subsequent @@ -3523,6 +3526,7 @@ void __folio_undo_large_rmappable(struct folio *folio) spin_lock_irqsave(&ds_queue->split_queue_lock, flags); if (!list_empty(&folio->_deferred_list)) { ds_queue->split_queue_len--; + mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); list_del_init(&folio->_deferred_list); } spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); @@ -3564,6 +3568,7 @@ void deferred_split_folio(struct folio *folio) if (folio_test_pmd_mappable(folio)) count_vm_event(THP_DEFERRED_SPLIT_PAGE); count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED); + mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, 1); list_add_tail(&folio->_deferred_list, &ds_queue->split_queue); 
ds_queue->split_queue_len++; #ifdef CONFIG_MEMCG @@ -3611,6 +3616,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, list_move(&folio->_deferred_list, &list); } else { /* We lost race with folio_put() */ + mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); list_del_init(&folio->_deferred_list); ds_queue->split_queue_len--; } From 97b76796ccd07727b58ed739fcd60f2174d5ac18 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 26 Aug 2024 21:21:35 +0100 Subject: [PATCH 298/416] swap: convert swapon() to use a folio Retrieve a folio from the page cache rather than a page. Saves a couple of conversions between page & folio. Link: https://lkml.kernel.org/r/20240826202138.3804238-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- mm/swapfile.c | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 8e317c495bfb..0cded32414a1 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3357,7 +3357,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) unsigned char *swap_map = NULL; unsigned long *zeromap = NULL; struct swap_cluster_info *cluster_info = NULL; - struct page *page = NULL; + struct folio *folio = NULL; struct inode *inode = NULL; bool inced_nr_rotate_swap = false; @@ -3415,12 +3415,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) error = -EINVAL; goto bad_swap_unlock_inode; } - page = read_mapping_page(mapping, 0, swap_file); - if (IS_ERR(page)) { - error = PTR_ERR(page); + folio = read_mapping_folio(mapping, 0, swap_file); + if (IS_ERR(folio)) { + error = PTR_ERR(folio); goto bad_swap_unlock_inode; } - swap_header = kmap(page); + swap_header = kmap_local_folio(folio, 0); maxpages = read_swap_header(si, swap_header, inode); if (unlikely(!maxpages)) { @@ -3574,10 +3574,8 @@ bad_swap: if (swap_file) filp_close(swap_file, NULL); out: - if (page && !IS_ERR(page)) { - kunmap(page); - put_page(page); - } + if (!IS_ERR_OR_NULL(folio)) + folio_release_kmap(folio, swap_header); if (name) putname(name); if (inode) From 5c8525a37b78ddfee84ff6927b8838013ff2521e Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Mon, 26 Aug 2024 14:58:09 +0800 Subject: [PATCH 299/416] mm: migrate_device: convert to migrate_device_coherent_folio() Patch series "mm: finish isolate/putback_lru_page()". Convert to use more folios in migrate_device.c, then we could remove isolate_lru_page() and putback_lru_page(). This patch (of 6): Save a few calls to compound_head() and use folio throughout. 
Link: https://lkml.kernel.org/r/20240826065814.1336616-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240826065814.1336616-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Acked-by: David Hildenbrand Reviewed-by: Vishal Moola (Oracle) Reviewed-by: Alistair Popple Cc: Baolin Wang Cc: Jonathan Corbet Cc: Matthew Wilcox (Oracle) Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/gup.c | 2 +- mm/internal.h | 2 +- mm/migrate_device.c | 30 +++++++++++++++--------------- 3 files changed, 17 insertions(+), 17 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index d19884e097fd..5defd5e6d8f8 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2335,7 +2335,7 @@ static int migrate_longterm_unpinnable_folios( folio_get(folio); gup_put_folio(folio, 1, FOLL_PIN); - if (migrate_device_coherent_page(&folio->page)) { + if (migrate_device_coherent_folio(folio)) { ret = -EBUSY; goto err; } diff --git a/mm/internal.h b/mm/internal.h index b00ea4595d18..2f995f91a088 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1208,7 +1208,7 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf, int *last_cpupid); void free_zone_device_folio(struct folio *folio); -int migrate_device_coherent_page(struct page *page); +int migrate_device_coherent_folio(struct folio *folio); /* * mm/gup.c diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 6d66dc1c6ffa..82d75205dda8 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -708,7 +708,7 @@ static void __migrate_device_pages(unsigned long *src_pfns, /* * The only time there is no vma is when called from - * migrate_device_coherent_page(). However this isn't + * migrate_device_coherent_folio(). However this isn't * called if the page could not be unmapped. */ VM_BUG_ON(!migrate); @@ -921,38 +921,38 @@ int migrate_device_range(unsigned long *src_pfns, unsigned long start, EXPORT_SYMBOL(migrate_device_range); /* - * Migrate a device coherent page back to normal memory. The caller should have - * a reference on page which will be copied to the new page if migration is + * Migrate a device coherent folio back to normal memory. The caller should have + * a reference on folio which will be copied to the new folio if migration is * successful or dropped on failure. */ -int migrate_device_coherent_page(struct page *page) +int migrate_device_coherent_folio(struct folio *folio) { unsigned long src_pfn, dst_pfn = 0; - struct page *dpage; + struct folio *dfolio; - WARN_ON_ONCE(PageCompound(page)); + WARN_ON_ONCE(folio_test_large(folio)); - lock_page(page); - src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE; + folio_lock(folio); + src_pfn = migrate_pfn(folio_pfn(folio)) | MIGRATE_PFN_MIGRATE; /* * We don't have a VMA and don't need to walk the page tables to find - * the source page. So call migrate_vma_unmap() directly to unmap the - * page as migrate_vma_setup() will fail if args.vma == NULL. + * the source folio. So call migrate_vma_unmap() directly to unmap the + * folio as migrate_vma_setup() will fail if args.vma == NULL. 
*/ migrate_device_unmap(&src_pfn, 1, NULL); if (!(src_pfn & MIGRATE_PFN_MIGRATE)) return -EBUSY; - dpage = alloc_page(GFP_USER | __GFP_NOWARN); - if (dpage) { - lock_page(dpage); - dst_pfn = migrate_pfn(page_to_pfn(dpage)); + dfolio = folio_alloc(GFP_USER | __GFP_NOWARN, 0); + if (dfolio) { + folio_lock(dfolio); + dst_pfn = migrate_pfn(folio_pfn(dfolio)); } migrate_device_pages(&src_pfn, &dst_pfn, 1); if (src_pfn & MIGRATE_PFN_MIGRATE) - copy_highpage(dpage, page); + folio_copy(dfolio, folio); migrate_device_finalize(&src_pfn, &dst_pfn, 1); if (src_pfn & MIGRATE_PFN_MIGRATE) From 53456b7b3f4c3427ff04ae5c92e6dba1b9bfbb23 Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Mon, 26 Aug 2024 14:58:10 +0800 Subject: [PATCH 300/416] mm: migrate_device: use a folio in migrate_device_range() Save two calls to compound_head() and use folio throughout. Link: https://lkml.kernel.org/r/20240826065814.1336616-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Acked-by: David Hildenbrand Reviewed-by: Vishal Moola (Oracle) Reviewed-by: Alistair Popple Cc: Baolin Wang Cc: Jonathan Corbet Cc: Matthew Wilcox (Oracle) Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/migrate_device.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 82d75205dda8..66db28b89f9b 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -898,16 +898,17 @@ int migrate_device_range(unsigned long *src_pfns, unsigned long start, unsigned long i, pfn; for (pfn = start, i = 0; i < npages; pfn++, i++) { - struct page *page = pfn_to_page(pfn); + struct folio *folio; - if (!get_page_unless_zero(page)) { + folio = folio_get_nontail_page(pfn_to_page(pfn)); + if (!folio) { src_pfns[i] = 0; continue; } - if (!trylock_page(page)) { + if (!folio_trylock(folio)) { src_pfns[i] = 0; - put_page(page); + folio_put(folio); continue; } From 39e618d986e46b02d0f0efcdeb7693f65a51a917 Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Mon, 26 Aug 2024 14:58:11 +0800 Subject: [PATCH 301/416] mm: migrate_device: use more folio in migrate_device_unmap() The page for migrate_device_unmap() already has a reference, so it is safe to convert the page to folio to save a few calls to compound_head(), which removes the last isolate_lru_page() call. 
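For context, the isolation step now follows the same shape as isolate_folio_to_list() introduced earlier in the series; a minimal sketch (illustrative only, without this function's exact restore/error handling):

static bool isolate_page_as_folio(struct page *page, struct list_head *list)
{
	struct folio *folio = page_folio(page);

	/* folio_isolate_lru() takes its own reference on success. */
	if (!folio_isolate_lru(folio))
		return false;

	list_add(&folio->lru, list);
	return true;
}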
Link: https://lkml.kernel.org/r/20240826065814.1336616-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Acked-by: David Hildenbrand Reviewed-by: Vishal Moola (Oracle) Reviewed-by: Alistair Popple Cc: Baolin Wang Cc: Jonathan Corbet Cc: Matthew Wilcox (Oracle) Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/migrate_device.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 66db28b89f9b..b49f4956617a 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -379,33 +379,33 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, continue; } - /* ZONE_DEVICE pages are not on LRU */ - if (!is_zone_device_page(page)) { - if (!PageLRU(page) && allow_drain) { + folio = page_folio(page); + /* ZONE_DEVICE folios are not on LRU */ + if (!folio_is_zone_device(folio)) { + if (!folio_test_lru(folio) && allow_drain) { /* Drain CPU's lru cache */ lru_add_drain_all(); allow_drain = false; } - if (!isolate_lru_page(page)) { + if (!folio_isolate_lru(folio)) { src_pfns[i] &= ~MIGRATE_PFN_MIGRATE; restore++; continue; } /* Drop the reference we took in collect */ - put_page(page); + folio_put(folio); } - folio = page_folio(page); if (folio_mapped(folio)) try_to_migrate(folio, 0); - if (page_mapped(page) || + if (folio_mapped(folio) || !migrate_vma_check_page(page, fault_page)) { - if (!is_zone_device_page(page)) { - get_page(page); - putback_lru_page(page); + if (!folio_is_zone_device(folio)) { + folio_get(folio); + folio_putback_lru(folio); } src_pfns[i] &= ~MIGRATE_PFN_MIGRATE; From 58bf8c2bf47550bc94fea9cafd2bc7304d97102c Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Mon, 26 Aug 2024 14:58:12 +0800 Subject: [PATCH 302/416] mm: migrate_device: use more folio in migrate_device_finalize() Saves a couple of calls to compound_head() and remove last two callers of putback_lru_page(). 
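The release rule the rewritten finalize path follows is worth stating on its own; a minimal sketch (the helper is illustrative only): zone-device folios are simply released, anything else goes back onto the LRU.

static void release_migration_folio(struct folio *folio)
{
	if (folio_is_zone_device(folio))
		folio_put(folio);
	else
		folio_putback_lru(folio);
}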
Link: https://lkml.kernel.org/r/20240826065814.1336616-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Reviewed-by: Vishal Moola (Oracle) Reviewed-by: Alistair Popple Cc: Baolin Wang Cc: David Hildenbrand Cc: Jonathan Corbet Cc: Matthew Wilcox (Oracle) Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/migrate_device.c | 41 ++++++++++++++++++++++------------------- 1 file changed, 22 insertions(+), 19 deletions(-) diff --git a/mm/migrate_device.c b/mm/migrate_device.c index b49f4956617a..6ea3d055f520 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -815,42 +815,45 @@ void migrate_device_finalize(unsigned long *src_pfns, unsigned long i; for (i = 0; i < npages; i++) { - struct folio *dst, *src; + struct folio *dst = NULL, *src = NULL; struct page *newpage = migrate_pfn_to_page(dst_pfns[i]); struct page *page = migrate_pfn_to_page(src_pfns[i]); + if (newpage) + dst = page_folio(newpage); + if (!page) { - if (newpage) { - unlock_page(newpage); - put_page(newpage); + if (dst) { + folio_unlock(dst); + folio_put(dst); } continue; } - if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE) || !newpage) { - if (newpage) { - unlock_page(newpage); - put_page(newpage); + src = page_folio(page); + + if (!(src_pfns[i] & MIGRATE_PFN_MIGRATE) || !dst) { + if (dst) { + folio_unlock(dst); + folio_put(dst); } - newpage = page; + dst = src; } - src = page_folio(page); - dst = page_folio(newpage); remove_migration_ptes(src, dst, false); folio_unlock(src); - if (is_zone_device_page(page)) - put_page(page); + if (folio_is_zone_device(src)) + folio_put(src); else - putback_lru_page(page); + folio_putback_lru(src); - if (newpage != page) { - unlock_page(newpage); - if (is_zone_device_page(newpage)) - put_page(newpage); + if (dst != src) { + folio_unlock(dst); + if (folio_is_zone_device(dst)) + folio_put(dst); else - putback_lru_page(newpage); + folio_putback_lru(dst); } } } From 775d28fd45a2f5e58d8927ce77693398b1f074a6 Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Mon, 26 Aug 2024 14:58:13 +0800 Subject: [PATCH 303/416] mm: remove isolate_lru_page() There are no more callers of isolate_lru_page(), remove it. [wangkefeng.wang@huawei.com: convert page to folio in comment and document, per Matthew] Link: https://lkml.kernel.org/r/20240826144114.1928071-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240826065814.1336616-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Reviewed-by: Vishal Moola (Oracle) Cc: Alistair Popple Cc: Baolin Wang Cc: David Hildenbrand Cc: Jonathan Corbet Cc: Matthew Wilcox (Oracle) Cc: Zi Yan Signed-off-by: Andrew Morton --- Documentation/mm/page_migration.rst | 22 +++++++++---------- Documentation/mm/unevictable-lru.rst | 4 ++-- .../translations/zh_CN/mm/page_migration.rst | 6 ++--- mm/filemap.c | 2 +- mm/folio-compat.c | 7 ------ mm/internal.h | 1 - mm/khugepaged.c | 8 +++---- mm/migrate_device.c | 4 ++-- mm/swap.c | 4 ++-- 9 files changed, 25 insertions(+), 33 deletions(-) diff --git a/Documentation/mm/page_migration.rst b/Documentation/mm/page_migration.rst index f1ce67a26615..519b35a4caf5 100644 --- a/Documentation/mm/page_migration.rst +++ b/Documentation/mm/page_migration.rst @@ -63,15 +63,15 @@ and then a low level description of how the low level details work. In kernel use of migrate_pages() ================================ -1. Remove pages from the LRU. +1. Remove folios from the LRU. - Lists of pages to be migrated are generated by scanning over - pages and moving them into lists. This is done by - calling isolate_lru_page(). 
- Calling isolate_lru_page() increases the references to the page - so that it cannot vanish while the page migration occurs. + Lists of folios to be migrated are generated by scanning over + folios and moving them into lists. This is done by + calling folio_isolate_lru(). + Calling folio_isolate_lru() increases the references to the folio + so that it cannot vanish while the folio migration occurs. It also prevents the swapper or other scans from encountering - the page. + the folio. 2. We need to have a function of type new_folio_t that can be passed to migrate_pages(). This function should figure out @@ -84,10 +84,10 @@ In kernel use of migrate_pages() How migrate_pages() works ========================= -migrate_pages() does several passes over its list of pages. A page is moved -if all references to a page are removable at the time. The page has -already been removed from the LRU via isolate_lru_page() and the refcount -is increased so that the page cannot be freed while page migration occurs. +migrate_pages() does several passes over its list of folios. A folio is moved +if all references to a folio are removable at the time. The folio has +already been removed from the LRU via folio_isolate_lru() and the refcount +is increased so that the folio cannot be freed while folio migration occurs. Steps: diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst index 255ef12a432b..8d11fe6a0854 100644 --- a/Documentation/mm/unevictable-lru.rst +++ b/Documentation/mm/unevictable-lru.rst @@ -80,7 +80,7 @@ on an additional LRU list for a few reasons: (2) We want to be able to migrate unevictable folios between nodes for memory defragmentation, workload management and memory hotplug. The Linux kernel can only migrate folios that it can successfully isolate from the LRU - lists (or "Movable" pages: outside of consideration here). If we were to + lists (or "Movable" folios: outside of consideration here). If we were to maintain folios elsewhere than on an LRU-like list, where they can be detected by folio_isolate_lru(), we would prevent their migration. @@ -230,7 +230,7 @@ In Nick's patch, he used one of the struct page LRU list link fields as a count of VM_LOCKED VMAs that map the page (Rik van Riel had the same idea three years earlier). But this use of the link field for a count prevented the management of the pages on an LRU list, and thus mlocked pages were not migratable as -isolate_lru_page() could not detect them, and the LRU list link field was not +folio_isolate_lru() could not detect them, and the LRU list link field was not available to the migration subsystem. Nick resolved this by putting mlocked pages back on the LRU list before diff --git a/Documentation/translations/zh_CN/mm/page_migration.rst b/Documentation/translations/zh_CN/mm/page_migration.rst index f95063826a15..8c8461c6cb9f 100644 --- a/Documentation/translations/zh_CN/mm/page_migration.rst +++ b/Documentation/translations/zh_CN/mm/page_migration.rst @@ -50,8 +50,8 @@ mbind()设置一个新的内存策略。一个进程的页面也可以通过sys_ 1. 
从LRU中移除页面。 - 要迁移的页面列表是通过扫描页面并把它们移到列表中来生成的。这是通过调用 isolate_lru_page() - 来完成的。调用isolate_lru_page()增加了对该页的引用,这样在页面迁移发生时它就不会 + 要迁移的页面列表是通过扫描页面并把它们移到列表中来生成的。这是通过调用 folio_isolate_lru() + 来完成的。调用folio_isolate_lru()增加了对该页的引用,这样在页面迁移发生时它就不会 消失。它还可以防止交换器或其他扫描器遇到该页。 @@ -65,7 +65,7 @@ migrate_pages()如何工作 ======================= migrate_pages()对它的页面列表进行了多次处理。如果当时对一个页面的所有引用都可以被移除, -那么这个页面就会被移动。该页已经通过isolate_lru_page()从LRU中移除,并且refcount被 +那么这个页面就会被移动。该页已经通过folio_isolate_lru()从LRU中移除,并且refcount被 增加,以便在页面迁移发生时不释放该页。 步骤: diff --git a/mm/filemap.c b/mm/filemap.c index 070dee9791a9..88a2ed008474 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -114,7 +114,7 @@ * ->private_lock (try_to_unmap_one) * ->i_pages lock (try_to_unmap_one) * ->lruvec->lru_lock (follow_page_mask->mark_page_accessed) - * ->lruvec->lru_lock (check_pte_range->isolate_lru_page) + * ->lruvec->lru_lock (check_pte_range->folio_isolate_lru) * ->private_lock (folio_remove_rmap_pte->set_page_dirty) * ->i_pages lock (folio_remove_rmap_pte->set_page_dirty) * bdi.wb->list_lock (folio_remove_rmap_pte->set_page_dirty) diff --git a/mm/folio-compat.c b/mm/folio-compat.c index f05906006b3c..47b8cc245aa4 100644 --- a/mm/folio-compat.c +++ b/mm/folio-compat.c @@ -93,13 +93,6 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping, } EXPORT_SYMBOL(grab_cache_page_write_begin); -bool isolate_lru_page(struct page *page) -{ - if (WARN_RATELIMIT(PageTail(page), "trying to isolate tail page")) - return false; - return folio_isolate_lru((struct folio *)page); -} - void putback_lru_page(struct page *page) { folio_putback_lru(page_folio(page)); diff --git a/mm/internal.h b/mm/internal.h index 2f995f91a088..b456544dc05c 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -416,7 +416,6 @@ extern unsigned long highest_memmap_pfn; /* * in mm/vmscan.c: */ -bool isolate_lru_page(struct page *page); bool folio_isolate_lru(struct folio *folio); void putback_lru_page(struct page *page); void folio_putback_lru(struct folio *folio); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 4a83c40d9053..ab646018ce25 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -627,8 +627,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, } /* - * We can do it before isolate_lru_page because the - * page can't be freed from under us. NOTE: PG_lock + * We can do it before folio_isolate_lru because the + * folio can't be freed from under us. NOTE: PG_lock * is needed to serialize against split_huge_page * when invoked from the VM. 
*/ @@ -1874,7 +1874,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, result = SCAN_FAIL; goto xa_unlocked; } - /* drain lru cache to help isolate_lru_page() */ + /* drain lru cache to help folio_isolate_lru() */ lru_add_drain(); } else if (folio_trylock(folio)) { folio_get(folio); @@ -1889,7 +1889,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, page_cache_sync_readahead(mapping, &file->f_ra, file, index, end - index); - /* drain lru cache to help isolate_lru_page() */ + /* drain lru cache to help folio_isolate_lru() */ lru_add_drain(); folio = filemap_lock_folio(mapping, index); if (IS_ERR(folio)) { diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 6ea3d055f520..8d687de88a03 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -328,8 +328,8 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page) /* * One extra ref because caller holds an extra reference, either from - * isolate_lru_page() for a regular page, or migrate_vma_collect() for - * a device page. + * folio_isolate_lru() for a regular folio, or migrate_vma_collect() for + * a device folio. */ int extra = 1 + (page == fault_page); diff --git a/mm/swap.c b/mm/swap.c index 6b838986d294..835bdf324b76 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -906,8 +906,8 @@ atomic_t lru_disable_count = ATOMIC_INIT(0); /* * lru_cache_disable() needs to be called before we start compiling - * a list of pages to be migrated using isolate_lru_page(). - * It drains pages on LRU cache and then disable on all cpus until + * a list of folios to be migrated using folio_isolate_lru(). + * It drains folios on LRU cache and then disable on all cpus until * lru_cache_enable is called. * * Must be paired with a call to lru_cache_enable(). From 24f937796c1ac8d16947495c39ffff416f7125ee Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Mon, 26 Aug 2024 14:58:14 +0800 Subject: [PATCH 304/416] mm: remove putback_lru_page() There are no more callers of putback_lru_page(), remove it. Link: https://lkml.kernel.org/r/20240826065814.1336616-7-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Acked-by: David Hildenbrand Reviewed-by: Vishal Moola (Oracle) Cc: Alistair Popple Cc: Baolin Wang Cc: Jonathan Corbet Cc: Matthew Wilcox (Oracle) Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/folio-compat.c | 5 ----- mm/internal.h | 1 - 2 files changed, 6 deletions(-) diff --git a/mm/folio-compat.c b/mm/folio-compat.c index 47b8cc245aa4..80746182e9e8 100644 --- a/mm/folio-compat.c +++ b/mm/folio-compat.c @@ -92,8 +92,3 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping, mapping_gfp_mask(mapping)); } EXPORT_SYMBOL(grab_cache_page_write_begin); - -void putback_lru_page(struct page *page) -{ - folio_putback_lru(page_folio(page)); -} diff --git a/mm/internal.h b/mm/internal.h index b456544dc05c..44c8dec1f0d7 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -417,7 +417,6 @@ extern unsigned long highest_memmap_pfn; * in mm/vmscan.c: */ bool folio_isolate_lru(struct folio *folio); -void putback_lru_page(struct page *page); void folio_putback_lru(struct folio *folio); extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason); From b7315fbb64736acad3cd33851884b222b00dc0f4 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 25 Aug 2024 21:23:20 -0700 Subject: [PATCH 305/416] mm/damon/core: introduce per-context region priorities histogram buffer Patch series "replace per-quota region priorities histogram buffer with per-context one". 
Each DAMOS quota (struct damos_quota) maintains a histogram for total regions size per its prioritization score. DAMOS calcultes minimum prioritization score of regions that are ok to apply the DAMOS action to while respecting the quota. The histogram is constructed only for the calculation of the minimum score in damos_adjust_quota() for each quota which called by kdamond_fn(). Hence, there is no real reason to have per-quota histogram. Only per-kdamond histogram is needed, since parallel kdamonds could have races otherwise. The current implementation is only wasting the memory, and can easily cause unintended stack usage[1]. So, introducing a per-kdamond histogram and replacing the per-quota one with it would be the right solution for the issue. However, supporting multiple DAMON contexts per kdamond is still an ongoing work[2] without a clear estimated time of arrival. Meanwhile, per-context histogram could be an effective and straightforward solution having no blocker. Let's fix the problem first in the way. This patch (of 4): Introduce per-context buffer for region priority scores-total size histogram. Same to the per-quota one (->histogram of struct damos_quota), the new buffer is hidden from DAMON API users by being defined as a private field of DAMON context structure. It is dynamically allocated and de-allocated at the beginning and ending of the execution of the kdamond by kdamond_fn() itself. [1] commit 0742cadf5e4c ("mm/damon/lru_sort: adjust local variable to dynamic allocation") [2] https://lore.kernel.org/20240531122320.909060-1-yorha.op@gmail.com Link: https://lkml.kernel.org/r/20240826042323.87025-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240826042323.87025-2-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/damon.h | 2 ++ mm/damon/core.c | 5 +++++ 2 files changed, 7 insertions(+) diff --git a/include/linux/damon.h b/include/linux/damon.h index 27c546bfc6d4..60cb8eada3d6 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -630,6 +630,8 @@ struct damon_ctx { unsigned long next_ops_update_sis; /* for waiting until the execution of the kdamond_fn is started */ struct completion kdamond_started; + /* for scheme quotas prioritization */ + unsigned long *regions_score_histogram; /* public: */ struct task_struct *kdamond; diff --git a/mm/damon/core.c b/mm/damon/core.c index 086294279b07..40266d85eebf 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -1957,6 +1957,10 @@ static int kdamond_fn(void *data) ctx->ops.init(ctx); if (ctx->callback.before_start && ctx->callback.before_start(ctx)) goto done; + ctx->regions_score_histogram = kmalloc_array(DAMOS_MAX_SCORE + 1, + sizeof(*ctx->regions_score_histogram), GFP_KERNEL); + if (!ctx->regions_score_histogram) + goto done; sz_limit = damon_region_sz_limit(ctx); @@ -2034,6 +2038,7 @@ done: ctx->callback.before_terminate(ctx); if (ctx->ops.cleanup) ctx->ops.cleanup(ctx); + kfree(ctx->regions_score_histogram); pr_debug("kdamond (%d) finishes\n", current->pid); mutex_lock(&ctx->kdamond_lock); From 304b95847f2852070a692beef10c98f5e36af631 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 25 Aug 2024 21:23:21 -0700 Subject: [PATCH 306/416] mm/damon/core: replace per-quota regions priority histogram buffer usage with per-context one Replace the usage of per-quota region priorities histogram buffer with the per-context one. After this change, the per-quota histogram is not used by anyone, and hence it is ready to be removed. 
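For clarity, the quota prioritization that now consumes the per-context buffer can be condensed roughly as below. This is an illustrative sketch with a made-up helper name; it omits the target-validity filtering done by the real damos_adjust_quota(), and field names follow the structures added earlier in this series.

/*
 * Condensed, illustrative view of the prioritization step after this
 * change (not verbatim kernel code): every region's score charges the
 * shared per-context histogram, then the cut-off score is found by
 * walking down from the highest score until the quota's effective
 * size (esz) is consumed.
 */
static unsigned int sketch_damos_min_score(struct damon_ctx *c,
					   struct damos *s,
					   struct damos_quota *quota)
{
	unsigned long cumulated_sz = 0;
	unsigned int score, max_score = 0;
	struct damon_target *t;
	struct damon_region *r;

	memset(c->regions_score_histogram, 0,
	       sizeof(*c->regions_score_histogram) * (DAMOS_MAX_SCORE + 1));
	damon_for_each_target(t, c)
		damon_for_each_region(r, t) {
			score = c->ops.get_scheme_score(c, t, r, s);
			c->regions_score_histogram[score] += damon_sz_region(r);
			if (score > max_score)
				max_score = score;
		}
	for (score = max_score; ; score--) {
		cumulated_sz += c->regions_score_histogram[score];
		if (cumulated_sz >= quota->esz || !score)
			break;
	}
	return score;	/* regions scoring below this are skipped */
}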
Link: https://lkml.kernel.org/r/20240826042323.87025-3-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/core.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/mm/damon/core.c b/mm/damon/core.c index 40266d85eebf..8b99c5a99c38 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -1582,13 +1582,16 @@ static void damos_adjust_quota(struct damon_ctx *c, struct damos *s) return; /* Fill up the score histogram */ - memset(quota->histogram, 0, sizeof(quota->histogram)); + memset(c->regions_score_histogram, 0, + sizeof(*c->regions_score_histogram) * + (DAMOS_MAX_SCORE + 1)); damon_for_each_target(t, c) { damon_for_each_region(r, t) { if (!__damos_valid_target(r, s)) continue; score = c->ops.get_scheme_score(c, t, r, s); - quota->histogram[score] += damon_sz_region(r); + c->regions_score_histogram[score] += + damon_sz_region(r); if (score > max_score) max_score = score; } @@ -1596,7 +1599,7 @@ static void damos_adjust_quota(struct damon_ctx *c, struct damos *s) /* Set the min score limit */ for (cumulated_sz = 0, score = max_score; ; score--) { - cumulated_sz += quota->histogram[score]; + cumulated_sz += c->regions_score_histogram[score]; if (cumulated_sz >= quota->esz || !score) break; } From e3bcb1672583f2e1cf456e4303ce562a3cc775c0 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 25 Aug 2024 21:23:22 -0700 Subject: [PATCH 307/416] mm/damon/core: remove per-scheme region priority histogram buffer Nobody is reading from or writing to the per-scheme region priorities histogram buffer. It is only wasting memory. Remove it. Link: https://lkml.kernel.org/r/20240826042323.87025-4-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/damon.h | 1 - 1 file changed, 1 deletion(-) diff --git a/include/linux/damon.h b/include/linux/damon.h index 60cb8eada3d6..a67f2c4940e9 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -233,7 +233,6 @@ struct damos_quota { unsigned long charge_addr_from; /* For prioritization */ - unsigned long histogram[DAMOS_MAX_SCORE + 1]; unsigned int min_score; /* For feedback loop */ From 2986846437e21dcfdd7531d50a0020e98087b586 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 25 Aug 2024 21:23:23 -0700 Subject: [PATCH 308/416] Revert "mm/damon/lru_sort: adjust local variable to dynamic allocation" This reverts commit 0742cadf5e4c ("mm/damon/lru_sort: adjust local variable to dynamic allocation"). The commit was introduced to avoid unnecessary usage of stack memory for per-scheme region priorities histogram buffer. The fix is nice, but the point of the fix looks not very clear if the commit message is not read together. That's mainly because the buffer is a private field, which means it is hidden from the DAMON API users. That's not the fault of the fix but the underlying data structure. Now the per-scheme histogram buffer is gone, so the problem that the commit was fixing is also removed. The use of kmemdup() has no more point but just making the code bit difficult to understand. Revert the fix. 
Link: https://lkml.kernel.org/r/20240826042323.87025-5-sj@kernel.org Signed-off-by: SeongJae Park Signed-off-by: Andrew Morton --- mm/damon/lru_sort.c | 15 ++++----------- 1 file changed, 4 insertions(+), 11 deletions(-) diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c index 0b35bd5fb659..4af8fd4a390b 100644 --- a/mm/damon/lru_sort.c +++ b/mm/damon/lru_sort.c @@ -148,17 +148,12 @@ static struct damon_target *target; static struct damos *damon_lru_sort_new_scheme( struct damos_access_pattern *pattern, enum damos_action action) { - struct damos *damos; - struct damos_quota *quota = kmemdup(&damon_lru_sort_quota, - sizeof(damon_lru_sort_quota), GFP_KERNEL); - - if (!quota) - return NULL; + struct damos_quota quota = damon_lru_sort_quota; /* Use half of total quota for hot/cold pages sorting */ - quota->ms = quota->ms / 2; + quota.ms = quota.ms / 2; - damos = damon_new_scheme( + return damon_new_scheme( /* find the pattern, and */ pattern, /* (de)prioritize on LRU-lists */ @@ -166,12 +161,10 @@ static struct damos *damon_lru_sort_new_scheme( /* for each aggregation interval */ 0, /* under the quota. */ - quota, + "a, /* (De)activate this according to the watermarks. */ &damon_lru_sort_wmarks, NUMA_NO_NODE); - kfree(quota); - return damos; } /* Create a DAMON-based operation scheme for hot memory regions */ From 23a425aab05f6d7da1f787b7f2c8d02d91871ca3 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 25 Aug 2024 18:57:39 -0700 Subject: [PATCH 309/416] Docs/damon: use damonitor GitHub organization instead of awslabs Patch series "Docs/damon: update GitHub repo URLs and maintainer-profile". Replace GitHub URLS on DAMON documents for none-kernel parts DAMON repos with new ones[1] via the first patch. With following two patches, wordsmith maitnainer-profile for better readability, and document the Google clendsar for bi-weekly meetups, respectively. [1] https://lore.kernel.org/20240813232158.83903-1-sj@kernel.org This patch (of 3): GitHub repos for non-kernel parts of DAMON project including 'damo', 'damon-tests' and 'damoos' will be moved[1] from 'awslabs' org to 'damonitor', by 2024-09-05. Update related URLs in kernel tree. [1] https://lore.kernel.org/20240813232158.83903-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240826015741.80707-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240826015741.80707-2-sj@kernel.org Signed-off-by: SeongJae Park Cc: Alex Shi Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet Cc: Yanteng Si Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/start.rst | 4 ++-- Documentation/admin-guide/mm/damon/usage.rst | 8 ++++---- Documentation/mm/damon/design.rst | 2 +- Documentation/mm/damon/maintainer-profile.rst | 8 ++++---- .../translations/zh_CN/admin-guide/mm/damon/start.rst | 4 ++-- .../translations/zh_CN/admin-guide/mm/damon/usage.rst | 8 ++++---- .../translations/zh_TW/admin-guide/mm/damon/start.rst | 4 ++-- .../translations/zh_TW/admin-guide/mm/damon/usage.rst | 8 ++++---- 8 files changed, 23 insertions(+), 23 deletions(-) diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index 054010a7f3d8..c4dddf6733cd 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -7,7 +7,7 @@ Getting Started This document briefly describes how you can use DAMON by demonstrating its default user space tool. Please note that this document describes only a part of its features for brevity. 
Please refer to the usage `doc -`_ of the tool for more +`_ of the tool for more details. @@ -26,7 +26,7 @@ User Space Tool For the demonstration, we will use the default user space tool for DAMON, called DAMON Operator (DAMO). It is available at -https://github.com/awslabs/damo. The examples below assume that ``damo`` is on +https://github.com/damonitor/damo. The examples below assume that ``damo`` is on your ``$PATH``. It's not mandatory, though. Because DAMO is using the sysfs interface (refer to :doc:`usage` for the diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 26df6cfa4441..d9be9f7caa7d 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -7,19 +7,19 @@ Detailed Usages DAMON provides below interfaces for different users. - *DAMON user space tool.* - `This `_ is for privileged people such as + `This `_ is for privileged people such as system administrators who want a just-working human-friendly interface. Using this, users can use the DAMON’s major features in a human-friendly way. It may not be highly tuned for special cases, though. For more detail, please refer to its `usage document - `_. + `_. - *sysfs interface.* :ref:`This ` is for privileged user space programmers who want more optimized use of DAMON. Using this, users can use DAMON’s major features by reading from and writing to special sysfs files. Therefore, you can write and use your personalized DAMON sysfs wrapper programs that reads/writes the sysfs files instead of you. The `DAMON user space tool - `_ is one example of such programs. + `_ is one example of such programs. - *Kernel Space Programming Interface.* :doc:`This ` is for kernel space programmers. Using this, users can utilize every feature of DAMON most flexibly and efficiently by @@ -543,7 +543,7 @@ memory rate becomes larger than 60%, or lower than 30%". :: # echo 300 > watermarks/low Please note that it's highly recommended to use user space tools like `damo -`_ rather than manually reading and writing +`_ rather than manually reading and writing the files as above. Above is only for an example. .. _tracepoint: diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst index 8730c246ceaa..f9c50525bdbf 100644 --- a/Documentation/mm/damon/design.rst +++ b/Documentation/mm/damon/design.rst @@ -586,7 +586,7 @@ API, and return the results to the user-space. The ABIs are designed to be used for user space applications development, rather than human beings' fingers. Human users are recommended to use such user space tools. One such Python-written user space tool is available at -Github (https://github.com/awslabs/damo), Pypi +Github (https://github.com/damonitor/damo), Pypi (https://pypistats.org/packages/damo), and Fedora (https://packages.fedoraproject.org/pkgs/python-damo/damo/). diff --git a/Documentation/mm/damon/maintainer-profile.rst b/Documentation/mm/damon/maintainer-profile.rst index feccf6a0f6c3..048b78e6d989 100644 --- a/Documentation/mm/damon/maintainer-profile.rst +++ b/Documentation/mm/damon/maintainer-profile.rst @@ -91,9 +91,9 @@ list (damon@lists.linux.dev). .. [1] https://git.kernel.org/akpm/mm/h/mm-unstable .. [2] https://git.kernel.org/sj/h/damon/next .. [3] https://git.kernel.org/akpm/mm/h/mm-stable -.. [4] https://github.com/awslabs/damon-tests/blob/master/corr/run.sh#L49 -.. [5] https://github.com/awslabs/damon-tests/blob/master/corr/tests/kunit.sh -.. 
[6] https://github.com/awslabs/damon-tests/tree/master/corr -.. [7] https://github.com/awslabs/damon-tests/tree/master/perf +.. [4] https://github.com/damonitor/damon-tests/blob/master/corr/run.sh#L49 +.. [5] https://github.com/damonitor/damon-tests/blob/master/corr/tests/kunit.sh +.. [6] https://github.com/damonitor/damon-tests/tree/master/corr +.. [7] https://github.com/damonitor/damon-tests/tree/master/perf .. [8] https://github.com/damonitor/hackermail .. [9] https://docs.google.com/document/d/1v43Kcj3ly4CYqmAkMaZzLiM2GEnWfgdGbZAH3mi2vpM/edit?usp=sharing diff --git a/Documentation/translations/zh_CN/admin-guide/mm/damon/start.rst b/Documentation/translations/zh_CN/admin-guide/mm/damon/start.rst index bf21ff84f396..cff7b6f98c59 100644 --- a/Documentation/translations/zh_CN/admin-guide/mm/damon/start.rst +++ b/Documentation/translations/zh_CN/admin-guide/mm/damon/start.rst @@ -15,7 +15,7 @@ 本文通过演示DAMON的默认用户空间工具,简要地介绍了如何使用DAMON。请注意,为了简洁 起见,本文档只描述了它的部分功能。更多细节请参考该工具的使用文档。 -`doc `_ . +`doc `_ . 前提条件 @@ -31,7 +31,7 @@ ------------ 在演示中,我们将使用DAMON的默认用户空间工具,称为DAMON Operator(DAMO)。它可以在 -https://github.com/awslabs/damo找到。下面的例子假设DAMO在你的$PATH上。当然,但 +https://github.com/damonitor/damo找到。下面的例子假设DAMO在你的$PATH上。当然,但 这并不是强制性的。 因为DAMO使用了DAMON的sysfs接口(详情请参考:doc:`usage`),你应该确保 diff --git a/Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst b/Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst index da2745464ece..50f6f0b6bf11 100644 --- a/Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst +++ b/Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst @@ -16,16 +16,16 @@ DAMON 为不同的用户提供了下面这些接口。 - *DAMON用户空间工具。* - `这 `_ 为有这特权的人, 如系统管理员,希望有一个刚好 + `这 `_ 为有这特权的人, 如系统管理员,希望有一个刚好 可以工作的人性化界面。 使用它,用户可以以人性化的方式使用DAMON的主要功能。不过,它可能不会为特殊情况进行高度调整。 它同时支持虚拟和物理地址空间的监测。更多细节,请参考它的 `使用文档 - `_。 + `_。 - *sysfs接口。* :ref:`这 ` 是为那些希望更高级的使用DAMON的特权用户空间程序员准备的。 使用它,用户可以通过读取和写入特殊的sysfs文件来使用DAMON的主要功能。因此,你可以编写和使 用你个性化的DAMON sysfs包装程序,代替你读/写sysfs文件。 `DAMON用户空间工具 - `_ 就是这种程序的一个例子 它同时支持虚拟和物理地址 + `_ 就是这种程序的一个例子 它同时支持虚拟和物理地址 空间的监测。注意,这个界面只提供简单的监测结果 :ref:`统计 `。对于详细的监测 结果,DAMON提供了一个:ref:`跟踪点 `。 - *debugfs interface.* @@ -332,7 +332,7 @@ tried_regions// # echo 500 > watermarks/mid # echo 300 > watermarks/low -请注意,我们强烈建议使用用户空间的工具,如 `damo `_ , +请注意,我们强烈建议使用用户空间的工具,如 `damo `_ , 而不是像上面那样手动读写文件。以上只是一个例子。 debugfs接口 diff --git a/Documentation/translations/zh_TW/admin-guide/mm/damon/start.rst b/Documentation/translations/zh_TW/admin-guide/mm/damon/start.rst index 1822956be0e0..57d36bfbb1b3 100644 --- a/Documentation/translations/zh_TW/admin-guide/mm/damon/start.rst +++ b/Documentation/translations/zh_TW/admin-guide/mm/damon/start.rst @@ -15,7 +15,7 @@ 本文通過演示DAMON的默認用戶空間工具,簡要地介紹瞭如何使用DAMON。請注意,爲了簡潔 起見,本文檔只描述了它的部分功能。更多細節請參考該工具的使用文檔。 -`doc `_ . +`doc `_ . 
前提條件 @@ -31,7 +31,7 @@ ------------ 在演示中,我們將使用DAMON的默認用戶空間工具,稱爲DAMON Operator(DAMO)。它可以在 -https://github.com/awslabs/damo找到。下面的例子假設DAMO在你的$PATH上。當然,但 +https://github.com/damonitor/damo找到。下面的例子假設DAMO在你的$PATH上。當然,但 這並不是強制性的。 因爲DAMO使用了DAMON的sysfs接口(詳情請參考:doc:`usage`),你應該確保 diff --git a/Documentation/translations/zh_TW/admin-guide/mm/damon/usage.rst b/Documentation/translations/zh_TW/admin-guide/mm/damon/usage.rst index 7464279f9b7d..fbbbbad59ee4 100644 --- a/Documentation/translations/zh_TW/admin-guide/mm/damon/usage.rst +++ b/Documentation/translations/zh_TW/admin-guide/mm/damon/usage.rst @@ -16,16 +16,16 @@ DAMON 爲不同的用戶提供了下面這些接口。 - *DAMON用戶空間工具。* - `這 `_ 爲有這特權的人, 如系統管理員,希望有一個剛好 + `這 `_ 爲有這特權的人, 如系統管理員,希望有一個剛好 可以工作的人性化界面。 使用它,用戶可以以人性化的方式使用DAMON的主要功能。不過,它可能不會爲特殊情況進行高度調整。 它同時支持虛擬和物理地址空間的監測。更多細節,請參考它的 `使用文檔 - `_。 + `_。 - *sysfs接口。* :ref:`這 ` 是爲那些希望更高級的使用DAMON的特權用戶空間程序員準備的。 使用它,用戶可以通過讀取和寫入特殊的sysfs文件來使用DAMON的主要功能。因此,你可以編寫和使 用你個性化的DAMON sysfs包裝程序,代替你讀/寫sysfs文件。 `DAMON用戶空間工具 - `_ 就是這種程序的一個例子 它同時支持虛擬和物理地址 + `_ 就是這種程序的一個例子 它同時支持虛擬和物理地址 空間的監測。注意,這個界面只提供簡單的監測結果 :ref:`統計 `。對於詳細的監測 結果,DAMON提供了一個:ref:`跟蹤點 `。 - *debugfs interface.* @@ -332,7 +332,7 @@ tried_regions// # echo 500 > watermarks/mid # echo 300 > watermarks/low -請注意,我們強烈建議使用用戶空間的工具,如 `damo `_ , +請注意,我們強烈建議使用用戶空間的工具,如 `damo `_ , 而不是像上面那樣手動讀寫文件。以上只是一個例子。 debugfs接口 From 2e9b3d6e2e59934a315d2d14b805ea0a2f287426 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 25 Aug 2024 18:57:40 -0700 Subject: [PATCH 310/416] Docs/damon/maintainer-profile: add links in place maintainer-profile.rst for DAMON separates the links and target definitions. It is not really necessary, and only makes the readability worse. At least the definitions need the section title (say, "References"). Just add the links in place on the doc. Link: https://lkml.kernel.org/r/20240826015741.80707-3-sj@kernel.org Signed-off-by: SeongJae Park Cc: Alex Shi Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet Cc: Yanteng Si Signed-off-by: Andrew Morton --- Documentation/mm/damon/maintainer-profile.rst | 80 ++++++++++--------- 1 file changed, 42 insertions(+), 38 deletions(-) diff --git a/Documentation/mm/damon/maintainer-profile.rst b/Documentation/mm/damon/maintainer-profile.rst index 048b78e6d989..3c1b42b062ea 100644 --- a/Documentation/mm/damon/maintainer-profile.rst +++ b/Documentation/mm/damon/maintainer-profile.rst @@ -7,23 +7,27 @@ The DAMON subsystem covers the files that are listed in 'DATA ACCESS MONITOR' section of 'MAINTAINERS' file. The mailing lists for the subsystem are damon@lists.linux.dev and -linux-mm@kvack.org. Patches should be made against the mm-unstable tree [1]_ -whenever possible and posted to the mailing lists. +linux-mm@kvack.org. Patches should be made against the mm-unstable `tree +` whenever possible and posted to +the mailing lists. SCM Trees --------- There are multiple Linux trees for DAMON development. Patches under -development or testing are queued in damon/next [2]_ by the DAMON maintainer. -Sufficiently reviewed patches will be queued in mm-unstable [1]_ by the memory -management subsystem maintainer. After more sufficient tests, the patches will -be queued in mm-stable [3]_ , and finally pull-requested to the mainline by the -memory management subsystem maintainer. +development or testing are queued in `damon/next +` by the DAMON maintainer. +Sufficiently reviewed patches will be queued in `mm-unstable +` by the memory management +subsystem maintainer. 
After more sufficient tests, the patches will be queued +in `mm-stable ` , and finally +pull-requested to the mainline by the memory management subsystem maintainer. -Note again the patches for mm-unstable tree [1]_ are queued by the memory +Note again the patches for mm-unstable `tree +` are queued by the memory management subsystem maintainer. If the patches requires some patches in -damon/next tree [2]_ which not yet merged in mm-unstable, please make sure the -requirement is clearly specified. +damon/next `tree ` which not yet merged +in mm-unstable, please make sure the requirement is clearly specified. Submit checklist addendum ------------------------- @@ -32,18 +36,27 @@ When making DAMON changes, you should do below. - Build changes related outputs including kernel and documents. - Ensure the builds introduce no new errors or warnings. -- Run and ensure no new failures for DAMON selftests [4]_ and kunittests [5]_ . +- Run and ensure no new failures for DAMON `selftests + ` and + `kunittests + `. Further doing below and putting the results will be helpful. -- Run damon-tests/corr [6]_ for normal changes. -- Run damon-tests/perf [7]_ for performance changes. +- Run `damon-tests/corr + ` for normal + changes. +- Run `damon-tests/perf + ` for performance + changes. Key cycle dates --------------- -Patches can be sent anytime. Key cycle dates of the mm-unstable [1]_ and -mm-stable [3]_ trees depend on the memory management subsystem maintainer. +Patches can be sent anytime. Key cycle dates of the `mm-unstable +` and `mm-stable +` trees depend on the memory +management subsystem maintainer. Review cadence -------------- @@ -58,16 +71,17 @@ Mailing tool Like many other Linux kernel subsystems, DAMON uses the mailing lists (damon@lists.linux.dev and linux-mm@kvack.org) as the major communication -channel. There is a simple tool called HacKerMaiL (``hkml``) [8]_ , which is -for people who are not very familiar with the mailing lists based -communication. The tool could be particularly helpful for DAMON community -members since it is developed and maintained by DAMON maintainer. The tool is -also officially announced to support DAMON and general Linux kernel development -workflow. +channel. There is a simple tool called `HacKerMaiL +` (``hkml``), which is for people who +are not very familiar with the mailing lists based communication. The tool +could be particularly helpful for DAMON community members since it is developed +and maintained by DAMON maintainer. The tool is also officially announced to +support DAMON and general Linux kernel development workflow. -In other words, ``hkml`` [8]_ is a mailing tool for DAMON community, which -DAMON maintainer is committed to support. Please feel free to try and report -issues or feature requests for the tool to the maintainer. +In other words, `hkml ` is a mailing +tool for DAMON community, which DAMON maintainer is committed to support. +Please feel free to try and report issues or feature requests for the tool to +the maintainer. Community meetup ---------------- @@ -83,17 +97,7 @@ members including the maintainer. The maintainer shares the available time slots, and attendees should reserve one of those at least 24 hours before the time slot, by reaching out to the maintainer. -Schedules and available reservation time slots are available at the Google doc -[9]_ . DAMON maintainer will also provide periodic reminder to the mailing -list (damon@lists.linux.dev). - - -.. [1] https://git.kernel.org/akpm/mm/h/mm-unstable -.. 
[2] https://git.kernel.org/sj/h/damon/next -.. [3] https://git.kernel.org/akpm/mm/h/mm-stable -.. [4] https://github.com/damonitor/damon-tests/blob/master/corr/run.sh#L49 -.. [5] https://github.com/damonitor/damon-tests/blob/master/corr/tests/kunit.sh -.. [6] https://github.com/damonitor/damon-tests/tree/master/corr -.. [7] https://github.com/damonitor/damon-tests/tree/master/perf -.. [8] https://github.com/damonitor/hackermail -.. [9] https://docs.google.com/document/d/1v43Kcj3ly4CYqmAkMaZzLiM2GEnWfgdGbZAH3mi2vpM/edit?usp=sharing +Schedules and available reservation time slots are available at the Google `doc +`. +DAMON maintainer will also provide periodic reminder to the mailing list +(damon@lists.linux.dev). From e9c0bfd704e360eb151ca2d65229fbe425c563e7 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Sun, 25 Aug 2024 18:57:41 -0700 Subject: [PATCH 311/416] Docs/damon/maintainer-profile: document Google calendar for bi-weekly meetups We added a public Google calendar for easy sharing of DAMON bi-weekly meetups[1]. Add it to the official document for a better visibility. [1] https://lore.kernel.org/all/20240717235812.53087-1-sj@kernel.org/ Link: https://lkml.kernel.org/r/20240826015741.80707-4-sj@kernel.org Signed-off-by: SeongJae Park Cc: Alex Shi Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet Cc: Yanteng Si Signed-off-by: Andrew Morton --- Documentation/mm/damon/maintainer-profile.rst | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/Documentation/mm/damon/maintainer-profile.rst b/Documentation/mm/damon/maintainer-profile.rst index 3c1b42b062ea..2365c9a3c1f0 100644 --- a/Documentation/mm/damon/maintainer-profile.rst +++ b/Documentation/mm/damon/maintainer-profile.rst @@ -99,5 +99,7 @@ time slot, by reaching out to the maintainer. Schedules and available reservation time slots are available at the Google `doc `. -DAMON maintainer will also provide periodic reminder to the mailing list -(damon@lists.linux.dev). +There is also a public Google `calendar +` +that has the events. Anyone can subscribe it. DAMON maintainer will also +provide periodic reminder to the mailing list (damon@lists.linux.dev). From 81528310698726e8f79b9c3de01ebe02567119dd Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Mon, 26 Aug 2024 01:24:21 +0000 Subject: [PATCH 312/416] maple_tree: arange64 node is not a leaf node mt_dump_arange64() only applies to an entry whose type is maple_arange_64, in which mte_is_leaf() must return false. Since mte_is_leaf() here is always false, we can remove this condition check. Link: https://lkml.kernel.org/r/20240826012422.29935-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Reviewed-by: Liam R. 
Howlett Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- lib/maple_tree.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 755ba8b18e14..f7a971c462eb 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -7203,7 +7203,6 @@ static void mt_dump_arange64(const struct maple_tree *mt, void *entry, enum mt_dump_format format) { struct maple_arange_64 *node = &mte_to_node(entry)->ma64; - bool leaf = mte_is_leaf(entry); unsigned long first = min; int i; @@ -7237,10 +7236,7 @@ static void mt_dump_arange64(const struct maple_tree *mt, void *entry, break; if (last == 0 && i > 0) break; - if (leaf) - mt_dump_entry(mt_slot(mt, node->slot, i), - first, last, depth + 1, format); - else if (node->slot[i]) + if (node->slot[i]) mt_dump_node(mt, mt_slot(mt, node->slot, i), first, last, depth + 1, format); From 21a449bedf3f01ea30ed657deeb2e8942ad45936 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Mon, 26 Aug 2024 01:24:22 +0000 Subject: [PATCH 313/416] maple_tree: dump error message based on format Just do what mt_dump_range64() does. Dump the error message based on format. Link: https://lkml.kernel.org/r/20240826012422.29935-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang Cc: Liam R. Howlett Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- lib/maple_tree.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index f7a971c462eb..8a3c2c344f62 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -7243,9 +7243,15 @@ static void mt_dump_arange64(const struct maple_tree *mt, void *entry, if (last == max) break; if (last > max) { - pr_err("node %p last (%lu) > max (%lu) at pivot %d!\n", + switch(format) { + case mt_dump_hex: + pr_err("node %p last (%lx) > max (%lx) at pivot %d!\n", node, last, max, i); - break; + break; + case mt_dump_dec: + pr_err("node %p last (%lu) > max (%lu) at pivot %d!\n", + node, last, max, i); + } } first = last + 1; } From de5b85262e2038a5ae5d281ddf43d35acb2bfa60 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Sun, 25 Aug 2024 15:42:45 -0700 Subject: [PATCH 314/416] mm: shmem: fix minor off-by-one in shrinkable calculation There has been a long-standing and very minor off-by-one, where shmem_get_folio_gfp() decides if a large folio extends beyond i_size far enough to leave a page or more for freeing later under pressure. This is not something needed for stable: but it will be proportionately more significant as support for smaller large folios is added, and is best fixed before duplicating the check in other places. 
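A worked example, with purely illustrative numbers, shows the boundary:

/*
 * Suppose a 16-page large folio covers indices 8..23, so
 * folio_next_index(folio) == 24, and i_size ends within index 22, so
 * DIV_ROUND_UP(i_size, PAGE_SIZE) == 23. Exactly one page (index 23)
 * lies wholly beyond EOF and could be freed by splitting. The old test,
 * 23 < 24 - 1, is false, so the inode never reached the shrinklist;
 * the corrected test, 23 < 24, is true whenever at least one whole
 * page of the folio is reclaimable:
 */
	if (folio_test_large(folio) &&
	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
					folio_next_index(folio)) {
		/* put the inode on sbinfo->shrinklist for later splitting */
	}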
Link: https://lkml.kernel.org/r/d8e75079-af2d-8519-56df-6be1dccc247a@google.com Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure") Signed-off-by: Hugh Dickins Reviewed-by: David Hildenbrand Reviewed-by: Baolin Wang Signed-off-by: Andrew Morton --- mm/shmem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/shmem.c b/mm/shmem.c index 800cec9dc534..bf50ecb906ce 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2348,7 +2348,7 @@ alloced: alloced = true; if (folio_test_large(folio) && DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) < - folio_next_index(folio) - 1) { + folio_next_index(folio)) { struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); struct shmem_inode_info *info = SHMEM_I(inode); /* From 15444054a537aca115bb077a77e99a9cc5ae11e6 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Sun, 25 Aug 2024 16:25:39 -0700 Subject: [PATCH 315/416] mm: shmem: extend shmem_unused_huge_shrink() to all sizes Although shmem_get_folio_gfp() is correctly putting inodes on the shrinklist according to the folio size, shmem_unused_huge_shrink() was still dealing with that shrinklist in terms of HPAGE_PMD_SIZE. Generalize that; and to handle the mixture of sizes more sensibly, shmem_alloc_and_add_folio() give it a number of pages to be freed (approximate: no need to minimize that with an exact calculation) instead of a number of inodes to split. [akpm@linux-foundation.org: comment tweak, per David] Link: https://lkml.kernel.org/r/d8c40850-6774-7a93-1e2c-8d941683b260@google.com Signed-off-by: Hugh Dickins Reviewed-by: David Hildenbrand Cc: Baolin Wang Cc: Hugh Dickins Cc: Kirill A. Shutemov Signed-off-by: Andrew Morton --- mm/shmem.c | 45 ++++++++++++++++++++------------------------- 1 file changed, 20 insertions(+), 25 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index bf50ecb906ce..553b99cb265e 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -636,15 +636,14 @@ static const char *shmem_format_huge(int huge) #endif static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo, - struct shrink_control *sc, unsigned long nr_to_split) + struct shrink_control *sc, unsigned long nr_to_free) { LIST_HEAD(list), *pos, *next; - LIST_HEAD(to_remove); struct inode *inode; struct shmem_inode_info *info; struct folio *folio; unsigned long batch = sc ? 
sc->nr_to_scan : 128; - int split = 0; + unsigned long split = 0, freed = 0; if (list_empty(&sbinfo->shrinklist)) return SHRINK_STOP; @@ -662,13 +661,6 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo, goto next; } - /* Check if there's anything to gain */ - if (round_up(inode->i_size, PAGE_SIZE) == - round_up(inode->i_size, HPAGE_PMD_SIZE)) { - list_move(&info->shrinklist, &to_remove); - goto next; - } - list_move(&info->shrinklist, &list); next: sbinfo->shrinklist_len--; @@ -677,34 +669,36 @@ next: } spin_unlock(&sbinfo->shrinklist_lock); - list_for_each_safe(pos, next, &to_remove) { - info = list_entry(pos, struct shmem_inode_info, shrinklist); - inode = &info->vfs_inode; - list_del_init(&info->shrinklist); - iput(inode); - } - list_for_each_safe(pos, next, &list) { + pgoff_t next, end; + loff_t i_size; int ret; - pgoff_t index; info = list_entry(pos, struct shmem_inode_info, shrinklist); inode = &info->vfs_inode; - if (nr_to_split && split >= nr_to_split) + if (nr_to_free && freed >= nr_to_free) goto move_back; - index = (inode->i_size & HPAGE_PMD_MASK) >> PAGE_SHIFT; - folio = filemap_get_folio(inode->i_mapping, index); - if (IS_ERR(folio)) + i_size = i_size_read(inode); + folio = filemap_get_entry(inode->i_mapping, i_size / PAGE_SIZE); + if (!folio || xa_is_value(folio)) goto drop; - /* No huge page at the end of the file: nothing to split */ + /* No large folio at the end of the file: nothing to split */ if (!folio_test_large(folio)) { folio_put(folio); goto drop; } + /* Check if there is anything to gain from splitting */ + next = folio_next_index(folio); + end = shmem_fallocend(inode, DIV_ROUND_UP(i_size, PAGE_SIZE)); + if (end <= folio->index || end >= next) { + folio_put(folio); + goto drop; + } + /* * Move the inode on the list back to shrinklist if we failed * to lock the page at this time. @@ -725,6 +719,7 @@ next: if (ret) goto move_back; + freed += next - end; split++; drop: list_del_init(&info->shrinklist); @@ -769,7 +764,7 @@ static long shmem_unused_huge_count(struct super_block *sb, #define shmem_huge SHMEM_HUGE_DENY static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo, - struct shrink_control *sc, unsigned long nr_to_split) + struct shrink_control *sc, unsigned long nr_to_free) { return 0; } @@ -1851,7 +1846,7 @@ allocated: * Try to reclaim some space by splitting a few * large folios beyond i_size on the filesystem. */ - shmem_unused_huge_shrink(sbinfo, NULL, 2); + shmem_unused_huge_shrink(sbinfo, NULL, pages); /* * And do a shmem_recalc_inode() to account for freed pages: * except our folio is there in cache, so not quite balanced. From 83362d223762fcb4048f135423862c760aa2780f Mon Sep 17 00:00:00 2001 From: Mateusz Guzik Date: Wed, 28 Aug 2024 18:07:04 +0200 Subject: [PATCH 316/416] mm/hugetlb: sort out global lock annotations The mutex array pointer shares a cacheline with the spinlock: ffffffff84187480 B hugetlb_fault_mutex_table ffffffff84187488 B hugetlb_lock This is because the former is annotated with a macro forcing cacheline alignment. I suspect it was meant to be the variant which on top of it makes sure the object does not share the cacheline with anyone. Since array pointer itself is de facto read-only such an annotation does not make sense there anyway. Instead mark it __ro_after_init along with the size var. Do however move the spinlock out of the way. 
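For reference, the resulting annotations look as below (mirroring the diff that follows). The practical effect is twofold: writers bouncing hugetlb_lock's cache line no longer drag the read-mostly mutex-table pointer along with it, and the pointer and count become write-protected once init completes.

/*
 * The contended spinlock keeps its own cacheline alignment; the
 * mutex-table pointer and its size are written once at boot and only
 * read afterwards, so __ro_after_init is the fitting annotation and no
 * alignment padding is needed for them.
 */
__cacheline_aligned_in_smp DEFINE_SPINLOCK(hugetlb_lock);

static int num_fault_mutexes __ro_after_init;
struct mutex *hugetlb_fault_mutex_table __ro_after_init;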
[akpm@linux-foundation.org: move section directives to the end of the definitions, per convention] [akpm@linux-foundation.org: DEFINE_SPINLOCK doesn't permit section modifiers at end-of-definition] Link: https://lkml.kernel.org/r/20240828160704.1425767-1-mjguzik@gmail.com Signed-off-by: Mateusz Guzik Cc: Davidlohr Bueso Cc: Muchun Song Signed-off-by: Andrew Morton --- mm/hugetlb.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 4461d27f7453..3faf5aad142d 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -72,14 +72,14 @@ static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata; * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages, * free_huge_pages, and surplus_huge_pages. */ -DEFINE_SPINLOCK(hugetlb_lock); +__cacheline_aligned_in_smp DEFINE_SPINLOCK(hugetlb_lock); /* * Serializes faults on the same logical page. This is used to * prevent spurious OOMs when the hugepage pool is fully utilized. */ -static int num_fault_mutexes; -struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp; +static int num_fault_mutexes __ro_after_init; +struct mutex *hugetlb_fault_mutex_table __ro_after_init; /* Forward declaration */ static int hugetlb_acct_memory(struct hstate *h, long delta); From 955abe0a1b41de5ba61fe4cd614ebc123084d499 Mon Sep 17 00:00:00 2001 From: Jason Wang Date: Sat, 31 Aug 2024 08:28:21 +1200 Subject: [PATCH 317/416] vduse: avoid using __GFP_NOFAIL MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn", v4. __GFP_NOFAIL carries the semantics of never failing, so its callers do not check the return value: %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller cannot handle allocation failures. The allocation could block indefinitely but will never return with failure. Testing for failure is pointless. However, __GFP_NOFAIL can sometimes fail if it exceeds size limits or is used with GFP_ATOMIC/GFP_NOWAIT in a non-sleepable context. This patchset handles illegal using __GFP_NOFAIL together with GFP_ATOMIC lacking __GFP_DIRECT_RECLAIM(without this, we can't do anything to reclaim memory to satisfy the nofail requirement) and improve related document and warnings. The proper size limits for __GFP_NOFAIL will be handled separately after more discussions. This patch (of 3): mm doesn't support non-blockable __GFP_NOFAIL allocation. Because persisting in providing __GFP_NOFAIL services for non-block users who cannot perform direct memory reclaim may only result in an endless busy loop. Therefore, in such cases, the current mm-core may directly return a NULL pointer: static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct alloc_context *ac) { ... if (gfp_mask & __GFP_NOFAIL) { /* * All existing users of the __GFP_NOFAIL are blockable, so warn * of any new users that actually require GFP_NOWAIT */ if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) goto fail; ... } ... fail: warn_alloc(gfp_mask, ac->nodemask, "page allocation failure: order:%u", order); got_pg: return page; } Unfortuantely, vpda does that nofail allocation under non-sleepable lock. 
A possible way to fix that is to move the pages allocation out of the lock into the caller, but having to allocate a huge number of pages and auxiliary page array seems to be problematic as well per Tetsuon: " You should implement proper error handling instead of using __GFP_NOFAIL if count can become large." So I chose another way, which does not release kernel bounce pages when user tries to register userspace bounce pages. Then we can avoid allocating in paths where failure is not expected.(e.g in the release). We pay this for more memory usage as we don't release kernel bounce pages but further optimizations could be done on top. [v-songbaohua@oppo.com: Refine the changelog] Link: https://lkml.kernel.org/r/20240830202823.21478-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240830202823.21478-2-21cnbao@gmail.com Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer") Signed-off-by: Barry Song Reviewed-by: Xie Yongji Tested-by: Xie Yongji Signed-off-by: Jason Wang Cc: Christoph Hellwig Cc: Christoph Lameter Cc: David Hildenbrand Cc: David Rientjes Cc: Hailong.Liu Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim Cc: Linus Torvalds Cc: Michal Hocko Cc: Pekka Enberg Cc: Roman Gushchin Cc: Uladzislau Rezki (Sony) Cc: Vlastimil Babka Cc: Yafang Shao Cc: Christoph Hellwig Cc: Davidlohr Bueso Cc: "Eugenio Pérez" Cc: Kees Cook Cc: Lorenzo Stoakes Cc: Maxime Coquelin Cc: "Michael S. Tsirkin" Cc: Xuan Zhuo Signed-off-by: Andrew Morton --- drivers/vdpa/vdpa_user/iova_domain.c | 19 +++++++++++-------- drivers/vdpa/vdpa_user/iova_domain.h | 1 + 2 files changed, 12 insertions(+), 8 deletions(-) diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c index 791d38d6284c..58116f89d8da 100644 --- a/drivers/vdpa/vdpa_user/iova_domain.c +++ b/drivers/vdpa/vdpa_user/iova_domain.c @@ -162,6 +162,7 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain, enum dma_data_direction dir) { struct vduse_bounce_map *map; + struct page *page; unsigned int offset; void *addr; size_t sz; @@ -178,7 +179,10 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain, map->orig_phys == INVALID_PHYS_ADDR)) return; - addr = kmap_local_page(map->bounce_page); + page = domain->user_bounce_pages ? 
+ map->user_bounce_page : map->bounce_page; + + addr = kmap_local_page(page); do_bounce(map->orig_phys + offset, addr + offset, sz, dir); kunmap_local(addr); size -= sz; @@ -270,9 +274,8 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain, memcpy_to_page(pages[i], 0, page_address(map->bounce_page), PAGE_SIZE); - __free_page(map->bounce_page); } - map->bounce_page = pages[i]; + map->user_bounce_page = pages[i]; get_page(pages[i]); } domain->user_bounce_pages = true; @@ -297,17 +300,17 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain) struct page *page = NULL; map = &domain->bounce_maps[i]; - if (WARN_ON(!map->bounce_page)) + if (WARN_ON(!map->user_bounce_page)) continue; /* Copy user page to kernel page if it's in use */ if (map->orig_phys != INVALID_PHYS_ADDR) { - page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL); + page = map->bounce_page; memcpy_from_page(page_address(page), - map->bounce_page, 0, PAGE_SIZE); + map->user_bounce_page, 0, PAGE_SIZE); } - put_page(map->bounce_page); - map->bounce_page = page; + put_page(map->user_bounce_page); + map->user_bounce_page = NULL; } domain->user_bounce_pages = false; out: diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h index f92f22a7267d..7f3f0928ec78 100644 --- a/drivers/vdpa/vdpa_user/iova_domain.h +++ b/drivers/vdpa/vdpa_user/iova_domain.h @@ -21,6 +21,7 @@ struct vduse_bounce_map { struct page *bounce_page; + struct page *user_bounce_page; u64 orig_phys; }; From 17d75422604f0b92869aa17cb44f60958212f033 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Sat, 31 Aug 2024 08:28:22 +1200 Subject: [PATCH 318/416] mm: document __GFP_NOFAIL must be blockable MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Non-blocking allocation with __GFP_NOFAIL is not supported and may still result in NULL pointers (if we don't return NULL, we result in busy-loop within non-sleepable contexts): static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct alloc_context *ac) { ... /* * Make sure that __GFP_NOFAIL request doesn't leak out and make sure * we always retry */ if (gfp_mask & __GFP_NOFAIL) { /* * All existing users of the __GFP_NOFAIL are blockable, so warn * of any new users that actually require GFP_NOWAIT */ if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) goto fail; ... } ... fail: warn_alloc(gfp_mask, ac->nodemask, "page allocation failure: order:%u", order); got_pg: return page; } Highlight this in the documentation of __GFP_NOFAIL so that non-mm subsystems can reject any illegal usage of __GFP_NOFAIL with GFP_ATOMIC, GFP_NOWAIT, etc. Link: https://lkml.kernel.org/r/20240830202823.21478-3-21cnbao@gmail.com Signed-off-by: Barry Song Acked-by: Michal Hocko Reviewed-by: Christoph Hellwig Acked-by: Vlastimil Babka Acked-by: Davidlohr Bueso Acked-by: David Hildenbrand Cc: Christoph Lameter Cc: David Rientjes Cc: "Eugenio Pérez" Cc: Hailong.Liu Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Jason Wang Cc: Joonsoo Kim Cc: Kees Cook Cc: Linus Torvalds Cc: Lorenzo Stoakes Cc: Maxime Coquelin Cc: "Michael S. 
Tsirkin" Cc: Pekka Enberg Cc: Roman Gushchin Cc: Uladzislau Rezki (Sony) Cc: Xuan Zhuo Cc: Christoph Hellwig Cc: Xie Yongji Cc: Yafang Shao Signed-off-by: Andrew Morton --- include/linux/gfp_types.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h index 313be4ad79fd..4a1fa7706b0c 100644 --- a/include/linux/gfp_types.h +++ b/include/linux/gfp_types.h @@ -215,7 +215,8 @@ enum { * the caller still has to check for failures) while costly requests try to be * not disruptive and back off even without invoking the OOM killer. * The following three modifiers might be used to override some of these - * implicit rules. + * implicit rules. Please note that all of them must be used along with + * %__GFP_DIRECT_RECLAIM flag. * * %__GFP_NORETRY: The VM implementation will try only very lightweight * memory direct reclaim to get some memory under memory pressure (thus @@ -246,6 +247,8 @@ enum { * cannot handle allocation failures. The allocation could block * indefinitely but will never return with failure. Testing for * failure is pointless. + * It _must_ be blockable and used together with __GFP_DIRECT_RECLAIM. + * It should _never_ be used in non-sleepable contexts. * New users should be evaluated carefully (and the flag should be * used only when there is no reasonable failure policy) but it is * definitely preferable to use the flag rather than opencode endless From 903edea6c53f097f5f0c847fdbbfab0c6c44f241 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Sat, 31 Aug 2024 08:28:23 +1200 Subject: [PATCH 319/416] mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three points for this change: 1. We should consolidate all warnings in one place. Currently, the order > 1 warning is in the hotpath, while others are in less likely scenarios. Moving all warnings to the slowpath will reduce the overhead for order > 1 and increase the visibility of other warnings. 2. We currently have two warnings for order: one for order > 1 in the hotpath and another for order > costly_order in the laziest path. I suggest standardizing on order > 1 since it's been in use for a long time. 3. We don't need to check for __GFP_NOWARN in this case. __GFP_NOWARN is meant to suppress allocation failure reports, but here we're dealing with bug detection, not allocation failures. So replace WARN_ON_ONCE_GFP by WARN_ON_ONCE. [v-songbaohua@oppo.com: also update the doc for __GFP_NOFAIL with order > 1] Link: https://lkml.kernel.org/r/20240903223935.1697-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240830202823.21478-4-21cnbao@gmail.com Signed-off-by: Barry Song Suggested-by: Vlastimil Babka Reviewed-by: Vlastimil Babka Acked-by: David Hildenbrand Acked-by: Michal Hocko Cc: Christoph Hellwig Cc: Christoph Lameter Cc: Davidlohr Bueso Cc: David Rientjes Cc: "Eugenio Pérez" Cc: Hailong.Liu Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Jason Wang Cc: Joonsoo Kim Cc: Kees Cook Cc: Linus Torvalds Cc: Lorenzo Stoakes Cc: Maxime Coquelin Cc: "Michael S. 
Tsirkin" Cc: Pekka Enberg Cc: Roman Gushchin Cc: Uladzislau Rezki (Sony) Cc: Xie Yongji Cc: Xuan Zhuo Cc: Yafang Shao Signed-off-by: Andrew Morton --- include/linux/gfp_types.h | 3 ++- mm/page_alloc.c | 50 +++++++++++++++++++-------------------- 2 files changed, 27 insertions(+), 26 deletions(-) diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h index 4a1fa7706b0c..65db9349f905 100644 --- a/include/linux/gfp_types.h +++ b/include/linux/gfp_types.h @@ -253,7 +253,8 @@ enum { * used only when there is no reasonable failure policy) but it is * definitely preferable to use the flag rather than opencode endless * loop around allocator. - * Using this flag for costly allocations is _highly_ discouraged. + * Allocating pages from the buddy with __GFP_NOFAIL and order > 1 is + * not supported. Please consider using kvmalloc() instead. */ #define __GFP_IO ((__force gfp_t)___GFP_IO) #define __GFP_FS ((__force gfp_t)___GFP_FS) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3469ca78bf7a..23509b823e2b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3032,12 +3032,6 @@ struct page *rmqueue(struct zone *preferred_zone, { struct page *page; - /* - * We most definitely don't want callers attempting to - * allocate greater than order-1 page units with __GFP_NOFAIL. - */ - WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); - if (likely(pcp_allowed_order(order))) { page = rmqueue_pcplist(preferred_zone, zone, order, migratetype, alloc_flags); @@ -4175,6 +4169,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, { bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM; bool can_compact = gfp_compaction_allowed(gfp_mask); + bool nofail = gfp_mask & __GFP_NOFAIL; const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER; struct page *page = NULL; unsigned int alloc_flags; @@ -4187,6 +4182,25 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, unsigned int zonelist_iter_cookie; int reserve_flags; + if (unlikely(nofail)) { + /* + * We most definitely don't want callers attempting to + * allocate greater than order-1 page units with __GFP_NOFAIL. + */ + WARN_ON_ONCE(order > 1); + /* + * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, + * otherwise, we may result in lockup. + */ + WARN_ON_ONCE(!can_direct_reclaim); + /* + * PF_MEMALLOC request from this context is rather bizarre + * because we cannot reclaim anything and only can loop waiting + * for somebody to do a work for us. 
+ */ + WARN_ON_ONCE(current->flags & PF_MEMALLOC); + } + restart: compaction_retries = 0; no_progress_loops = 0; @@ -4404,29 +4418,15 @@ nopage: * Make sure that __GFP_NOFAIL request doesn't leak out and make sure * we always retry */ - if (gfp_mask & __GFP_NOFAIL) { + if (unlikely(nofail)) { /* - * All existing users of the __GFP_NOFAIL are blockable, so warn - * of any new users that actually require GFP_NOWAIT + * Lacking direct_reclaim we can't do anything to reclaim memory, + * we disregard these unreasonable nofail requests and still + * return NULL */ - if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) + if (!can_direct_reclaim) goto fail; - /* - * PF_MEMALLOC request from this context is rather bizarre - * because we cannot reclaim anything and only can loop waiting - * for somebody to do a work for us - */ - WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask); - - /* - * non failing costly orders are a hard requirement which we - * are not prepared for much so let's warn about these users - * so that we can identify them and convert them to something - * else. - */ - WARN_ON_ONCE_GFP(costly_order, gfp_mask); - /* * Help non-failing allocations by giving some access to memory * reserves normally used for high priority non-blocking From b1f202060afeb7fcb98473929d26fd3d2093b067 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Fri, 30 Aug 2024 11:03:36 +0100 Subject: [PATCH 320/416] mm: remap unused subpages to shared zeropage when splitting isolated thp Patch series "mm: split underused THPs", v5. The current upstream default policy for THP is always. However, Meta uses madvise in production as the current THP=always policy vastly overprovisions THPs in sparsely accessed memory areas, resulting in excessive memory pressure and premature OOM killing. Using madvise + relying on khugepaged has certain drawbacks over THP=always. Using madvise hints mean THPs aren't "transparent" and require userspace changes. Waiting for khugepaged to scan memory and collapse pages into THP can be slow and unpredictable in terms of performance (i.e. you dont know when the collapse will happen), while production environments require predictable performance. If there is enough memory available, its better for both performance and predictability to have a THP from fault time, i.e. THP=always rather than wait for khugepaged to collapse it, and deal with sparsely populated THPs when the system is running out of memory. This patch series is an attempt to mitigate the issue of running out of memory when THP is always enabled. During runtime whenever a THP is being faulted in or collapsed by khugepaged, the THP is added to a list. Whenever memory reclaim happens, the kernel runs the deferred_split shrinker which goes through the list and checks if the THP was underused, i.e. how many of the base 4K pages of the entire THP were zero-filled. If this number goes above a certain threshold, the shrinker will attempt to split that THP. Then at remap time, the pages that were zero-filled are mapped to the shared zeropage, hence saving memory. This method avoids the downside of wasting memory in areas where THP is sparsely filled when THP is always enabled, while still providing the upside THPs like reduced TLB misses without having to use madvise. Meta production workloads that were CPU bound (>99% CPU utilzation) were tested with THP shrinker. 
The results after 2 hours are as follows: | THP=madvise | THP=always | THP=always | | | + shrinker series | | | + max_ptes_none=409 ----------------------------------------------------------------------------- Performance improvement | - | +1.8% | +1.7% (over THP=madvise) | | | ----------------------------------------------------------------------------- Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%) ----------------------------------------------------------------------------- max_ptes_none=409 means that any THP that has more than 409 out of 512 (80%) zero filled filled pages will be split. To test out the patches, the below commands without the shrinker will invoke OOM killer immediately and kill stress, but will not fail with the shrinker: echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none mkdir /sys/fs/cgroup/test echo $$ > /sys/fs/cgroup/test/cgroup.procs echo 20M > /sys/fs/cgroup/test/memory.max echo 0 > /sys/fs/cgroup/test/memory.swap.max # allocate twice memory.max for each stress worker and touch 40/512 of # each THP, i.e. vm-stride 50K. # With the shrinker, max_ptes_none of 470 and below won't invoke OOM # killer. # Without the shrinker, OOM killer is invoked immediately irrespective # of max_ptes_none value and kills stress. stress --vm 1 --vm-bytes 40M --vm-stride 50K This patch (of 5): Here being unused means containing only zeros and inaccessible to userspace. When splitting an isolated thp under reclaim or migration, the unused subpages can be mapped to the shared zeropage, hence saving memory. This is particularly helpful when the internal fragmentation of a thp is high, i.e. it has many untouched subpages. This is also a prerequisite for THP low utilization shrinker which will be introduced in later patches, where underutilized THPs are split, and the zero-filled pages are freed saving memory. 
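For orientation, the core of the remap-time logic added below in
mm/migrate.c (try_to_map_unused_to_zeropage()) boils down to the
following condensed sketch; see the diff for the full set of checks
(mlocked folios, VM_LOCKED vmas, mm_forbids_zeropage(), etc.):

	addr = kmap_local_page(page);
	contains_data = memchr_inv(addr, 0, PAGE_SIZE);
	kunmap_local(addr);

	if (!contains_data) {
		/* subpage is all zeroes: map the shared zeropage instead */
		newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
					       pvmw->vma->vm_page_prot));
		set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
		dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
	}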
Link: https://lkml.kernel.org/r/20240830100438.3623486-1-usamaarif642@gmail.com Link: https://lkml.kernel.org/r/20240830100438.3623486-3-usamaarif642@gmail.com Signed-off-by: Yu Zhao Signed-off-by: Usama Arif Tested-by: Shuang Zhai Cc: Alexander Zhu Cc: Barry Song Cc: David Hildenbrand Cc: Domenico Cerasuolo Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Kairui Song Cc: Matthew Wilcox Cc: Mike Rapoport Cc: Nico Pache Cc: Rik van Riel Cc: Roman Gushchin Cc: Ryan Roberts Cc: Shakeel Butt Cc: Shuang Zhai Cc: Hugh Dickins Signed-off-by: Andrew Morton --- include/linux/rmap.h | 7 +++- mm/huge_memory.c | 8 ++--- mm/migrate.c | 76 +++++++++++++++++++++++++++++++++++++------- mm/migrate_device.c | 4 +-- 4 files changed, 77 insertions(+), 18 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 91b5935e8485..d5e93e44322e 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -745,7 +745,12 @@ int folio_mkclean(struct folio *); int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, struct vm_area_struct *vma); -void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked); +enum rmp_flags { + RMP_LOCKED = 1 << 0, + RMP_USE_SHARED_ZEROPAGE = 1 << 1, +}; + +void remove_migration_ptes(struct folio *src, struct folio *dst, int flags); /* * rmap_walk_control: To control rmap traversing for specific needs diff --git a/mm/huge_memory.c b/mm/huge_memory.c index efc625086dc7..6b69309e002f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3004,7 +3004,7 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr, return false; } -static void remap_page(struct folio *folio, unsigned long nr) +static void remap_page(struct folio *folio, unsigned long nr, int flags) { int i = 0; @@ -3012,7 +3012,7 @@ static void remap_page(struct folio *folio, unsigned long nr) if (!folio_test_anon(folio)) return; for (;;) { - remove_migration_ptes(folio, folio, true); + remove_migration_ptes(folio, folio, RMP_LOCKED | flags); i += folio_nr_pages(folio); if (i >= nr) break; @@ -3222,7 +3222,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, if (nr_dropped) shmem_uncharge(folio->mapping->host, nr_dropped); - remap_page(folio, nr); + remap_page(folio, nr, PageAnon(head) ? RMP_USE_SHARED_ZEROPAGE : 0); /* * set page to its compound_head when split to non order-0 pages, so @@ -3498,7 +3498,7 @@ fail: if (mapping) xas_unlock(&xas); local_irq_enable(); - remap_page(folio, folio_nr_pages(folio)); + remap_page(folio, folio_nr_pages(folio), 0); ret = -EAGAIN; } diff --git a/mm/migrate.c b/mm/migrate.c index 6f9c62c746be..d039863e014b 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -204,13 +204,57 @@ bool isolate_folio_to_list(struct folio *folio, struct list_head *list) return true; } +static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw, + struct folio *folio, + unsigned long idx) +{ + struct page *page = folio_page(folio, idx); + bool contains_data; + pte_t newpte; + void *addr; + + VM_BUG_ON_PAGE(PageCompound(page), page); + VM_BUG_ON_PAGE(!PageAnon(page), page); + VM_BUG_ON_PAGE(!PageLocked(page), page); + VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page); + + if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags & VM_LOCKED) || + mm_forbids_zeropage(pvmw->vma->vm_mm)) + return false; + + /* + * The pmd entry mapping the old thp was flushed and the pte mapping + * this subpage has been non present. If the subpage is only zero-filled + * then map it to the shared zeropage. 
+ */ + addr = kmap_local_page(page); + contains_data = memchr_inv(addr, 0, PAGE_SIZE); + kunmap_local(addr); + + if (contains_data) + return false; + + newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address), + pvmw->vma->vm_page_prot)); + set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte); + + dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio)); + return true; +} + +struct rmap_walk_arg { + struct folio *folio; + bool map_unused_to_zeropage; +}; + /* * Restore a potential migration pte to a working pte entry */ static bool remove_migration_pte(struct folio *folio, - struct vm_area_struct *vma, unsigned long addr, void *old) + struct vm_area_struct *vma, unsigned long addr, void *arg) { - DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION); + struct rmap_walk_arg *rmap_walk_arg = arg; + DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr, PVMW_SYNC | PVMW_MIGRATION); while (page_vma_mapped_walk(&pvmw)) { rmap_t rmap_flags = RMAP_NONE; @@ -234,6 +278,9 @@ static bool remove_migration_pte(struct folio *folio, continue; } #endif + if (rmap_walk_arg->map_unused_to_zeropage && + try_to_map_unused_to_zeropage(&pvmw, folio, idx)) + continue; folio_get(folio); pte = mk_pte(new, READ_ONCE(vma->vm_page_prot)); @@ -312,14 +359,21 @@ static bool remove_migration_pte(struct folio *folio, * Get rid of all migration entries and replace them by * references to the indicated page. */ -void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked) +void remove_migration_ptes(struct folio *src, struct folio *dst, int flags) { - struct rmap_walk_control rwc = { - .rmap_one = remove_migration_pte, - .arg = src, + struct rmap_walk_arg rmap_walk_arg = { + .folio = src, + .map_unused_to_zeropage = flags & RMP_USE_SHARED_ZEROPAGE, }; - if (locked) + struct rmap_walk_control rwc = { + .rmap_one = remove_migration_pte, + .arg = &rmap_walk_arg, + }; + + VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src != dst), src); + + if (flags & RMP_LOCKED) rmap_walk_locked(dst, &rwc); else rmap_walk(dst, &rwc); @@ -934,7 +988,7 @@ static int writeout(struct address_space *mapping, struct folio *folio) * At this point we know that the migration attempt cannot * be successful. */ - remove_migration_ptes(folio, folio, false); + remove_migration_ptes(folio, folio, 0); rc = mapping->a_ops->writepage(&folio->page, &wbc); @@ -1098,7 +1152,7 @@ static void migrate_folio_undo_src(struct folio *src, struct list_head *ret) { if (page_was_mapped) - remove_migration_ptes(src, src, false); + remove_migration_ptes(src, src, 0); /* Drop an anon_vma reference if we took one */ if (anon_vma) put_anon_vma(anon_vma); @@ -1336,7 +1390,7 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private, lru_add_drain(); if (old_page_state & PAGE_WAS_MAPPED) - remove_migration_ptes(src, dst, false); + remove_migration_ptes(src, dst, 0); out_unlock_both: folio_unlock(dst); @@ -1474,7 +1528,7 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio, if (page_was_mapped) remove_migration_ptes(src, - rc == MIGRATEPAGE_SUCCESS ? dst : src, false); + rc == MIGRATEPAGE_SUCCESS ? 
dst : src, 0); unlock_put_anon: folio_unlock(dst); diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 8d687de88a03..9cf26592ac93 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -424,7 +424,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, continue; folio = page_folio(page); - remove_migration_ptes(folio, folio, false); + remove_migration_ptes(folio, folio, 0); src_pfns[i] = 0; folio_unlock(folio); @@ -840,7 +840,7 @@ void migrate_device_finalize(unsigned long *src_pfns, dst = src; } - remove_migration_ptes(src, dst, false); + remove_migration_ptes(src, dst, 0); folio_unlock(src); if (folio_is_zone_device(src)) From 391e86971161906e720f5d3c110ca9124c14fb55 Mon Sep 17 00:00:00 2001 From: Alexander Zhu Date: Fri, 30 Aug 2024 11:03:37 +0100 Subject: [PATCH 321/416] mm: selftest to verify zero-filled pages are mapped to zeropage When a THP is split, any subpage that is zero-filled will be mapped to the shared zeropage, hence saving memory. Add selftest to verify this by allocating zero-filled THP and comparing RssAnon before and after split. Link: https://lkml.kernel.org/r/20240830100438.3623486-4-usamaarif642@gmail.com Signed-off-by: Alexander Zhu Signed-off-by: Usama Arif Acked-by: Rik van Riel Cc: Barry Song Cc: David Hildenbrand Cc: Domenico Cerasuolo Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Kairui Song Cc: Matthew Wilcox Cc: Mike Rapoport Cc: Nico Pache Cc: Roman Gushchin Cc: Ryan Roberts Cc: Shakeel Butt Cc: Shuang Zhai Cc: Yu Zhao Cc: Shuang Zhai Cc: Hugh Dickins Signed-off-by: Andrew Morton --- .../selftests/mm/split_huge_page_test.c | 71 +++++++++++++++++++ tools/testing/selftests/mm/vm_util.c | 22 ++++++ tools/testing/selftests/mm/vm_util.h | 1 + 3 files changed, 94 insertions(+) diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c index e5e8dafc9d94..eb6d1b9fc362 100644 --- a/tools/testing/selftests/mm/split_huge_page_test.c +++ b/tools/testing/selftests/mm/split_huge_page_test.c @@ -84,6 +84,76 @@ static void write_debugfs(const char *fmt, ...) write_file(SPLIT_DEBUGFS, input, ret + 1); } +static char *allocate_zero_filled_hugepage(size_t len) +{ + char *result; + size_t i; + + result = memalign(pmd_pagesize, len); + if (!result) { + printf("Fail to allocate memory\n"); + exit(EXIT_FAILURE); + } + + madvise(result, len, MADV_HUGEPAGE); + + for (i = 0; i < len; i++) + result[i] = (char)0; + + return result; +} + +static void verify_rss_anon_split_huge_page_all_zeroes(char *one_page, int nr_hpages, size_t len) +{ + unsigned long rss_anon_before, rss_anon_after; + size_t i; + + if (!check_huge_anon(one_page, 4, pmd_pagesize)) { + printf("No THP is allocated\n"); + exit(EXIT_FAILURE); + } + + rss_anon_before = rss_anon(); + if (!rss_anon_before) { + printf("No RssAnon is allocated before split\n"); + exit(EXIT_FAILURE); + } + + /* split all THPs */ + write_debugfs(PID_FMT, getpid(), (uint64_t)one_page, + (uint64_t)one_page + len, 0); + + for (i = 0; i < len; i++) + if (one_page[i] != (char)0) { + printf("%ld byte corrupted\n", i); + exit(EXIT_FAILURE); + } + + if (!check_huge_anon(one_page, 0, pmd_pagesize)) { + printf("Still AnonHugePages not split\n"); + exit(EXIT_FAILURE); + } + + rss_anon_after = rss_anon(); + if (rss_anon_after >= rss_anon_before) { + printf("Incorrect RssAnon value. 
Before: %ld After: %ld\n", + rss_anon_before, rss_anon_after); + exit(EXIT_FAILURE); + } +} + +void split_pmd_zero_pages(void) +{ + char *one_page; + int nr_hpages = 4; + size_t len = nr_hpages * pmd_pagesize; + + one_page = allocate_zero_filled_hugepage(len); + verify_rss_anon_split_huge_page_all_zeroes(one_page, nr_hpages, len); + printf("Split zero filled huge pages successful\n"); + free(one_page); +} + void split_pmd_thp(void) { char *one_page; @@ -431,6 +501,7 @@ int main(int argc, char **argv) fd_size = 2 * pmd_pagesize; + split_pmd_zero_pages(); split_pmd_thp(); split_pte_mapped_thp(); split_file_backed_thp(); diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index 5a62530da3b5..d8d0cf04bb57 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -12,6 +12,7 @@ #define PMD_SIZE_FILE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size" #define SMAP_FILE_PATH "/proc/self/smaps" +#define STATUS_FILE_PATH "/proc/self/status" #define MAX_LINE_LENGTH 500 unsigned int __page_size; @@ -171,6 +172,27 @@ uint64_t read_pmd_pagesize(void) return strtoul(buf, NULL, 10); } +unsigned long rss_anon(void) +{ + unsigned long rss_anon = 0; + FILE *fp; + char buffer[MAX_LINE_LENGTH]; + + fp = fopen(STATUS_FILE_PATH, "r"); + if (!fp) + ksft_exit_fail_msg("%s: Failed to open file %s\n", __func__, STATUS_FILE_PATH); + + if (!check_for_pattern(fp, "RssAnon:", buffer, sizeof(buffer))) + goto err_out; + + if (sscanf(buffer, "RssAnon:%10lu kB", &rss_anon) != 1) + ksft_exit_fail_msg("Reading status error\n"); + +err_out: + fclose(fp); + return rss_anon; +} + bool __check_huge(void *addr, char *pattern, int nr_hpages, uint64_t hpage_size) { diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index 9007c420d52c..2eaed8209925 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -39,6 +39,7 @@ unsigned long pagemap_get_pfn(int fd, char *start); void clear_softdirty(void); bool check_for_pattern(FILE *fp, const char *pattern, char *buf, size_t len); uint64_t read_pmd_pagesize(void); +unsigned long rss_anon(void); bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size); bool check_huge_file(void *addr, int nr_hpages, uint64_t hpage_size); bool check_huge_shmem(void *addr, int nr_hpages, uint64_t hpage_size); From 8422acdc97ed5839692b45f800dbfb78abe65a94 Mon Sep 17 00:00:00 2001 From: Usama Arif Date: Fri, 30 Aug 2024 11:03:38 +0100 Subject: [PATCH 322/416] mm: introduce a pageflag for partially mapped folios Currently folio->_deferred_list is used to keep track of partially_mapped folios that are going to be split under memory pressure. In the next patch, all THPs that are faulted in and collapsed by khugepaged are also going to be tracked using _deferred_list. This patch introduces a pageflag to be able to distinguish between partially mapped folios and others in the deferred_list at split time in deferred_split_scan. Its needed as __folio_remove_rmap decrements _mapcount, _large_mapcount and _entire_mapcount, hence it won't be possible to distinguish between partially mapped folios and others in deferred_split_scan. Eventhough it introduces an extra flag to track if the folio is partially mapped, there is no functional change intended with this patch and the flag is not useful in this patch itself, it will become useful in the next patch when _deferred_list has non partially mapped folios. 
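For illustration only, the distinction this flag enables at shrink time
(the consumer is added by the next patch in this series) looks roughly
like:

	/* sketch of deferred_split_scan() once the flag exists */
	if (folio_test_partially_mapped(folio))
		try_split = true;		/* partially unmapped THP */
	else
		try_split = thp_underused(folio); /* fully mapped, maybe underused */

whereas the folio's _mapcount/_large_mapcount/_entire_mapcount have
already been decremented by __folio_remove_rmap() at this point and
cannot be used to tell the two cases apart.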
Link: https://lkml.kernel.org/r/20240830100438.3623486-5-usamaarif642@gmail.com Signed-off-by: Usama Arif Cc: Alexander Zhu Cc: Barry Song Cc: David Hildenbrand Cc: Domenico Cerasuolo Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Kairui Song Cc: Matthew Wilcox Cc: Mike Rapoport Cc: Nico Pache Cc: Rik van Riel Cc: Roman Gushchin Cc: Ryan Roberts Cc: Shakeel Butt Cc: Shuang Zhai Cc: Yu Zhao Cc: Shuang Zhai Cc: Hugh Dickins Signed-off-by: Andrew Morton --- include/linux/huge_mm.h | 4 ++-- include/linux/page-flags.h | 13 +++++++++++- mm/huge_memory.c | 41 ++++++++++++++++++++++++++++---------- mm/memcontrol.c | 3 ++- mm/migrate.c | 3 ++- mm/page_alloc.c | 5 +++-- mm/rmap.c | 5 +++-- mm/vmscan.c | 3 ++- 8 files changed, 56 insertions(+), 21 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 4902e2f7e896..e3896ebf0f94 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -333,7 +333,7 @@ static inline int split_huge_page(struct page *page) { return split_huge_page_to_list_to_order(page, NULL, 0); } -void deferred_split_folio(struct folio *folio); +void deferred_split_folio(struct folio *folio, bool partially_mapped); void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long address, bool freeze, struct folio *folio); @@ -507,7 +507,7 @@ static inline int split_huge_page(struct page *page) { return 0; } -static inline void deferred_split_folio(struct folio *folio) {} +static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {} #define split_huge_pmd(__vma, __pmd, __address) \ do { } while (0) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 2175ebceb41c..1b3a76710487 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -186,6 +186,7 @@ enum pageflags { /* At least one page in this folio has the hwpoison flag set */ PG_has_hwpoisoned = PG_active, PG_large_rmappable = PG_workingset, /* anon or file-backed */ + PG_partially_mapped = PG_reclaim, /* was identified to be partially mapped */ }; #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1) @@ -859,8 +860,18 @@ static inline void ClearPageCompound(struct page *page) ClearPageHead(page); } FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE) +FOLIO_TEST_FLAG(partially_mapped, FOLIO_SECOND_PAGE) +/* + * PG_partially_mapped is protected by deferred_split split_queue_lock, + * so its safe to use non-atomic set/clear. 
+ */ +__FOLIO_SET_FLAG(partially_mapped, FOLIO_SECOND_PAGE) +__FOLIO_CLEAR_FLAG(partially_mapped, FOLIO_SECOND_PAGE) #else FOLIO_FLAG_FALSE(large_rmappable) +FOLIO_TEST_FLAG_FALSE(partially_mapped) +__FOLIO_SET_FLAG_NOOP(partially_mapped) +__FOLIO_CLEAR_FLAG_NOOP(partially_mapped) #endif #define PG_head_mask ((1UL << PG_head)) @@ -1171,7 +1182,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page) */ #define PAGE_FLAGS_SECOND \ (0xffUL /* order */ | 1UL << PG_has_hwpoisoned | \ - 1UL << PG_large_rmappable) + 1UL << PG_large_rmappable | 1UL << PG_partially_mapped) #define PAGE_FLAGS_PRIVATE \ (1UL << PG_private | 1UL << PG_private_2) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6b69309e002f..9bc435bef2e6 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3459,7 +3459,11 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, if (folio_order(folio) > 1 && !list_empty(&folio->_deferred_list)) { ds_queue->split_queue_len--; - mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); + if (folio_test_partially_mapped(folio)) { + __folio_clear_partially_mapped(folio); + mod_mthp_stat(folio_order(folio), + MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); + } /* * Reinitialize page_deferred_list after removing the * page from the split_queue, otherwise a subsequent @@ -3526,13 +3530,18 @@ void __folio_undo_large_rmappable(struct folio *folio) spin_lock_irqsave(&ds_queue->split_queue_lock, flags); if (!list_empty(&folio->_deferred_list)) { ds_queue->split_queue_len--; - mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); + if (folio_test_partially_mapped(folio)) { + __folio_clear_partially_mapped(folio); + mod_mthp_stat(folio_order(folio), + MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); + } list_del_init(&folio->_deferred_list); } spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); } -void deferred_split_folio(struct folio *folio) +/* partially_mapped=false won't clear PG_partially_mapped folio flag */ +void deferred_split_folio(struct folio *folio, bool partially_mapped) { struct deferred_split *ds_queue = get_deferred_split_queue(folio); #ifdef CONFIG_MEMCG @@ -3560,15 +3569,21 @@ void deferred_split_folio(struct folio *folio) if (folio_test_swapcache(folio)) return; - if (!list_empty(&folio->_deferred_list)) - return; - spin_lock_irqsave(&ds_queue->split_queue_lock, flags); + if (partially_mapped) { + if (!folio_test_partially_mapped(folio)) { + __folio_set_partially_mapped(folio); + if (folio_test_pmd_mappable(folio)) + count_vm_event(THP_DEFERRED_SPLIT_PAGE); + count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED); + mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, 1); + + } + } else { + /* partially mapped folios cannot become non-partially mapped */ + VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio); + } if (list_empty(&folio->_deferred_list)) { - if (folio_test_pmd_mappable(folio)) - count_vm_event(THP_DEFERRED_SPLIT_PAGE); - count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED); - mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, 1); list_add_tail(&folio->_deferred_list, &ds_queue->split_queue); ds_queue->split_queue_len++; #ifdef CONFIG_MEMCG @@ -3616,7 +3631,11 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, list_move(&folio->_deferred_list, &list); } else { /* We lost race with folio_put() */ - mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); + if 
(folio_test_partially_mapped(folio)) { + __folio_clear_partially_mapped(folio); + mod_mthp_stat(folio_order(folio), + MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); + } list_del_init(&folio->_deferred_list); ds_queue->split_queue_len--; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0370ce03ce4c..5abff3b0cae5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4645,7 +4645,8 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug) VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); VM_BUG_ON_FOLIO(folio_order(folio) > 1 && !folio_test_hugetlb(folio) && - !list_empty(&folio->_deferred_list), folio); + !list_empty(&folio->_deferred_list) && + folio_test_partially_mapped(folio), folio); /* * Nobody should be changing or seriously looking at diff --git a/mm/migrate.c b/mm/migrate.c index d039863e014b..35cc9d35064b 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1766,7 +1766,8 @@ static int migrate_pages_batch(struct list_head *from, * use _deferred_list. */ if (nr_pages > 2 && - !list_empty(&folio->_deferred_list)) { + !list_empty(&folio->_deferred_list) && + folio_test_partially_mapped(folio)) { if (!try_split_folio(folio, split_folios, mode)) { nr_failed++; stats->nr_thp_failed += is_thp; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 23509b823e2b..ea4c44c3cfc0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -962,8 +962,9 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page) break; case 2: /* the second tail page: deferred_list overlaps ->mapping */ - if (unlikely(!list_empty(&folio->_deferred_list))) { - bad_page(page, "on deferred list"); + if (unlikely(!list_empty(&folio->_deferred_list) && + folio_test_partially_mapped(folio))) { + bad_page(page, "partially mapped folio on deferred list"); goto out; } break; diff --git a/mm/rmap.c b/mm/rmap.c index 78529cf0fd66..a8797d1b3d49 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1579,8 +1579,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, * Check partially_mapped first to ensure it is a large folio. */ if (partially_mapped && folio_test_anon(folio) && - list_empty(&folio->_deferred_list)) - deferred_split_folio(folio); + !folio_test_partially_mapped(folio)) + deferred_split_folio(folio, true); + __folio_mod_stat(folio, -nr, -nr_pmdmapped); /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 1b1fad0c1e11..8163b24bec75 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1238,7 +1238,8 @@ retry: * Split partially mapped folios right away. * We can free the unmapped pages without IO. */ - if (data_race(!list_empty(&folio->_deferred_list)) && + if (data_race(!list_empty(&folio->_deferred_list) && + folio_test_partially_mapped(folio)) && split_folio_to_list(folio, folio_list)) goto activate_locked; } From dafff3f4c850c98e15501bc6ee20581f9a013dcc Mon Sep 17 00:00:00 2001 From: Usama Arif Date: Fri, 30 Aug 2024 11:03:39 +0100 Subject: [PATCH 323/416] mm: split underused THPs This is an attempt to mitigate the issue of running out of memory when THP is always enabled. During runtime whenever a THP is being faulted in (__do_huge_pmd_anonymous_page) or collapsed by khugepaged (collapse_huge_page), the THP is added to _deferred_list. Whenever memory reclaim happens in linux, the kernel runs the deferred_split shrinker which goes through the _deferred_list. If the folio was partially mapped, the shrinker attempts to split it. If the folio is not partially mapped, the shrinker checks if the THP was underused, i.e. how many of the base 4K pages of the entire THP were zero-filled. 
If this number goes above a certain threshold (decided by /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none), the shrinker will attempt to split that THP. Then at remap time, the pages that were zero-filled are mapped to the shared zeropage, hence saving memory. Link: https://lkml.kernel.org/r/20240830100438.3623486-6-usamaarif642@gmail.com Signed-off-by: Usama Arif Suggested-by: Rik van Riel Co-authored-by: Johannes Weiner Cc: Alexander Zhu Cc: Barry Song Cc: David Hildenbrand Cc: Domenico Cerasuolo Cc: Jonathan Corbet Cc: Kairui Song Cc: Matthew Wilcox Cc: Mike Rapoport Cc: Nico Pache Cc: Roman Gushchin Cc: Ryan Roberts Cc: Shakeel Butt Cc: Shuang Zhai Cc: Yu Zhao Cc: Shuang Zhai Cc: Hugh Dickins Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/transhuge.rst | 6 +++ include/linux/khugepaged.h | 1 + include/linux/vm_event_item.h | 1 + mm/huge_memory.c | 60 +++++++++++++++++++++- mm/khugepaged.c | 3 +- mm/vmstat.c | 1 + 6 files changed, 69 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 56a086900651..aca0cff852b8 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -471,6 +471,12 @@ thp_deferred_split_page splitting it would free up some memory. Pages on split queue are going to be split under memory pressure. +thp_underused_split_page + is incremented when a huge page on the split queue was split + because it was underused. A THP is underused if the number of + zero pages in the THP is above a certain threshold + (/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none). + thp_split_pmd is incremented every time a PMD split into table of PTEs. This can happen, for instance, when application calls mprotect() or diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h index f68865e19b0b..30baae91b225 100644 --- a/include/linux/khugepaged.h +++ b/include/linux/khugepaged.h @@ -4,6 +4,7 @@ #include /* MMF_VM_HUGEPAGE */ +extern unsigned int khugepaged_max_ptes_none __read_mostly; #ifdef CONFIG_TRANSPARENT_HUGEPAGE extern struct attribute_group khugepaged_attr_group; diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index aae5c7c5cfb4..aed952d04132 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -105,6 +105,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, THP_SPLIT_PAGE, THP_SPLIT_PAGE_FAILED, THP_DEFERRED_SPLIT_PAGE, + THP_UNDERUSED_SPLIT_PAGE, THP_SPLIT_PMD, THP_SCAN_EXCEED_NONE_PTE, THP_SCAN_EXCEED_SWAP_PTE, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 9bc435bef2e6..cec5bce046a0 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1187,6 +1187,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR); mm_inc_nr_ptes(vma->vm_mm); + deferred_split_folio(folio, false); spin_unlock(vmf->ptl); count_vm_event(THP_FAULT_ALLOC); count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC); @@ -3608,6 +3609,39 @@ static unsigned long deferred_split_count(struct shrinker *shrink, return READ_ONCE(ds_queue->split_queue_len); } +static bool thp_underused(struct folio *folio) +{ + int num_zero_pages = 0, num_filled_pages = 0; + void *kaddr; + int i; + + if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1) + return false; + + for (i = 0; i < folio_nr_pages(folio); i++) { + kaddr = kmap_local_folio(folio, i * PAGE_SIZE); + 
if (!memchr_inv(kaddr, 0, PAGE_SIZE)) { + num_zero_pages++; + if (num_zero_pages > khugepaged_max_ptes_none) { + kunmap_local(kaddr); + return true; + } + } else { + /* + * Another path for early exit once the number + * of non-zero filled pages exceeds threshold. + */ + num_filled_pages++; + if (num_filled_pages >= HPAGE_PMD_NR - khugepaged_max_ptes_none) { + kunmap_local(kaddr); + return false; + } + } + kunmap_local(kaddr); + } + return false; +} + static unsigned long deferred_split_scan(struct shrinker *shrink, struct shrink_control *sc) { @@ -3645,13 +3679,35 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); list_for_each_entry_safe(folio, next, &list, _deferred_list) { + bool did_split = false; + bool underused = false; + + if (!folio_test_partially_mapped(folio)) { + underused = thp_underused(folio); + if (!underused) + goto next; + } if (!folio_trylock(folio)) goto next; - /* split_huge_page() removes page from list on success */ - if (!split_folio(folio)) + if (!split_folio(folio)) { + did_split = true; + if (underused) + count_vm_event(THP_UNDERUSED_SPLIT_PAGE); split++; + } folio_unlock(folio); next: + /* + * split_folio() removes folio from list on success. + * Only add back to the queue if folio is partially mapped. + * If thp_underused returns false, or if split_folio fails + * in the case it was underused, then consider it used and + * don't add it back to split_queue. + */ + if (!did_split && !folio_test_partially_mapped(folio)) { + list_del_init(&folio->_deferred_list); + ds_queue->split_queue_len--; + } folio_put(folio); } diff --git a/mm/khugepaged.c b/mm/khugepaged.c index ab646018ce25..32100041aef3 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -85,7 +85,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait); * * Note that these are only respected if collapse was initiated by khugepaged. */ -static unsigned int khugepaged_max_ptes_none __read_mostly; +unsigned int khugepaged_max_ptes_none __read_mostly; static unsigned int khugepaged_max_ptes_swap __read_mostly; static unsigned int khugepaged_max_ptes_shared __read_mostly; @@ -1237,6 +1237,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, pgtable_trans_huge_deposit(mm, pmd, pgtable); set_pmd_at(mm, address, pmd, _pmd); update_mmu_cache_pmd(vma, address, pmd); + deferred_split_folio(folio, false); spin_unlock(pmd_ptl); folio = NULL; diff --git a/mm/vmstat.c b/mm/vmstat.c index aea58e9fce60..b5a4cea423e1 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1385,6 +1385,7 @@ const char * const vmstat_text[] = { "thp_split_page", "thp_split_page_failed", "thp_deferred_split_page", + "thp_underused_split_page", "thp_split_pmd", "thp_scan_exceed_none_pte", "thp_scan_exceed_swap_pte", From 81d3ff3c6f76306b22138d69e980634f53bf17fa Mon Sep 17 00:00:00 2001 From: Usama Arif Date: Fri, 30 Aug 2024 11:03:40 +0100 Subject: [PATCH 324/416] mm: add sysfs entry to disable splitting underused THPs If disabled, THPs faulted in or collapsed will not be added to _deferred_list, and therefore won't be considered for splitting under memory pressure if underused. 
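Example usage (the sysfs file is added by this patch; the
thp_underused_split_page counter comes from the earlier patch in this
series):

	# do not queue THPs at fault/collapse time, i.e. never split them
	# just because they are underused
	echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused

	# default behaviour: queue them and let the shrinker split underused
	# THPs under memory pressure
	echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
	grep thp_underused_split_page /proc/vmstat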
Link: https://lkml.kernel.org/r/20240830100438.3623486-7-usamaarif642@gmail.com Signed-off-by: Usama Arif Cc: Alexander Zhu Cc: Barry Song Cc: David Hildenbrand Cc: Domenico Cerasuolo Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Kairui Song Cc: Matthew Wilcox Cc: Mike Rapoport Cc: Nico Pache Cc: Rik van Riel Cc: Roman Gushchin Cc: Ryan Roberts Cc: Shakeel Butt Cc: Shuang Zhai Cc: Shuang Zhai Cc: Yu Zhao Cc: Hugh Dickins Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/transhuge.rst | 10 +++++++++ mm/huge_memory.c | 26 ++++++++++++++++++++++ 2 files changed, 36 insertions(+) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index aca0cff852b8..cfdd16a52e39 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -202,6 +202,16 @@ PMD-mappable transparent hugepage:: cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size +All THPs at fault and collapse time will be added to _deferred_list, +and will therefore be split under memory presure if they are considered +"underused". A THP is underused if the number of zero-filled pages in +the THP is above max_ptes_none (see below). It is possible to disable +this behaviour by writing 0 to shrink_underused, and enable it by writing +1 to it:: + + echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused + echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused + khugepaged will be automatically started when PMD-sized THP is enabled (either of the per-size anon control or the top-level control are set to "always" or "madvise"), and it'll be automatically shutdown when diff --git a/mm/huge_memory.c b/mm/huge_memory.c index cec5bce046a0..691702e39f85 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -74,6 +74,7 @@ static unsigned long deferred_split_count(struct shrinker *shrink, struct shrink_control *sc); static unsigned long deferred_split_scan(struct shrinker *shrink, struct shrink_control *sc); +static bool split_underused_thp = true; static atomic_t huge_zero_refcount; struct folio *huge_zero_folio __read_mostly; @@ -440,6 +441,27 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj, static struct kobj_attribute hpage_pmd_size_attr = __ATTR_RO(hpage_pmd_size); +static ssize_t split_underused_thp_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", split_underused_thp); +} + +static ssize_t split_underused_thp_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int err = kstrtobool(buf, &split_underused_thp); + + if (err < 0) + return err; + + return count; +} + +static struct kobj_attribute split_underused_thp_attr = __ATTR( + shrink_underused, 0644, split_underused_thp_show, split_underused_thp_store); + static struct attribute *hugepage_attr[] = { &enabled_attr.attr, &defrag_attr.attr, @@ -448,6 +470,7 @@ static struct attribute *hugepage_attr[] = { #ifdef CONFIG_SHMEM &shmem_enabled_attr.attr, #endif + &split_underused_thp_attr.attr, NULL, }; @@ -3557,6 +3580,9 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped) if (folio_order(folio) <= 1) return; + if (!partially_mapped && !split_underused_thp) + return; + /* * The try_to_unmap() in page reclaim path might reach here too, * this may cause a race condition to corrupt deferred split queue. 
From 7ae12a57c56e0b87e6e698479c1f8b65434a608f Mon Sep 17 00:00:00 2001 From: Hongbo Li Date: Wed, 28 Aug 2024 12:12:16 +0800 Subject: [PATCH 325/416] mm/vmalloc.c: make use of the helper macro LIST_HEAD() list_head can be initialized automatically with LIST_HEAD() instead of calling INIT_LIST_HEAD(). Here we can simplify the code. Link: https://lkml.kernel.org/r/20240828041216.1222582-1-lihongbo22@huawei.com Signed-off-by: Hongbo Li Reviewed-by: Uladzislau Rezki (Sony) Cc: Christoph Hellwig Signed-off-by: Andrew Morton --- mm/vmalloc.c | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 264a2627bea4..efba551d81aa 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2131,23 +2131,18 @@ reclaim_list_global(struct list_head *head) static void decay_va_pool_node(struct vmap_node *vn, bool full_decay) { + LIST_HEAD(decay_list); + struct rb_root decay_root = RB_ROOT; struct vmap_area *va, *nva; - struct list_head decay_list; - struct rb_root decay_root; unsigned long n_decay; int i; - decay_root = RB_ROOT; - INIT_LIST_HEAD(&decay_list); - for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { - struct list_head tmp_list; + LIST_HEAD(tmp_list); if (list_empty(&vn->pool[i].head)) continue; - INIT_LIST_HEAD(&tmp_list); - /* Detach the pool, so no-one can access it. */ spin_lock(&vn->pool_lock); list_replace_init(&vn->pool[i].head, &tmp_list); From 536ab838a5b37b6ae3f8d53552560b7c51daeb41 Mon Sep 17 00:00:00 2001 From: Dev Jain Date: Fri, 30 Aug 2024 10:46:09 +0530 Subject: [PATCH 326/416] selftests/mm: relax test to fail after 100 migration failures It was recently observed at [1] that during the folio unmapping stage of migration, when the PTEs are cleared, a racing thread faulting on that folio may increase the refcount of the folio, sleep on the folio lock (the migration path has the lock), and migration ultimately fails when asserting the actual refcount against the expected. Thereby, the migration selftest fails on shared-anon mappings. The above enforces the fact that migration is a best-effort service, therefore, it is wrong to fail the test for just a single failure; hence, fail the test after 100 consecutive failures (where 100 is still a subjective choice). Note that, this has no effect on the execution time of the test since that is controlled by a timeout. [1] https://lore.kernel.org/all/20240801081657.1386743-1-dev.jain@arm.com/ Link: https://lkml.kernel.org/r/20240830051609.4037834-1-dev.jain@arm.com Signed-off-by: Dev Jain Suggested-by: David Hildenbrand Reviewed-by: Ryan Roberts Tested-by: Ryan Roberts Cc: Alistair Popple Cc: Aneesh Kumar K.V Cc: Anshuman Khandual Cc: Baolin Wang Cc: Barry Song Cc: Catalin Marinas Cc: Christoph Lameter Cc: Dave Hansen Cc: Gavin Shan Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kirill A. 
Shutemov Cc: Lance Yang Cc: Mark Brown Cc: Mark Rutland Cc: Matthew Wilcox Cc: Mel Gorman Cc: Michal Hocko Cc: Oscar Salvador Cc: Shuah Khan Cc: Vlastimil Babka Cc: Will Deacon Cc: Yang Shi Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/migration.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftests/mm/migration.c index 6908569ef406..64bcbb7151cf 100644 --- a/tools/testing/selftests/mm/migration.c +++ b/tools/testing/selftests/mm/migration.c @@ -15,10 +15,10 @@ #include #include -#define TWOMEG (2<<20) -#define RUNTIME (20) - -#define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1))) +#define TWOMEG (2<<20) +#define RUNTIME (20) +#define MAX_RETRIES 100 +#define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1))) FIXTURE(migration) { @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2) int ret, tmp; int status = 0; struct timespec ts1, ts2; + int failures = 0; if (clock_gettime(CLOCK_MONOTONIC, &ts1)) return -1; @@ -79,13 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2) ret = move_pages(0, 1, (void **) &ptr, &n2, &status, MPOL_MF_MOVE_ALL); if (ret) { - if (ret > 0) + if (ret > 0) { + /* Migration is best effort; try again */ + if (++failures < MAX_RETRIES) + continue; printf("Didn't migrate %d pages\n", ret); + } else perror("Couldn't migrate pages"); return -2; } - + failures = 0; tmp = n2; n2 = n1; n1 = tmp; From 94deaf69dcd33462c61fa8cabb0883e3085a1046 Mon Sep 17 00:00:00 2001 From: Huan Yang Date: Mon, 26 Aug 2024 14:40:48 +0800 Subject: [PATCH 327/416] mm: page_alloc: simpify page del and expand When page del from buddy and need expand, it will account free_pages in zone's migratetype. The current way is to subtract the page number of the current order when deleting, and then add it back when expanding. This is unnecessary, as when migrating the same type, we can directly record the difference between the high-order pages and the expand added, and then subtract it directly. This patch merge that, only when del and expand done, then account free_pages. 
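A worked example with illustrative numbers: stealing an order-3 block
(8 pages) to satisfy an order-1 request (2 pages). Before this patch,
del_page_from_free_list() accounted -8 and expand() then accounted back
the 4 + 2 = 6 pages it returned to the free lists. With this patch,
expand() only reports nr_added = 6 and page_del_and_expand() issues a
single account_freepages(zone, -(8 - 6), ...) update, i.e. -2, which is
exactly the 1 << order pages handed to the caller.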
Link: https://lkml.kernel.org/r/20240826064048.187790-1-link@vivo.com Signed-off-by: Huan Yang Reviewed-by: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/page_alloc.c | 35 +++++++++++++++++++++++++---------- 1 file changed, 25 insertions(+), 10 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ea4c44c3cfc0..ee377bf5c033 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1367,11 +1367,11 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn, * * -- nyc */ -static inline void expand(struct zone *zone, struct page *page, - int low, int high, int migratetype) +static inline unsigned int expand(struct zone *zone, struct page *page, int low, + int high, int migratetype) { - unsigned long size = 1 << high; - unsigned long nr_added = 0; + unsigned int size = 1 << high; + unsigned int nr_added = 0; while (high > low) { high--; @@ -1391,7 +1391,19 @@ static inline void expand(struct zone *zone, struct page *page, set_buddy_order(&page[size], high); nr_added += size; } - account_freepages(zone, nr_added, migratetype); + + return nr_added; +} + +static __always_inline void page_del_and_expand(struct zone *zone, + struct page *page, int low, + int high, int migratetype) +{ + int nr_pages = 1 << high; + + __del_page_from_free_list(page, zone, high, migratetype); + nr_pages -= expand(zone, page, low, high, migratetype); + account_freepages(zone, -nr_pages, migratetype); } static void check_new_page_bad(struct page *page) @@ -1561,8 +1573,9 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, page = get_page_from_free_area(area, migratetype); if (!page) continue; - del_page_from_free_list(page, zone, current_order, migratetype); - expand(zone, page, order, current_order, migratetype); + + page_del_and_expand(zone, page, order, current_order, + migratetype); trace_mm_page_alloc_zone_locked(page, order, migratetype, pcp_allowed_order(order) && migratetype < MIGRATE_PCPTYPES); @@ -1892,9 +1905,12 @@ steal_suitable_fallback(struct zone *zone, struct page *page, /* Take ownership for orders >= pageblock_order */ if (current_order >= pageblock_order) { + unsigned int nr_added; + del_page_from_free_list(page, zone, current_order, block_type); change_pageblock_range(page, current_order, start_type); - expand(zone, page, order, current_order, start_type); + nr_added = expand(zone, page, order, current_order, start_type); + account_freepages(zone, nr_added, start_type); return page; } @@ -1947,8 +1963,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page, } single_page: - del_page_from_free_list(page, zone, current_order, block_type); - expand(zone, page, order, current_order, block_type); + page_del_and_expand(zone, page, order, current_order, block_type); return page; } From 96ae4c9019c5651ded92e97828e341529d5863bc Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 30 Aug 2024 22:04:00 +0000 Subject: [PATCH 328/416] maple_tree: cleanup function descriptions This patch tries to cleanup some function description: * function name mismatch * parameter name mismatch * parameter all end up with ':' * not prefix '*' if parameter is a pointer There is still some missing description of parameters, I didn't add them since I am not sure the exact meaning. Link: https://lkml.kernel.org/r/20240830220400.2007-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Reviewed-by: Liam R. 
Howlett Signed-off-by: Andrew Morton --- lib/maple_tree.c | 105 +++++++++++++++++++++-------------------------- 1 file changed, 47 insertions(+), 58 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 8a3c2c344f62..2058452e5ff3 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -474,6 +474,7 @@ enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode) /* * mas_set_parent() - Set the parent node and encode the slot + * @mas: The maple state * @enode: The encoded maple node. * @parent: The encoded maple node that is the parent of @enode. * @slot: The slot that @enode resides in @parent. @@ -534,7 +535,7 @@ unsigned int mte_parent_slot(const struct maple_enode *enode) /* * mte_parent() - Get the parent of @node. - * @node: The encoded maple node. + * @enode: The encoded maple node. * * Return: The parent maple node. */ @@ -641,8 +642,8 @@ static inline unsigned int mas_alloc_req(const struct ma_state *mas) /* * ma_pivots() - Get a pointer to the maple node pivots. - * @node - the maple node - * @type - the node type + * @node: the maple node + * @type: the node type * * In the event of a dead node, this array may be %NULL * @@ -665,8 +666,8 @@ static inline unsigned long *ma_pivots(struct maple_node *node, /* * ma_gaps() - Get a pointer to the maple node gaps. - * @node - the maple node - * @type - the node type + * @node: the maple node + * @type: the node type * * Return: A pointer to the maple node gaps */ @@ -880,8 +881,6 @@ static inline void ma_set_meta(struct maple_node *mn, enum maple_type mt, * @mt: The maple tree * @mn: The maple node * @type: The maple node type - * @offset: The offset of the highest sub-gap in this node. - * @end: The end of the data in this node. */ static inline void mt_clear_meta(struct maple_tree *mt, struct maple_node *mn, enum maple_type type) @@ -939,7 +938,7 @@ static inline unsigned char ma_meta_gap(struct maple_node *mn) /* * ma_set_meta_gap() - Set the largest gap location in a nodes metadata * @mn: The maple node - * @mn: The maple node type + * @mt: The maple node type * @offset: The location of the largest gap. */ static inline void ma_set_meta_gap(struct maple_node *mn, enum maple_type mt, @@ -953,8 +952,8 @@ static inline void ma_set_meta_gap(struct maple_node *mn, enum maple_type mt, /* * mat_add() - Add a @dead_enode to the ma_topiary of a list of dead nodes. - * @mat - the ma_topiary, a linked list of dead nodes. - * @dead_enode - the node to be marked as dead and added to the tail of the list + * @mat: the ma_topiary, a linked list of dead nodes. + * @dead_enode: the node to be marked as dead and added to the tail of the list * * Add the @dead_enode to the linked list in @mat. */ @@ -977,8 +976,8 @@ static void mt_destroy_walk(struct maple_enode *enode, struct maple_tree *mt, bool free); /* * mas_mat_destroy() - Free all nodes and subtrees in a dead list. - * @mas - the maple state - * @mat - the ma_topiary linked list of dead nodes to free. + * @mas: the maple state + * @mat: the ma_topiary linked list of dead nodes to free. * * Destroy walk a dead list. */ @@ -999,7 +998,7 @@ static void mas_mat_destroy(struct ma_state *mas, struct ma_topiary *mat) } /* * mas_descend() - Descend into the slot stored in the ma_state. - * @mas - the maple state. + * @mas: the maple state. * * Note: Not RCU safe, only use in write side or debug code. 
*/ @@ -1462,7 +1461,7 @@ static inline unsigned char mas_data_end(struct ma_state *mas) /* * mas_leaf_max_gap() - Returns the largest gap in a leaf node - * @mas - the maple state + * @mas: the maple state * * Return: The maximum gap in the leaf. */ @@ -1544,7 +1543,7 @@ static unsigned long mas_leaf_max_gap(struct ma_state *mas) * @node: The maple node * @gaps: The pointer to the gaps * @mt: The maple node type - * @*off: Pointer to store the offset location of the gap. + * @off: Pointer to store the offset location of the gap. * * Uses the metadata data end to scan backwards across set gaps. * @@ -1651,7 +1650,7 @@ ascend: /* * mas_update_gap() - Update a nodes gaps and propagate up if necessary. - * @mas - the maple state. + * @mas: the maple state. */ static inline void mas_update_gap(struct ma_state *mas) { @@ -1678,8 +1677,8 @@ static inline void mas_update_gap(struct ma_state *mas) /* * mas_adopt_children() - Set the parent pointer of all nodes in @parent to * @parent with the slot encoded. - * @mas - the maple state (for the tree) - * @parent - the maple encoded node containing the children. + * @mas: the maple state (for the tree) + * @parent: the maple encoded node containing the children. */ static inline void mas_adopt_children(struct ma_state *mas, struct maple_enode *parent) @@ -1701,8 +1700,8 @@ static inline void mas_adopt_children(struct ma_state *mas, /* * mas_put_in_tree() - Put a new node in the tree, smp_wmb(), and mark the old * node as dead. - * @mas - the maple state with the new node - * @old_enode - The old maple encoded node to replace. + * @mas: the maple state with the new node + * @old_enode: The old maple encoded node to replace. */ static inline void mas_put_in_tree(struct ma_state *mas, struct maple_enode *old_enode) @@ -1730,8 +1729,8 @@ static inline void mas_put_in_tree(struct ma_state *mas, * mas_replace_node() - Replace a node by putting it in the tree, marking it * dead, and freeing it. * the parent encoding to locate the maple node in the tree. - * @mas - the ma_state with @mas->node pointing to the new node. - * @old_enode - The old maple encoded node. + * @mas: the ma_state with @mas->node pointing to the new node. + * @old_enode: The old maple encoded node. */ static inline void mas_replace_node(struct ma_state *mas, struct maple_enode *old_enode) @@ -1796,7 +1795,6 @@ static inline void mab_shift_right(struct maple_big_node *b_node, /* * mab_middle_node() - Check if a middle node is needed (unlikely) * @b_node: the maple_big_node that contains the data. - * @size: the amount of data in the b_node * @split: the potential split location * @slot_count: the size that can be stored in a single node being considered. * @@ -1844,6 +1842,7 @@ static inline int mab_no_null_split(struct maple_big_node *b_node, /* * mab_calc_split() - Calculate the split location and if there needs to be two * splits. + * @mas: The maple state * @bn: The maple_big_node with the data * @mid_split: The second split, if required. 0 otherwise. * @@ -2177,7 +2176,8 @@ static inline bool mas_next_sibling(struct ma_state *mas) } /* - * mte_node_or_none() - Set the enode and state. + * mas_node_or_none() - Set the enode and state. + * @mas: the maple state * @enode: The encoded maple node. * * Set the node to the enode and the status. @@ -2228,7 +2228,6 @@ static inline void mas_wr_node_walk(struct ma_wr_state *wr_mas) /* * mast_rebalance_next() - Rebalance against the next node * @mast: The maple subtree state - * @old_r: The encoded maple node to the right (next node). 
*/ static inline void mast_rebalance_next(struct maple_subtree_state *mast) { @@ -2242,7 +2241,6 @@ static inline void mast_rebalance_next(struct maple_subtree_state *mast) /* * mast_rebalance_prev() - Rebalance against the previous node * @mast: The maple subtree state - * @old_l: The encoded maple node to the left (previous node) */ static inline void mast_rebalance_prev(struct maple_subtree_state *mast) { @@ -2393,9 +2391,9 @@ static inline unsigned char mas_mab_to_node(struct ma_state *mas, /* * mab_set_b_end() - Add entry to b_node at b_node->b_end and increment the end * pointer. - * @b_node - the big node to add the entry - * @mas - the maple state to get the pivot (mas->max) - * @entry - the entry to add, if NULL nothing happens. + * @b_node: the big node to add the entry + * @mas: the maple state to get the pivot (mas->max) + * @entry: the entry to add, if NULL nothing happens. */ static inline void mab_set_b_end(struct maple_big_node *b_node, struct ma_state *mas, @@ -2414,11 +2412,11 @@ static inline void mab_set_b_end(struct maple_big_node *b_node, * mas_set_split_parent() - combine_then_separate helper function. Sets the parent * of @mas->node to either @left or @right, depending on @slot and @split * - * @mas - the maple state with the node that needs a parent - * @left - possible parent 1 - * @right - possible parent 2 - * @slot - the slot the mas->node was placed - * @split - the split location between @left and @right + * @mas: the maple state with the node that needs a parent + * @left: possible parent 1 + * @right: possible parent 2 + * @slot: the slot the mas->node was placed + * @split: the split location between @left and @right */ static inline void mas_set_split_parent(struct ma_state *mas, struct maple_enode *left, @@ -2438,11 +2436,11 @@ static inline void mas_set_split_parent(struct ma_state *mas, /* * mte_mid_split_check() - Check if the next node passes the mid-split - * @**l: Pointer to left encoded maple node. - * @**m: Pointer to middle encoded maple node. - * @**r: Pointer to right encoded maple node. + * @l: Pointer to left encoded maple node. + * @m: Pointer to middle encoded maple node. + * @r: Pointer to right encoded maple node. * @slot: The offset - * @*split: The split location. + * @split: The split location. * @mid_split: The middle split. */ static inline void mte_mid_split_check(struct maple_enode **l, @@ -2466,10 +2464,10 @@ static inline void mte_mid_split_check(struct maple_enode **l, /* * mast_set_split_parents() - Helper function to set three nodes parents. Slot * is taken from @mast->l. - * @mast - the maple subtree state - * @left - the left node - * @right - the right node - * @split - the split location. + * @mast: the maple subtree state + * @left: the left node + * @right: the right node + * @split: the split location. */ static inline void mast_set_split_parents(struct maple_subtree_state *mast, struct maple_enode *left, @@ -2503,7 +2501,6 @@ static inline void mast_set_split_parents(struct maple_subtree_state *mast, /* * mas_topiary_node() - Dispose of a single node * @mas: The maple state for pushing nodes - * @enode: The encoded maple node * @in_rcu: If the tree is in rcu mode * * The node will either be RCU freed or pushed back on the maple state. @@ -2635,7 +2632,7 @@ static inline void mas_topiary_replace(struct ma_state *mas, /* * mas_wmb_replace() - Write memory barrier and replace * @mas: The maple state - * @old: The old maple encoded node that is being replaced. 
+ * @old_enode: The old maple encoded node that is being replaced. * * Updates gap as necessary. */ @@ -3455,10 +3452,7 @@ static inline void mas_store_root(struct ma_state *mas, void *entry) /* * mas_is_span_wr() - Check if the write needs to be treated as a write that * spans the node. - * @mas: The maple state - * @piv: The pivot value being written - * @type: The maple node type - * @entry: The data to write + * @wr_mas: The maple write state * * Spanning writes are writes that start in one node and end in another OR if * the write of a %NULL will cause the node to end with a %NULL. @@ -4052,10 +4046,7 @@ static void mas_wr_bnode(struct ma_wr_state *wr_mas) /* * mas_wr_store_entry() - Internal call to store a value - * @mas: The maple state - * @entry: The entry to store. - * - * Return: The contents that was stored at the index. + * @wr_mas: The maple write state */ static inline void mas_wr_store_entry(struct ma_wr_state *wr_mas) { @@ -4493,9 +4484,8 @@ no_entry: * mas_prev_slot() - Get the entry in the previous slot * * @mas: The maple state - * @max: The minimum starting range + * @min: The minimum starting range * @empty: Can be empty - * @set_underflow: Set the @mas->node to underflow state on limit. * * Return: The entry in the previous slot which is possibly NULL */ @@ -4578,6 +4568,7 @@ underflow: /* * mas_next_node() - Get the next node at the same level in the tree. * @mas: The maple state + * @node: The maple node * @max: The maximum pivot value to check. * * The next value will be mas->node[mas->offset] or the status will have @@ -4668,8 +4659,6 @@ overflow: * @mas: The maple state * @max: The maximum starting range * @empty: Can be empty - * @set_overflow: Should @mas->node be set to overflow when the limit is - * reached. * * Return: The entry in the next slot which is possibly NULL */ @@ -5203,9 +5192,9 @@ EXPORT_SYMBOL_GPL(mas_empty_area_rev); /* * mte_dead_leaves() - Mark all leaves of a node as dead. - * @mas: The maple state + * @enode: the encoded node + * @mt: the maple tree * @slots: Pointer to the slot array - * @type: The maple node type * * Must hold the write lock. * From 0a6fff20d36bc81156a49649f622b152330fed3c Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Sun, 1 Sep 2024 16:24:59 -0400 Subject: [PATCH 329/416] mm: fix folio_alloc_noprof() folio_alloc_noprof) wasn't calling the _noprof version, causing allocations to be accounted here instead of to the caller Link: https://lkml.kernel.org/r/20240901202459.4867-1-kent.overstreet@linux.dev Signed-off-by: Kent Overstreet Signed-off-by: Andrew Morton --- include/linux/gfp.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 03ba9563c6db..a951de920e20 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -319,7 +319,7 @@ static inline struct page *alloc_pages_mpol_noprof(gfp_t gfp, unsigned int order } static inline struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order) { - return __folio_alloc_node(gfp, order, numa_node_id()); + return __folio_alloc_node_noprof(gfp, order, numa_node_id()); } static inline struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order, struct mempolicy *mpol, pgoff_t ilx, int nid) From 6050df6d706f58b2240bfabece9f406170104d57 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 9 Aug 2024 02:01:15 +0000 Subject: [PATCH 330/416] maple_tree: fix comment typo on ma_flag of allocation tree The maple tree flag of allocation tree is MT_FLAGS_ALLOC_RANGE. Just correct it. 
Link: https://lkml.kernel.org/r/20240809020115.31575-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Reviewed-by: Liam R. Howlett Signed-off-by: Andrew Morton --- include/linux/maple_tree.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h index 8e1504a81cd2..c2c11004085e 100644 --- a/include/linux/maple_tree.h +++ b/include/linux/maple_tree.h @@ -52,9 +52,9 @@ * bit in the node type. This is possible by using bit 1 to indicate if bit 2 * is part of the type or the slot. * - * Once the type is decided, the decision of an allocation range type or a range - * type is done by examining the immutable tree flag for the MAPLE_ALLOC_RANGE - * flag. + * Once the type is decided, the decision of an allocation range type or a + * range type is done by examining the immutable tree flag for the + * MT_FLAGS_ALLOC_RANGE flag. * * Node types: * 0x??1 = Root From 4fc4187984e5dbd7ad63c3acf0b228fa9bf298d3 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:55:49 +0900 Subject: [PATCH 331/416] lib: zstd: export API needed for dictionary support Patch series "zram: introduce custom comp backends API", v7. This series introduces support for run-time compression algorithms tuning, so users, for instance, can adjust compression/acceleration levels and provide pre-trained compression/decompression dictionaries which certain algorithms support. At this point we stop supporting (old/deprecated) comp API. We may add new acomp API support in the future, but before that zram needs to undergo some major rework (we are not ready for async compression). Some benchmarks for reference (look at column #2) *** init zstd /sys/block/zram0/mm_stat 1750659072 504622188 514355200 0 514355200 1 0 34204 34204 *** init zstd dict=/home/ss/zstd-dict-amd64 /sys/block/zram0/mm_stat 1750650880 465908890 475398144 0 475398144 1 0 34185 34185 *** init zstd level=8 dict=/home/ss/zstd-dict-amd64 /sys/block/zram0/mm_stat 1750654976 430803319 439873536 0 439873536 1 0 34185 34185 *** init lz4 /sys/block/zram0/mm_stat 1750646784 664266564 677060608 0 677060608 1 0 34288 34288 *** init lz4 dict=/home/ss/lz4-dict-amd64 /sys/block/zram0/mm_stat 1750650880 619990300 632102912 0 632102912 1 0 34278 34278 *** init lz4hc /sys/block/zram0/mm_stat 1750630400 609023822 621232128 0 621232128 1 0 34288 34288 *** init lz4hc dict=/home/ss/lz4-dict-amd64 /sys/block/zram0/mm_stat 1750659072 505133172 515231744 0 515231744 1 0 34278 34278 Recompress init zram zstd (prio=0), zstd level=5 (prio 1), zstd with dict (prio 2) *** zstd /sys/block/zram0/mm_stat 1750982656 504630584 514269184 0 514269184 1 0 34204 34204 *** idle recompress priority=1 (zstd level=5) /sys/block/zram0/mm_stat 1750982656 488645601 525438976 0 514269184 1 0 34204 34204 *** idle recompress priority=2 (zstd dict) /sys/block/zram0/mm_stat 1750982656 460869640 517914624 0 514269184 1 0 34185 34204 This patch (of 24): We need to export a number of API functions that enable advanced zstd usage - C/D dictionaries, dictionaries sharing between contexts, etc. 
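For illustration only (not part of this patch): a rough sketch of how a
kernel user could drive the newly exported dictionary API. The function
name, the pre-trained dictionary buffer and the custom allocator are all
assumed to be supplied by the caller, and error handling is trimmed; this
is a sketch against the declarations added below, not in-tree code.

	/* sketch only; would need <linux/zstd.h> and <linux/errno.h> */
	static int sketch_compress_with_dict(const void *dict, size_t dict_len,
					     zstd_custom_mem mem,
					     const void *src, size_t src_len,
					     void *dst, size_t dst_cap)
	{
		zstd_compression_parameters cparams;
		zstd_cdict *cdict;
		zstd_cctx *cctx;
		int err = -ENOMEM;

		cparams = zstd_get_cparams(zstd_default_clevel(), src_len, dict_len);
		/* @dict is used by reference (ZSTD_dlm_byRef), so it must outlive @cdict */
		cdict = zstd_create_cdict_byreference(dict, dict_len, cparams, mem);
		cctx = zstd_create_cctx_advanced(mem);
		if (cdict && cctx) {
			size_t out = zstd_compress_using_cdict(cctx, dst, dst_cap,
							       src, src_len, cdict);
			err = zstd_is_error(out) ? -EINVAL : (int)out;
		}
		zstd_free_cctx(cctx);
		zstd_free_cdict(cdict);
		return err;
	}
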
Link: https://lkml.kernel.org/r/20240902105656.1383858-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20240902105656.1383858-2-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Nick Terrell Cc: Minchan Kim Cc: Sergey Senozhatsky Signed-off-by: Andrew Morton --- include/linux/zstd.h | 167 ++++++++++++++++++++++++++++++ lib/zstd/zstd_compress_module.c | 49 +++++++++ lib/zstd/zstd_decompress_module.c | 36 +++++++ 3 files changed, 252 insertions(+) diff --git a/include/linux/zstd.h b/include/linux/zstd.h index 113408eef6ec..b2c7cf310c8f 100644 --- a/include/linux/zstd.h +++ b/include/linux/zstd.h @@ -77,6 +77,30 @@ int zstd_min_clevel(void); */ int zstd_max_clevel(void); +/** + * zstd_default_clevel() - default compression level + * + * Return: Default compression level. + */ +int zstd_default_clevel(void); + +/** + * struct zstd_custom_mem - custom memory allocation + */ +typedef ZSTD_customMem zstd_custom_mem; + +/** + * struct zstd_dict_load_method - Dictionary load method. + * See zstd_lib.h. + */ +typedef ZSTD_dictLoadMethod_e zstd_dict_load_method; + +/** + * struct zstd_dict_content_type - Dictionary context type. + * See zstd_lib.h. + */ +typedef ZSTD_dictContentType_e zstd_dict_content_type; + /* ====== Parameter Selection ====== */ /** @@ -136,6 +160,19 @@ typedef ZSTD_parameters zstd_parameters; zstd_parameters zstd_get_params(int level, unsigned long long estimated_src_size); + +/** + * zstd_get_cparams() - returns zstd_compression_parameters for selected level + * @level: The compression level + * @estimated_src_size: The estimated source size to compress or 0 + * if unknown. + * @dict_size: Dictionary size. + * + * Return: The selected zstd_compression_parameters. + */ +zstd_compression_parameters zstd_get_cparams(int level, + unsigned long long estimated_src_size, size_t dict_size); + /* ====== Single-pass Compression ====== */ typedef ZSTD_CCtx zstd_cctx; @@ -180,6 +217,71 @@ zstd_cctx *zstd_init_cctx(void *workspace, size_t workspace_size); size_t zstd_compress_cctx(zstd_cctx *cctx, void *dst, size_t dst_capacity, const void *src, size_t src_size, const zstd_parameters *parameters); +/** + * zstd_create_cctx_advanced() - Create compression context + * @custom_mem: Custom allocator. + * + * Return: NULL on error, pointer to compression context otherwise. + */ +zstd_cctx *zstd_create_cctx_advanced(zstd_custom_mem custom_mem); + +/** + * zstd_free_cctx() - Free compression context + * @cdict: Pointer to compression context. + * + * Return: Always 0. + */ +size_t zstd_free_cctx(zstd_cctx* cctx); + +/** + * struct zstd_cdict - Compression dictionary. + * See zstd_lib.h. + */ +typedef ZSTD_CDict zstd_cdict; + +/** + * zstd_create_cdict_byreference() - Create compression dictionary + * @dict: Pointer to dictionary buffer. + * @dict_size: Size of the dictionary buffer. + * @dict_load_method: Dictionary load method. + * @dict_content_type: Dictionary content type. + * @custom_mem: Memory allocator. + * + * Note, this uses @dict by reference (ZSTD_dlm_byRef), so it should be + * free before zstd_cdict is destroyed. + * + * Return: NULL on error, pointer to compression dictionary + * otherwise. + */ +zstd_cdict *zstd_create_cdict_byreference(const void *dict, size_t dict_size, + zstd_compression_parameters cparams, + zstd_custom_mem custom_mem); + +/** + * zstd_free_cdict() - Free compression dictionary + * @cdict: Pointer to compression dictionary. + * + * Return: Always 0. 
+ */ +size_t zstd_free_cdict(zstd_cdict* cdict); + +/** + * zstd_compress_using_cdict() - compress src into dst using a dictionary + * @cctx: The context. Must have been initialized with zstd_init_cctx(). + * @dst: The buffer to compress src into. + * @dst_capacity: The size of the destination buffer. May be any size, but + * ZSTD_compressBound(srcSize) is guaranteed to be large enough. + * @src: The data to compress. + * @src_size: The size of the data to compress. + * @cdict: The dictionary to be used. + * + * Return: The compressed size or an error, which can be checked using + * zstd_is_error(). + */ +size_t zstd_compress_using_cdict(zstd_cctx *cctx, void *dst, + size_t dst_capacity, const void *src, size_t src_size, + const zstd_cdict *cdict); + /* ====== Single-pass Decompression ====== */ typedef ZSTD_DCtx zstd_dctx; @@ -220,6 +322,71 @@ zstd_dctx *zstd_init_dctx(void *workspace, size_t workspace_size); size_t zstd_decompress_dctx(zstd_dctx *dctx, void *dst, size_t dst_capacity, const void *src, size_t src_size); +/** + * struct zstd_ddict - Decompression dictionary. + * See zstd_lib.h. + */ +typedef ZSTD_DDict zstd_ddict; + +/** + * zstd_create_ddict_byreference() - Create decompression dictionary + * @dict: Pointer to dictionary buffer. + * @dict_size: Size of the dictionary buffer. + * @dict_load_method: Dictionary load method. + * @dict_content_type: Dictionary content type. + * @custom_mem: Memory allocator. + * + * Note, this uses @dict by reference (ZSTD_dlm_byRef), so it should be + * free before zstd_ddict is destroyed. + * + * Return: NULL on error, pointer to decompression dictionary + * otherwise. + */ +zstd_ddict *zstd_create_ddict_byreference(const void *dict, size_t dict_size, + zstd_custom_mem custom_mem); +/** + * zstd_free_ddict() - Free decompression dictionary + * @dict: Pointer to the dictionary. + * + * Return: Always 0. + */ +size_t zstd_free_ddict(zstd_ddict *ddict); + +/** + * zstd_create_dctx_advanced() - Create decompression context + * @custom_mem: Custom allocator. + * + * Return: NULL on error, pointer to decompression context otherwise. + */ +zstd_dctx *zstd_create_dctx_advanced(zstd_custom_mem custom_mem); + +/** + * zstd_free_dctx() -- Free decompression context + * @dctx: Pointer to decompression context. + * Return: Always 0. + */ +size_t zstd_free_dctx(zstd_dctx *dctx); + +/** + * zstd_decompress_using_ddict() - decompress src into dst using a dictionary + * @dctx: The decompression context. + * @dst: The buffer to decompress src into. + * @dst_capacity: The size of the destination buffer. Must be at least as large + * as the decompressed size. If the caller cannot upper bound the + * decompressed size, then it's better to use the streaming API. + * @src: The zstd compressed data to decompress. Multiple concatenated + * frames and skippable frames are allowed. + * @src_size: The exact size of the data to decompress. + * @ddict: The dictionary to be used. + * + * Return: The decompressed size or an error, which can be checked using + * zstd_is_error(). 
+ */ +size_t zstd_decompress_using_ddict(zstd_dctx *dctx, + void *dst, size_t dst_capacity, const void *src, size_t src_size, + const zstd_ddict *ddict); + + /* ====== Streaming Buffers ====== */ /** diff --git a/lib/zstd/zstd_compress_module.c b/lib/zstd/zstd_compress_module.c index 04e1b5c01d9b..bd8784449b31 100644 --- a/lib/zstd/zstd_compress_module.c +++ b/lib/zstd/zstd_compress_module.c @@ -66,6 +66,12 @@ int zstd_max_clevel(void) } EXPORT_SYMBOL(zstd_max_clevel); +int zstd_default_clevel(void) +{ + return ZSTD_defaultCLevel(); +} +EXPORT_SYMBOL(zstd_default_clevel); + size_t zstd_compress_bound(size_t src_size) { return ZSTD_compressBound(src_size); @@ -79,6 +85,13 @@ zstd_parameters zstd_get_params(int level, } EXPORT_SYMBOL(zstd_get_params); +zstd_compression_parameters zstd_get_cparams(int level, + unsigned long long estimated_src_size, size_t dict_size) +{ + return ZSTD_getCParams(level, estimated_src_size, dict_size); +} +EXPORT_SYMBOL(zstd_get_cparams); + size_t zstd_cctx_workspace_bound(const zstd_compression_parameters *cparams) { return ZSTD_estimateCCtxSize_usingCParams(*cparams); @@ -93,6 +106,33 @@ zstd_cctx *zstd_init_cctx(void *workspace, size_t workspace_size) } EXPORT_SYMBOL(zstd_init_cctx); +zstd_cctx *zstd_create_cctx_advanced(zstd_custom_mem custom_mem) +{ + return ZSTD_createCCtx_advanced(custom_mem); +} +EXPORT_SYMBOL(zstd_create_cctx_advanced); + +size_t zstd_free_cctx(zstd_cctx *cctx) +{ + return ZSTD_freeCCtx(cctx); +} +EXPORT_SYMBOL(zstd_free_cctx); + +zstd_cdict *zstd_create_cdict_byreference(const void *dict, size_t dict_size, + zstd_compression_parameters cparams, + zstd_custom_mem custom_mem) +{ + return ZSTD_createCDict_advanced(dict, dict_size, ZSTD_dlm_byRef, + ZSTD_dct_auto, cparams, custom_mem); +} +EXPORT_SYMBOL(zstd_create_cdict_byreference); + +size_t zstd_free_cdict(zstd_cdict *cdict) +{ + return ZSTD_freeCDict(cdict); +} +EXPORT_SYMBOL(zstd_free_cdict); + size_t zstd_compress_cctx(zstd_cctx *cctx, void *dst, size_t dst_capacity, const void *src, size_t src_size, const zstd_parameters *parameters) { @@ -101,6 +141,15 @@ size_t zstd_compress_cctx(zstd_cctx *cctx, void *dst, size_t dst_capacity, } EXPORT_SYMBOL(zstd_compress_cctx); +size_t zstd_compress_using_cdict(zstd_cctx *cctx, void *dst, + size_t dst_capacity, const void *src, size_t src_size, + const ZSTD_CDict *cdict) +{ + return ZSTD_compress_usingCDict(cctx, dst, dst_capacity, + src, src_size, cdict); +} +EXPORT_SYMBOL(zstd_compress_using_cdict); + size_t zstd_cstream_workspace_bound(const zstd_compression_parameters *cparams) { return ZSTD_estimateCStreamSize_usingCParams(*cparams); diff --git a/lib/zstd/zstd_decompress_module.c b/lib/zstd/zstd_decompress_module.c index f4ed952ed485..469fc3059be0 100644 --- a/lib/zstd/zstd_decompress_module.c +++ b/lib/zstd/zstd_decompress_module.c @@ -44,6 +44,33 @@ size_t zstd_dctx_workspace_bound(void) } EXPORT_SYMBOL(zstd_dctx_workspace_bound); +zstd_dctx *zstd_create_dctx_advanced(zstd_custom_mem custom_mem) +{ + return ZSTD_createDCtx_advanced(custom_mem); +} +EXPORT_SYMBOL(zstd_create_dctx_advanced); + +size_t zstd_free_dctx(zstd_dctx *dctx) +{ + return ZSTD_freeDCtx(dctx); +} +EXPORT_SYMBOL(zstd_free_dctx); + +zstd_ddict *zstd_create_ddict_byreference(const void *dict, size_t dict_size, + zstd_custom_mem custom_mem) +{ + return ZSTD_createDDict_advanced(dict, dict_size, ZSTD_dlm_byRef, + ZSTD_dct_auto, custom_mem); + +} +EXPORT_SYMBOL(zstd_create_ddict_byreference); + +size_t zstd_free_ddict(zstd_ddict *ddict) +{ + return ZSTD_freeDDict(ddict); +} 
+EXPORT_SYMBOL(zstd_free_ddict); + zstd_dctx *zstd_init_dctx(void *workspace, size_t workspace_size) { if (workspace == NULL) @@ -59,6 +86,15 @@ size_t zstd_decompress_dctx(zstd_dctx *dctx, void *dst, size_t dst_capacity, } EXPORT_SYMBOL(zstd_decompress_dctx); +size_t zstd_decompress_using_ddict(zstd_dctx *dctx, + void *dst, size_t dst_capacity, const void* src, size_t src_size, + const zstd_ddict* ddict) +{ + return ZSTD_decompress_usingDDict(dctx, dst, dst_capacity, src, + src_size, ddict); +} +EXPORT_SYMBOL(zstd_decompress_using_ddict); + size_t zstd_dstream_workspace_bound(size_t max_window_size) { return ZSTD_estimateDStreamSize(max_window_size); From 751884743025deeb52f3a2ad39acaf0f57e6d496 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:55:50 +0900 Subject: [PATCH 332/416] lib: lz4hc: export LZ4_resetStreamHC symbol This symbol is needed to enable lz4hc dictionary support. Link: https://lkml.kernel.org/r/20240902105656.1383858-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Nick Terrell Cc: Minchan Kim Signed-off-by: Andrew Morton --- lib/lz4/lz4hc_compress.c | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/lz4/lz4hc_compress.c b/lib/lz4/lz4hc_compress.c index e7ac8694b797..bc45594ad2a8 100644 --- a/lib/lz4/lz4hc_compress.c +++ b/lib/lz4/lz4hc_compress.c @@ -621,6 +621,7 @@ void LZ4_resetStreamHC(LZ4_streamHC_t *LZ4_streamHCPtr, int compressionLevel) LZ4_streamHCPtr->internal_donotuse.base = NULL; LZ4_streamHCPtr->internal_donotuse.compressionLevel = (unsigned int)compressionLevel; } +EXPORT_SYMBOL(LZ4_resetStreamHC); int LZ4_loadDictHC(LZ4_streamHC_t *LZ4_streamHCPtr, const char *dictionary, From f3c11cf5cae044668f888a50abb37b29600ca197 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:55:51 +0900 Subject: [PATCH 333/416] lib: zstd: fix null-deref in ZSTD_createCDict_advanced2() ZSTD_createCDict_advanced2() must ensure that ZSTD_createCDict_advanced_internal() has successfully allocated cdict. customMalloc() may be called under low memory condition and may be unable to allocate workspace for cdict. Link: https://lkml.kernel.org/r/20240902105656.1383858-4-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Nick Terrell Cc: Minchan Kim Signed-off-by: Andrew Morton --- lib/zstd/compress/zstd_compress.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/lib/zstd/compress/zstd_compress.c b/lib/zstd/compress/zstd_compress.c index f620cafca633..16bb995bc6c4 100644 --- a/lib/zstd/compress/zstd_compress.c +++ b/lib/zstd/compress/zstd_compress.c @@ -4810,6 +4810,8 @@ ZSTD_CDict* ZSTD_createCDict_advanced2( dictLoadMethod, cctxParams.cParams, cctxParams.useRowMatchFinder, cctxParams.enableDedicatedDictSearch, customMem); + if (!cdict) + return NULL; if (ZSTD_isError( ZSTD_initCDict_internal(cdict, dict, dictSize, From 917a59e81c342f47a45be8af7792dae64d317984 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:55:52 +0900 Subject: [PATCH 334/416] zram: introduce custom comp backends API Moving to custom backends implementation gives us ability to have our own minimalistic and extendable API, and algorithms tunings becomes possible. The list of compression backends is empty at this point, we will add backends in the followup patches. 
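For illustration only (not part of this patch): a minimal backend under
the new API provides per-stream create/destroy hooks plus
compress/decompress callbacks, wraps them in a zcomp_ops structure
(introduced below) and is listed in the backends[] table in zcomp.c. The
"noop" backend sketched here is hypothetical; the real lzo, lz4, zstd,
deflate and 842 backends follow later in this series.

	/* sketch only; would need <linux/slab.h>, <linux/string.h>,
	 * <linux/errno.h> and "zcomp.h" */
	static void *noop_create_ctx(void)
	{
		/* no real per-stream state, but zcomp expects a non-NULL ctx */
		return kzalloc(sizeof(int), GFP_KERNEL);
	}

	static void noop_destroy_ctx(void *ctx)
	{
		kfree(ctx);
	}

	static int noop_compress(void *ctx, const unsigned char *src,
				 size_t src_len, unsigned char *dst,
				 size_t *dst_len)
	{
		/* "compress" by copying; zcomp always hands us a 2 * PAGE_SIZE dst */
		memcpy(dst, src, src_len);
		*dst_len = src_len;
		return 0;
	}

	static int noop_decompress(void *ctx, const unsigned char *src,
				   size_t src_len, unsigned char *dst,
				   size_t dst_len)
	{
		if (src_len > dst_len)
			return -EINVAL;
		memcpy(dst, src, src_len);
		return 0;
	}

	static const struct zcomp_ops backend_noop = {
		.compress	= noop_compress,
		.decompress	= noop_decompress,
		.create_ctx	= noop_create_ctx,
		.destroy_ctx	= noop_destroy_ctx,
		.name		= "noop",
	};
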
Link: https://lkml.kernel.org/r/20240902105656.1383858-5-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- Documentation/admin-guide/blockdev/zram.rst | 11 +- drivers/block/zram/Kconfig | 39 +---- drivers/block/zram/zcomp.c | 150 +++++++------------- drivers/block/zram/zcomp.h | 29 ++-- drivers/block/zram/zram_drv.c | 9 +- 5 files changed, 78 insertions(+), 160 deletions(-) diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 091e8bb38887..181d55d64326 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -102,15 +102,8 @@ Examples:: #select lzo compression algorithm echo lzo > /sys/block/zram0/comp_algorithm -For the time being, the `comp_algorithm` content does not necessarily -show every compression algorithm supported by the kernel. We keep this -list primarily to simplify device configuration and one can configure -a new device with a compression algorithm that is not listed in -`comp_algorithm`. The thing is that, internally, ZRAM uses Crypto API -and, if some of the algorithms were built as modules, it's impossible -to list all of them using, for instance, /proc/crypto or any other -method. This, however, has an advantage of permitting the usage of -custom crypto compression modules (implementing S/W or H/W compression). +For the time being, the `comp_algorithm` content shows only compression +algorithms that are supported by zram. 4) Set Disksize =============== diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig index 7b29cce60ab2..8ecb74f83a5e 100644 --- a/drivers/block/zram/Kconfig +++ b/drivers/block/zram/Kconfig @@ -2,7 +2,6 @@ config ZRAM tristate "Compressed RAM block device support" depends on BLOCK && SYSFS && MMU - depends on CRYPTO_LZO || CRYPTO_ZSTD || CRYPTO_LZ4 || CRYPTO_LZ4HC || CRYPTO_842 select ZSMALLOC help Creates virtual block devices called /dev/zramX (X = 0, 1, ...). @@ -15,45 +14,9 @@ config ZRAM See Documentation/admin-guide/blockdev/zram.rst for more information. -choice - prompt "Default zram compressor" - default ZRAM_DEF_COMP_LZORLE - depends on ZRAM - -config ZRAM_DEF_COMP_LZORLE - bool "lzo-rle" - depends on CRYPTO_LZO - -config ZRAM_DEF_COMP_ZSTD - bool "zstd" - depends on CRYPTO_ZSTD - -config ZRAM_DEF_COMP_LZ4 - bool "lz4" - depends on CRYPTO_LZ4 - -config ZRAM_DEF_COMP_LZO - bool "lzo" - depends on CRYPTO_LZO - -config ZRAM_DEF_COMP_LZ4HC - bool "lz4hc" - depends on CRYPTO_LZ4HC - -config ZRAM_DEF_COMP_842 - bool "842" - depends on CRYPTO_842 - -endchoice - config ZRAM_DEF_COMP string - default "lzo-rle" if ZRAM_DEF_COMP_LZORLE - default "zstd" if ZRAM_DEF_COMP_ZSTD - default "lz4" if ZRAM_DEF_COMP_LZ4 - default "lzo" if ZRAM_DEF_COMP_LZO - default "lz4hc" if ZRAM_DEF_COMP_LZ4HC - default "842" if ZRAM_DEF_COMP_842 + default "unset-value" config ZRAM_WRITEBACK bool "Write back incompressible or idle page to backing device" diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index 8237b08c49d8..fa31f1b91e0f 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -1,7 +1,4 @@ // SPDX-License-Identifier: GPL-2.0-or-later -/* - * Copyright (C) 2014 Sergey Senozhatsky. 
- */ #include #include @@ -15,91 +12,68 @@ #include "zcomp.h" -static const char * const backends[] = { -#if IS_ENABLED(CONFIG_CRYPTO_LZO) - "lzo", - "lzo-rle", -#endif -#if IS_ENABLED(CONFIG_CRYPTO_LZ4) - "lz4", -#endif -#if IS_ENABLED(CONFIG_CRYPTO_LZ4HC) - "lz4hc", -#endif -#if IS_ENABLED(CONFIG_CRYPTO_842) - "842", -#endif -#if IS_ENABLED(CONFIG_CRYPTO_ZSTD) - "zstd", -#endif +static const struct zcomp_ops *backends[] = { + NULL }; -static void zcomp_strm_free(struct zcomp_strm *zstrm) +static void zcomp_strm_free(struct zcomp *comp, struct zcomp_strm *zstrm) { - if (!IS_ERR_OR_NULL(zstrm->tfm)) - crypto_free_comp(zstrm->tfm); + if (zstrm->ctx) + comp->ops->destroy_ctx(zstrm->ctx); vfree(zstrm->buffer); - zstrm->tfm = NULL; + zstrm->ctx = NULL; zstrm->buffer = NULL; } -/* - * Initialize zcomp_strm structure with ->tfm initialized by backend, and - * ->buffer. Return a negative value on error. - */ -static int zcomp_strm_init(struct zcomp_strm *zstrm, struct zcomp *comp) +static int zcomp_strm_init(struct zcomp *comp, struct zcomp_strm *zstrm) { - zstrm->tfm = crypto_alloc_comp(comp->name, 0, 0); + zstrm->ctx = comp->ops->create_ctx(); + /* * allocate 2 pages. 1 for compressed data, plus 1 extra for the * case when compressed size is larger than the original one */ zstrm->buffer = vzalloc(2 * PAGE_SIZE); - if (IS_ERR_OR_NULL(zstrm->tfm) || !zstrm->buffer) { - zcomp_strm_free(zstrm); + if (!zstrm->ctx || !zstrm->buffer) { + zcomp_strm_free(comp, zstrm); return -ENOMEM; } return 0; } +static const struct zcomp_ops *lookup_backend_ops(const char *comp) +{ + int i = 0; + + while (backends[i]) { + if (sysfs_streq(comp, backends[i]->name)) + break; + i++; + } + return backends[i]; +} + bool zcomp_available_algorithm(const char *comp) { - /* - * Crypto does not ignore a trailing new line symbol, - * so make sure you don't supply a string containing - * one. - * This also means that we permit zcomp initialisation - * with any compressing algorithm known to crypto api. - */ - return crypto_has_comp(comp, 0, 0) == 1; + return lookup_backend_ops(comp) != NULL; } /* show available compressors */ ssize_t zcomp_available_show(const char *comp, char *buf) { - bool known_algorithm = false; ssize_t sz = 0; int i; - for (i = 0; i < ARRAY_SIZE(backends); i++) { - if (!strcmp(comp, backends[i])) { - known_algorithm = true; + for (i = 0; i < ARRAY_SIZE(backends) - 1; i++) { + if (!strcmp(comp, backends[i]->name)) { sz += scnprintf(buf + sz, PAGE_SIZE - sz - 2, - "[%s] ", backends[i]); + "[%s] ", backends[i]->name); } else { sz += scnprintf(buf + sz, PAGE_SIZE - sz - 2, - "%s ", backends[i]); + "%s ", backends[i]->name); } } - /* - * Out-of-tree module known to crypto api or a missing - * entry in `backends'. - */ - if (!known_algorithm && crypto_has_comp(comp, 0, 0) == 1) - sz += scnprintf(buf + sz, PAGE_SIZE - sz - 2, - "[%s] ", comp); - sz += scnprintf(buf + sz, PAGE_SIZE - sz, "\n"); return sz; } @@ -115,38 +89,25 @@ void zcomp_stream_put(struct zcomp *comp) local_unlock(&comp->stream->lock); } -int zcomp_compress(struct zcomp_strm *zstrm, - const void *src, unsigned int *dst_len) +int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm, + const void *src, unsigned int *dst_len) { - /* - * Our dst memory (zstrm->buffer) is always `2 * PAGE_SIZE' sized - * because sometimes we can endup having a bigger compressed data - * due to various reasons: for example compression algorithms tend - * to add some padding to the compressed buffer. 
Speaking of padding, - * comp algorithm `842' pads the compressed length to multiple of 8 - * and returns -ENOSP when the dst memory is not big enough, which - * is not something that ZRAM wants to see. We can handle the - * `compressed_size > PAGE_SIZE' case easily in ZRAM, but when we - * receive -ERRNO from the compressing backend we can't help it - * anymore. To make `842' happy we need to tell the exact size of - * the dst buffer, zram_drv will take care of the fact that - * compressed buffer is too big. - */ - *dst_len = PAGE_SIZE * 2; + /* The dst buffer should always be 2 * PAGE_SIZE */ + size_t dlen = 2 * PAGE_SIZE; + int ret; - return crypto_comp_compress(zstrm->tfm, - src, PAGE_SIZE, - zstrm->buffer, dst_len); + ret = comp->ops->compress(zstrm->ctx, src, PAGE_SIZE, + zstrm->buffer, &dlen); + if (!ret) + *dst_len = dlen; + return ret; } -int zcomp_decompress(struct zcomp_strm *zstrm, - const void *src, unsigned int src_len, void *dst) +int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm, + const void *src, unsigned int src_len, void *dst) { - unsigned int dst_len = PAGE_SIZE; - - return crypto_comp_decompress(zstrm->tfm, - src, src_len, - dst, &dst_len); + return comp->ops->decompress(zstrm->ctx, src, src_len, + dst, PAGE_SIZE); } int zcomp_cpu_up_prepare(unsigned int cpu, struct hlist_node *node) @@ -158,7 +119,7 @@ int zcomp_cpu_up_prepare(unsigned int cpu, struct hlist_node *node) zstrm = per_cpu_ptr(comp->stream, cpu); local_lock_init(&zstrm->lock); - ret = zcomp_strm_init(zstrm, comp); + ret = zcomp_strm_init(comp, zstrm); if (ret) pr_err("Can't allocate a compression stream\n"); return ret; @@ -170,7 +131,7 @@ int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node) struct zcomp_strm *zstrm; zstrm = per_cpu_ptr(comp->stream, cpu); - zcomp_strm_free(zstrm); + zcomp_strm_free(comp, zstrm); return 0; } @@ -199,32 +160,21 @@ void zcomp_destroy(struct zcomp *comp) kfree(comp); } -/* - * search available compressors for requested algorithm. - * allocate new zcomp and initialize it. return compressing - * backend pointer or ERR_PTR if things went bad. ERR_PTR(-EINVAL) - * if requested algorithm is not supported, ERR_PTR(-ENOMEM) in - * case of allocation error, or any other error potentially - * returned by zcomp_init(). - */ struct zcomp *zcomp_create(const char *alg) { struct zcomp *comp; int error; - /* - * Crypto API will execute /sbin/modprobe if the compression module - * is not loaded yet. We must do it here, otherwise we are about to - * call /sbin/modprobe under CPU hot-plug lock. - */ - if (!zcomp_available_algorithm(alg)) - return ERR_PTR(-EINVAL); - comp = kzalloc(sizeof(struct zcomp), GFP_KERNEL); if (!comp) return ERR_PTR(-ENOMEM); - comp->name = alg; + comp->ops = lookup_backend_ops(alg); + if (!comp->ops) { + kfree(comp); + return ERR_PTR(-EINVAL); + } + error = zcomp_init(comp); if (error) { kfree(comp); diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h index e9fe63da0e9b..e5eb5ec4c645 100644 --- a/drivers/block/zram/zcomp.h +++ b/drivers/block/zram/zcomp.h @@ -1,7 +1,4 @@ /* SPDX-License-Identifier: GPL-2.0-or-later */ -/* - * Copyright (C) 2014 Sergey Senozhatsky. 
- */ #ifndef _ZCOMP_H_ #define _ZCOMP_H_ @@ -12,13 +9,26 @@ struct zcomp_strm { local_lock_t lock; /* compression/decompression buffer */ void *buffer; - struct crypto_comp *tfm; + void *ctx; +}; + +struct zcomp_ops { + int (*compress)(void *ctx, const unsigned char *src, size_t src_len, + unsigned char *dst, size_t *dst_len); + + int (*decompress)(void *ctx, const unsigned char *src, size_t src_len, + unsigned char *dst, size_t dst_len); + + void *(*create_ctx)(void); + void (*destroy_ctx)(void *ctx); + + const char *name; }; /* dynamic per-device compression frontend */ struct zcomp { struct zcomp_strm __percpu *stream; - const char *name; + const struct zcomp_ops *ops; struct hlist_node node; }; @@ -33,10 +43,9 @@ void zcomp_destroy(struct zcomp *comp); struct zcomp_strm *zcomp_stream_get(struct zcomp *comp); void zcomp_stream_put(struct zcomp *comp); -int zcomp_compress(struct zcomp_strm *zstrm, - const void *src, unsigned int *dst_len); - -int zcomp_decompress(struct zcomp_strm *zstrm, - const void *src, unsigned int src_len, void *dst); +int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm, + const void *src, unsigned int *dst_len); +int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm, + const void *src, unsigned int src_len, void *dst); #endif /* _ZCOMP_H_ */ diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index efcb8d9d274c..93042da8ccdf 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -1327,7 +1327,8 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page, ret = 0; } else { dst = kmap_local_page(page); - ret = zcomp_decompress(zstrm, src, size, dst); + ret = zcomp_decompress(zram->comps[prio], zstrm, + src, size, dst); kunmap_local(dst); zcomp_stream_put(zram->comps[prio]); } @@ -1414,7 +1415,8 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index) compress_again: zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]); src = kmap_local_page(page); - ret = zcomp_compress(zstrm, src, &comp_len); + ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm, + src, &comp_len); kunmap_local(src); if (unlikely(ret)) { @@ -1601,7 +1603,8 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page, num_recomps++; zstrm = zcomp_stream_get(zram->comps[prio]); src = kmap_local_page(page); - ret = zcomp_compress(zstrm, src, &comp_len_new); + ret = zcomp_compress(zram->comps[prio], zstrm, + src, &comp_len_new); kunmap_local(src); if (ret) { From 2152247c55b6bfc8d5178506787b1c38ea2679f1 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:55:53 +0900 Subject: [PATCH 335/416] zram: add lzo and lzorle compression backends support Add s/w lzo/lzorle compression support. 
Link: https://lkml.kernel.org/r/20240902105656.1383858-6-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/Kconfig | 23 +++++++++++++++ drivers/block/zram/Makefile | 3 ++ drivers/block/zram/backend_lzo.c | 43 ++++++++++++++++++++++++++++ drivers/block/zram/backend_lzo.h | 10 +++++++ drivers/block/zram/backend_lzorle.c | 44 +++++++++++++++++++++++++++++ drivers/block/zram/backend_lzorle.h | 10 +++++++ drivers/block/zram/zcomp.c | 7 +++++ 7 files changed, 140 insertions(+) create mode 100644 drivers/block/zram/backend_lzo.c create mode 100644 drivers/block/zram/backend_lzo.h create mode 100644 drivers/block/zram/backend_lzorle.c create mode 100644 drivers/block/zram/backend_lzorle.h diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig index 8ecb74f83a5e..5d329a887b12 100644 --- a/drivers/block/zram/Kconfig +++ b/drivers/block/zram/Kconfig @@ -14,8 +14,31 @@ config ZRAM See Documentation/admin-guide/blockdev/zram.rst for more information. +config ZRAM_BACKEND_LZO + bool "lzo and lzo-rle compression support" + depends on ZRAM + select LZO_COMPRESS + select LZO_DECOMPRESS + +choice + prompt "Default zram compressor" + default ZRAM_DEF_COMP_LZORLE + depends on ZRAM + +config ZRAM_DEF_COMP_LZORLE + bool "lzo-rle" + depends on ZRAM_BACKEND_LZO + +config ZRAM_DEF_COMP_LZO + bool "lzo" + depends on ZRAM_BACKEND_LZO + +endchoice + config ZRAM_DEF_COMP string + default "lzo-rle" if ZRAM_DEF_COMP_LZORLE + default "lzo" if ZRAM_DEF_COMP_LZO default "unset-value" config ZRAM_WRITEBACK diff --git a/drivers/block/zram/Makefile b/drivers/block/zram/Makefile index de9e457907b1..2a3db3368af9 100644 --- a/drivers/block/zram/Makefile +++ b/drivers/block/zram/Makefile @@ -1,4 +1,7 @@ # SPDX-License-Identifier: GPL-2.0-only + zram-y := zcomp.o zram_drv.o +zram-$(CONFIG_ZRAM_BACKEND_LZO) += backend_lzorle.o backend_lzo.o + obj-$(CONFIG_ZRAM) += zram.o diff --git a/drivers/block/zram/backend_lzo.c b/drivers/block/zram/backend_lzo.c new file mode 100644 index 000000000000..4c9e611f6c03 --- /dev/null +++ b/drivers/block/zram/backend_lzo.c @@ -0,0 +1,43 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include + +#include "backend_lzo.h" + +static void *lzo_create(void) +{ + return kzalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL); +} + +static void lzo_destroy(void *ctx) +{ + kfree(ctx); +} + +static int lzo_compress(void *ctx, const unsigned char *src, size_t src_len, + unsigned char *dst, size_t *dst_len) +{ + int ret; + + ret = lzo1x_1_compress(src, src_len, dst, dst_len, ctx); + return ret == LZO_E_OK ? 0 : ret; +} + +static int lzo_decompress(void *ctx, const unsigned char *src, size_t src_len, + unsigned char *dst, size_t dst_len) +{ + int ret; + + ret = lzo1x_decompress_safe(src, src_len, dst, &dst_len); + return ret == LZO_E_OK ? 
0 : ret; +} + +const struct zcomp_ops backend_lzo = { + .compress = lzo_compress, + .decompress = lzo_decompress, + .create_ctx = lzo_create, + .destroy_ctx = lzo_destroy, + .name = "lzo", +}; diff --git a/drivers/block/zram/backend_lzo.h b/drivers/block/zram/backend_lzo.h new file mode 100644 index 000000000000..93d54749e63c --- /dev/null +++ b/drivers/block/zram/backend_lzo.h @@ -0,0 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#ifndef __BACKEND_LZO_H__ +#define __BACKEND_LZO_H__ + +#include "zcomp.h" + +extern const struct zcomp_ops backend_lzo; + +#endif /* __BACKEND_LZO_H__ */ diff --git a/drivers/block/zram/backend_lzorle.c b/drivers/block/zram/backend_lzorle.c new file mode 100644 index 000000000000..19a9ce132046 --- /dev/null +++ b/drivers/block/zram/backend_lzorle.c @@ -0,0 +1,44 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include + +#include "backend_lzorle.h" + +static void *lzorle_create(void) +{ + return kzalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL); +} + +static void lzorle_destroy(void *ctx) +{ + kfree(ctx); +} + +static int lzorle_compress(void *ctx, const unsigned char *src, size_t src_len, + unsigned char *dst, size_t *dst_len) +{ + int ret; + + ret = lzorle1x_1_compress(src, src_len, dst, dst_len, ctx); + return ret == LZO_E_OK ? 0 : ret; +} + +static int lzorle_decompress(void *ctx, const unsigned char *src, + size_t src_len, unsigned char *dst, + size_t dst_len) +{ + int ret; + + ret = lzo1x_decompress_safe(src, src_len, dst, &dst_len); + return ret == LZO_E_OK ? 0 : ret; +} + +const struct zcomp_ops backend_lzorle = { + .compress = lzorle_compress, + .decompress = lzorle_decompress, + .create_ctx = lzorle_create, + .destroy_ctx = lzorle_destroy, + .name = "lzo-rle", +}; diff --git a/drivers/block/zram/backend_lzorle.h b/drivers/block/zram/backend_lzorle.h new file mode 100644 index 000000000000..6ecb163b09f1 --- /dev/null +++ b/drivers/block/zram/backend_lzorle.h @@ -0,0 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#ifndef __BACKEND_LZORLE_H__ +#define __BACKEND_LZORLE_H__ + +#include "zcomp.h" + +extern const struct zcomp_ops backend_lzorle; + +#endif /* __BACKEND_LZORLE_H__ */ diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index fa31f1b91e0f..a49d3e355c23 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -12,7 +12,14 @@ #include "zcomp.h" +#include "backend_lzo.h" +#include "backend_lzorle.h" + static const struct zcomp_ops *backends[] = { +#if IS_ENABLED(CONFIG_ZRAM_BACKEND_LZO) + &backend_lzorle, + &backend_lzo, +#endif NULL }; From 22d651c3b339ec9dcf259c61e7c9eb988acfacc1 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:55:54 +0900 Subject: [PATCH 336/416] zram: add lz4 compression backend support Add s/w lz4 compression support. 
Link: https://lkml.kernel.org/r/20240902105656.1383858-7-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/Kconfig | 11 +++++ drivers/block/zram/Makefile | 1 + drivers/block/zram/backend_lz4.c | 72 ++++++++++++++++++++++++++++++++ drivers/block/zram/backend_lz4.h | 10 +++++ drivers/block/zram/zcomp.c | 4 ++ 5 files changed, 98 insertions(+) create mode 100644 drivers/block/zram/backend_lz4.c create mode 100644 drivers/block/zram/backend_lz4.h diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig index 5d329a887b12..f1e76fc8431a 100644 --- a/drivers/block/zram/Kconfig +++ b/drivers/block/zram/Kconfig @@ -20,6 +20,12 @@ config ZRAM_BACKEND_LZO select LZO_COMPRESS select LZO_DECOMPRESS +config ZRAM_BACKEND_LZ4 + bool "lz4 compression support" + depends on ZRAM + select LZ4_COMPRESS + select LZ4_DECOMPRESS + choice prompt "Default zram compressor" default ZRAM_DEF_COMP_LZORLE @@ -33,12 +39,17 @@ config ZRAM_DEF_COMP_LZO bool "lzo" depends on ZRAM_BACKEND_LZO +config ZRAM_DEF_COMP_LZ4 + bool "lz4" + depends on ZRAM_BACKEND_LZ4 + endchoice config ZRAM_DEF_COMP string default "lzo-rle" if ZRAM_DEF_COMP_LZORLE default "lzo" if ZRAM_DEF_COMP_LZO + default "lz4" if ZRAM_DEF_COMP_LZ4 default "unset-value" config ZRAM_WRITEBACK diff --git a/drivers/block/zram/Makefile b/drivers/block/zram/Makefile index 2a3db3368af9..567f4434aee8 100644 --- a/drivers/block/zram/Makefile +++ b/drivers/block/zram/Makefile @@ -3,5 +3,6 @@ zram-y := zcomp.o zram_drv.o zram-$(CONFIG_ZRAM_BACKEND_LZO) += backend_lzorle.o backend_lzo.o +zram-$(CONFIG_ZRAM_BACKEND_LZ4) += backend_lz4.o obj-$(CONFIG_ZRAM) += zram.o diff --git a/drivers/block/zram/backend_lz4.c b/drivers/block/zram/backend_lz4.c new file mode 100644 index 000000000000..c1d19fed5af2 --- /dev/null +++ b/drivers/block/zram/backend_lz4.c @@ -0,0 +1,72 @@ +#include +#include +#include +#include + +#include "backend_lz4.h" + +struct lz4_ctx { + void *mem; + s32 level; +}; + +static void lz4_destroy(void *ctx) +{ + struct lz4_ctx *zctx = ctx; + + vfree(zctx->mem); + kfree(zctx); +} + +static void *lz4_create(void) +{ + struct lz4_ctx *ctx; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + /* @FIXME: using a hardcoded LZ4_ACCELERATION_DEFAULT for now */ + ctx->level = LZ4_ACCELERATION_DEFAULT; + ctx->mem = vmalloc(LZ4_MEM_COMPRESS); + if (!ctx->mem) + goto error; + + return ctx; +error: + lz4_destroy(ctx); + return NULL; +} + +static int lz4_compress(void *ctx, const unsigned char *src, size_t src_len, + unsigned char *dst, size_t *dst_len) +{ + struct lz4_ctx *zctx = ctx; + int ret; + + ret = LZ4_compress_fast(src, dst, src_len, *dst_len, + zctx->level, zctx->mem); + if (!ret) + return -EINVAL; + *dst_len = ret; + return 0; +} + +static int lz4_decompress(void *ctx, const unsigned char *src, + size_t src_len, unsigned char *dst, size_t dst_len) +{ + int ret; + + ret = LZ4_decompress_safe(src, dst, src_len, dst_len); + if (ret < 0) + return -EINVAL; + return 0; +} + +const struct zcomp_ops backend_lz4 = { + .compress = lz4_compress, + .decompress = lz4_decompress, + .create_ctx = lz4_create, + .destroy_ctx = lz4_destroy, + .name = "lz4", +}; diff --git a/drivers/block/zram/backend_lz4.h b/drivers/block/zram/backend_lz4.h new file mode 100644 index 000000000000..c11fa602a703 --- /dev/null +++ b/drivers/block/zram/backend_lz4.h @@ -0,0 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#ifndef __BACKEND_LZ4_H__ +#define 
__BACKEND_LZ4_H__ + +#include "zcomp.h" + +extern const struct zcomp_ops backend_lz4; + +#endif /* __BACKEND_LZ4_H__ */ diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index a49d3e355c23..5da64e8fc798 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -14,11 +14,15 @@ #include "backend_lzo.h" #include "backend_lzorle.h" +#include "backend_lz4.h" static const struct zcomp_ops *backends[] = { #if IS_ENABLED(CONFIG_ZRAM_BACKEND_LZO) &backend_lzorle, &backend_lzo, +#endif +#if IS_ENABLED(CONFIG_ZRAM_BACKEND_LZ4) + &backend_lz4, #endif NULL }; From c60a4ef544464f03368ef399ccb8fc115ba9f070 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:55:55 +0900 Subject: [PATCH 337/416] zram: add lz4hc compression backend support Add s/w lz4hc compression support. Link: https://lkml.kernel.org/r/20240902105656.1383858-8-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/Kconfig | 11 +++++ drivers/block/zram/Makefile | 5 ++- drivers/block/zram/backend_lz4hc.c | 72 ++++++++++++++++++++++++++++++ drivers/block/zram/backend_lz4hc.h | 10 +++++ drivers/block/zram/zcomp.c | 4 ++ 5 files changed, 100 insertions(+), 2 deletions(-) create mode 100644 drivers/block/zram/backend_lz4hc.c create mode 100644 drivers/block/zram/backend_lz4hc.h diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig index f1e76fc8431a..0d890a8d2dc4 100644 --- a/drivers/block/zram/Kconfig +++ b/drivers/block/zram/Kconfig @@ -26,6 +26,12 @@ config ZRAM_BACKEND_LZ4 select LZ4_COMPRESS select LZ4_DECOMPRESS +config ZRAM_BACKEND_LZ4HC + bool "lz4hc compression support" + depends on ZRAM + select LZ4HC_COMPRESS + select LZ4_DECOMPRESS + choice prompt "Default zram compressor" default ZRAM_DEF_COMP_LZORLE @@ -43,6 +49,10 @@ config ZRAM_DEF_COMP_LZ4 bool "lz4" depends on ZRAM_BACKEND_LZ4 +config ZRAM_DEF_COMP_LZ4HC + bool "lz4hc" + depends on ZRAM_BACKEND_LZ4HC + endchoice config ZRAM_DEF_COMP @@ -50,6 +60,7 @@ config ZRAM_DEF_COMP default "lzo-rle" if ZRAM_DEF_COMP_LZORLE default "lzo" if ZRAM_DEF_COMP_LZO default "lz4" if ZRAM_DEF_COMP_LZ4 + default "lz4hc" if ZRAM_DEF_COMP_LZ4HC default "unset-value" config ZRAM_WRITEBACK diff --git a/drivers/block/zram/Makefile b/drivers/block/zram/Makefile index 567f4434aee8..727c9d68e3e3 100644 --- a/drivers/block/zram/Makefile +++ b/drivers/block/zram/Makefile @@ -2,7 +2,8 @@ zram-y := zcomp.o zram_drv.o -zram-$(CONFIG_ZRAM_BACKEND_LZO) += backend_lzorle.o backend_lzo.o -zram-$(CONFIG_ZRAM_BACKEND_LZ4) += backend_lz4.o +zram-$(CONFIG_ZRAM_BACKEND_LZO) += backend_lzorle.o backend_lzo.o +zram-$(CONFIG_ZRAM_BACKEND_LZ4) += backend_lz4.o +zram-$(CONFIG_ZRAM_BACKEND_LZ4HC) += backend_lz4hc.o obj-$(CONFIG_ZRAM) += zram.o diff --git a/drivers/block/zram/backend_lz4hc.c b/drivers/block/zram/backend_lz4hc.c new file mode 100644 index 000000000000..536f7a0073c4 --- /dev/null +++ b/drivers/block/zram/backend_lz4hc.c @@ -0,0 +1,72 @@ +#include +#include +#include +#include + +#include "backend_lz4hc.h" + +struct lz4hc_ctx { + void *mem; + s32 level; +}; + +static void lz4hc_destroy(void *ctx) +{ + struct lz4hc_ctx *zctx = ctx; + + vfree(zctx->mem); + kfree(zctx); +} + +static void *lz4hc_create(void) +{ + struct lz4hc_ctx *ctx; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + /* @FIXME: using a hardcoded LZ4HC_DEFAULT_CLEVEL for now */ + ctx->level = LZ4HC_DEFAULT_CLEVEL; + ctx->mem = vmalloc(LZ4HC_MEM_COMPRESS); + if 
(!ctx->mem) + goto error; + + return ctx; +error: + lz4hc_destroy(ctx); + return NULL; +} + +static int lz4hc_compress(void *ctx, const unsigned char *src, size_t src_len, + unsigned char *dst, size_t *dst_len) +{ + struct lz4hc_ctx *zctx = ctx; + int ret; + + ret = LZ4_compress_HC(src, dst, src_len, *dst_len, + zctx->level, zctx->mem); + if (!ret) + return -EINVAL; + *dst_len = ret; + return 0; +} + +static int lz4hc_decompress(void *ctx, const unsigned char *src, + size_t src_len, unsigned char *dst, size_t dst_len) +{ + int ret; + + ret = LZ4_decompress_safe(src, dst, src_len, dst_len); + if (ret < 0) + return -EINVAL; + return 0; +} + +const struct zcomp_ops backend_lz4hc = { + .compress = lz4hc_compress, + .decompress = lz4hc_decompress, + .create_ctx = lz4hc_create, + .destroy_ctx = lz4hc_destroy, + .name = "lz4hc", +}; diff --git a/drivers/block/zram/backend_lz4hc.h b/drivers/block/zram/backend_lz4hc.h new file mode 100644 index 000000000000..6de03551ed4d --- /dev/null +++ b/drivers/block/zram/backend_lz4hc.h @@ -0,0 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#ifndef __BACKEND_LZ4HC_H__ +#define __BACKEND_LZ4HC_H__ + +#include "zcomp.h" + +extern const struct zcomp_ops backend_lz4hc; + +#endif /* __BACKEND_LZ4HC_H__ */ diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index 5da64e8fc798..a06cdb9aa567 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -15,6 +15,7 @@ #include "backend_lzo.h" #include "backend_lzorle.h" #include "backend_lz4.h" +#include "backend_lz4hc.h" static const struct zcomp_ops *backends[] = { #if IS_ENABLED(CONFIG_ZRAM_BACKEND_LZO) @@ -23,6 +24,9 @@ static const struct zcomp_ops *backends[] = { #endif #if IS_ENABLED(CONFIG_ZRAM_BACKEND_LZ4) &backend_lz4, +#endif +#if IS_ENABLED(CONFIG_ZRAM_BACKEND_LZ4HC) + &backend_lz4hc, #endif NULL }; From 73e7d81abbc80b04595cc7da09dcaa05a3a92602 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:55:56 +0900 Subject: [PATCH 338/416] zram: add zstd compression backend support Add s/w zstd compression. 
Link: https://lkml.kernel.org/r/20240902105656.1383858-9-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/Kconfig | 11 ++++ drivers/block/zram/Makefile | 1 + drivers/block/zram/backend_zstd.c | 97 +++++++++++++++++++++++++++++++ drivers/block/zram/backend_zstd.h | 10 ++++ drivers/block/zram/zcomp.c | 4 ++ 5 files changed, 123 insertions(+) create mode 100644 drivers/block/zram/backend_zstd.c create mode 100644 drivers/block/zram/backend_zstd.h diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig index 0d890a8d2dc4..71cd0d5d8f35 100644 --- a/drivers/block/zram/Kconfig +++ b/drivers/block/zram/Kconfig @@ -32,6 +32,12 @@ config ZRAM_BACKEND_LZ4HC select LZ4HC_COMPRESS select LZ4_DECOMPRESS +config ZRAM_BACKEND_ZSTD + bool "zstd compression support" + depends on ZRAM + select ZSTD_COMPRESS + select ZSTD_DECOMPRESS + choice prompt "Default zram compressor" default ZRAM_DEF_COMP_LZORLE @@ -53,6 +59,10 @@ config ZRAM_DEF_COMP_LZ4HC bool "lz4hc" depends on ZRAM_BACKEND_LZ4HC +config ZRAM_DEF_COMP_ZSTD + bool "zstd" + depends on ZRAM_BACKEND_ZSTD + endchoice config ZRAM_DEF_COMP @@ -61,6 +71,7 @@ config ZRAM_DEF_COMP default "lzo" if ZRAM_DEF_COMP_LZO default "lz4" if ZRAM_DEF_COMP_LZ4 default "lz4hc" if ZRAM_DEF_COMP_LZ4HC + default "zstd" if ZRAM_DEF_COMP_ZSTD default "unset-value" config ZRAM_WRITEBACK diff --git a/drivers/block/zram/Makefile b/drivers/block/zram/Makefile index 727c9d68e3e3..a2ca227e199c 100644 --- a/drivers/block/zram/Makefile +++ b/drivers/block/zram/Makefile @@ -5,5 +5,6 @@ zram-y := zcomp.o zram_drv.o zram-$(CONFIG_ZRAM_BACKEND_LZO) += backend_lzorle.o backend_lzo.o zram-$(CONFIG_ZRAM_BACKEND_LZ4) += backend_lz4.o zram-$(CONFIG_ZRAM_BACKEND_LZ4HC) += backend_lz4hc.o +zram-$(CONFIG_ZRAM_BACKEND_ZSTD) += backend_zstd.o obj-$(CONFIG_ZRAM) += zram.o diff --git a/drivers/block/zram/backend_zstd.c b/drivers/block/zram/backend_zstd.c new file mode 100644 index 000000000000..abec68d1104b --- /dev/null +++ b/drivers/block/zram/backend_zstd.c @@ -0,0 +1,97 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include +#include + +#include "backend_zstd.h" + +struct zstd_ctx { + zstd_cctx *cctx; + zstd_dctx *dctx; + void *cctx_mem; + void *dctx_mem; + s32 level; +}; + +static void zstd_destroy(void *ctx) +{ + struct zstd_ctx *zctx = ctx; + + vfree(zctx->cctx_mem); + vfree(zctx->dctx_mem); + kfree(zctx); +} + +static void *zstd_create(void) +{ + zstd_parameters params; + struct zstd_ctx *ctx; + size_t sz; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + ctx->level = zstd_default_clevel(); + params = zstd_get_params(ctx->level, 0); + sz = zstd_cctx_workspace_bound(¶ms.cParams); + ctx->cctx_mem = vzalloc(sz); + if (!ctx->cctx_mem) + goto error; + + ctx->cctx = zstd_init_cctx(ctx->cctx_mem, sz); + if (!ctx->cctx) + goto error; + + sz = zstd_dctx_workspace_bound(); + ctx->dctx_mem = vzalloc(sz); + if (!ctx->dctx_mem) + goto error; + + ctx->dctx = zstd_init_dctx(ctx->dctx_mem, sz); + if (!ctx->dctx) + goto error; + + return ctx; + +error: + zstd_destroy(ctx); + return NULL; +} + +static int zstd_compress(void *ctx, const unsigned char *src, size_t src_len, + unsigned char *dst, size_t *dst_len) +{ + struct zstd_ctx *zctx = ctx; + const zstd_parameters params = zstd_get_params(zctx->level, 0); + size_t ret; + + ret = zstd_compress_cctx(zctx->cctx, dst, *dst_len, + src, src_len, ¶ms); + if (zstd_is_error(ret)) + return -EINVAL; + *dst_len 
= ret; + return 0; +} + +static int zstd_decompress(void *ctx, const unsigned char *src, size_t src_len, + unsigned char *dst, size_t dst_len) +{ + struct zstd_ctx *zctx = ctx; + size_t ret; + + ret = zstd_decompress_dctx(zctx->dctx, dst, dst_len, src, src_len); + if (zstd_is_error(ret)) + return -EINVAL; + return 0; +} + +const struct zcomp_ops backend_zstd = { + .compress = zstd_compress, + .decompress = zstd_decompress, + .create_ctx = zstd_create, + .destroy_ctx = zstd_destroy, + .name = "zstd", +}; diff --git a/drivers/block/zram/backend_zstd.h b/drivers/block/zram/backend_zstd.h new file mode 100644 index 000000000000..10fdfff1ec1c --- /dev/null +++ b/drivers/block/zram/backend_zstd.h @@ -0,0 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#ifndef __BACKEND_ZSTD_H__ +#define __BACKEND_ZSTD_H__ + +#include "zcomp.h" + +extern const struct zcomp_ops backend_zstd; + +#endif /* __BACKEND_ZSTD_H__ */ diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index a06cdb9aa567..07147659f4a2 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -16,6 +16,7 @@ #include "backend_lzorle.h" #include "backend_lz4.h" #include "backend_lz4hc.h" +#include "backend_zstd.h" static const struct zcomp_ops *backends[] = { #if IS_ENABLED(CONFIG_ZRAM_BACKEND_LZO) @@ -27,6 +28,9 @@ static const struct zcomp_ops *backends[] = { #endif #if IS_ENABLED(CONFIG_ZRAM_BACKEND_LZ4HC) &backend_lz4hc, +#endif +#if IS_ENABLED(CONFIG_ZRAM_BACKEND_ZSTD) + &backend_zstd, #endif NULL }; From dbf2763cec21f4e56420c3a31bf6f0ebf81b5ab7 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:55:57 +0900 Subject: [PATCH 339/416] zram: pass estimated src size hint to zstd zram works with PAGE_SIZE buffers, so we always know exact size of the source buffer and hence can pass estimated_src_size to zstd_get_params(). This hint on x86_64, for example, reduces the size of the work memory buffer from 1303520 bytes down to 90080 bytes. Given that compression streams are per-CPU that's quite some memory saving. Link: https://lkml.kernel.org/r/20240902105656.1383858-10-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/backend_zstd.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/block/zram/backend_zstd.c b/drivers/block/zram/backend_zstd.c index abec68d1104b..310d970078e2 100644 --- a/drivers/block/zram/backend_zstd.c +++ b/drivers/block/zram/backend_zstd.c @@ -35,7 +35,7 @@ static void *zstd_create(void) return NULL; ctx->level = zstd_default_clevel(); - params = zstd_get_params(ctx->level, 0); + params = zstd_get_params(ctx->level, PAGE_SIZE); sz = zstd_cctx_workspace_bound(¶ms.cParams); ctx->cctx_mem = vzalloc(sz); if (!ctx->cctx_mem) @@ -65,7 +65,7 @@ static int zstd_compress(void *ctx, const unsigned char *src, size_t src_len, unsigned char *dst, size_t *dst_len) { struct zstd_ctx *zctx = ctx; - const zstd_parameters params = zstd_get_params(zctx->level, 0); + const zstd_parameters params = zstd_get_params(zctx->level, PAGE_SIZE); size_t ret; ret = zstd_compress_cctx(zctx->cctx, dst, *dst_len, From 84112e314f69e1f86d532cee0ef6ea579aeea26d Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:55:58 +0900 Subject: [PATCH 340/416] zram: add zlib compression backend support Add s/w zlib (inflate/deflate) compression. 
Link: https://lkml.kernel.org/r/20240902105656.1383858-11-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/Kconfig | 11 +++ drivers/block/zram/Makefile | 1 + drivers/block/zram/backend_deflate.c | 132 +++++++++++++++++++++++++++ drivers/block/zram/backend_deflate.h | 10 ++ drivers/block/zram/zcomp.c | 4 + 5 files changed, 158 insertions(+) create mode 100644 drivers/block/zram/backend_deflate.c create mode 100644 drivers/block/zram/backend_deflate.h diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig index 71cd0d5d8f35..9dedd2edfb28 100644 --- a/drivers/block/zram/Kconfig +++ b/drivers/block/zram/Kconfig @@ -38,6 +38,12 @@ config ZRAM_BACKEND_ZSTD select ZSTD_COMPRESS select ZSTD_DECOMPRESS +config ZRAM_BACKEND_DEFLATE + bool "deflate compression support" + depends on ZRAM + select ZLIB_DEFLATE + select ZLIB_INFLATE + choice prompt "Default zram compressor" default ZRAM_DEF_COMP_LZORLE @@ -63,6 +69,10 @@ config ZRAM_DEF_COMP_ZSTD bool "zstd" depends on ZRAM_BACKEND_ZSTD +config ZRAM_DEF_COMP_DEFLATE + bool "deflate" + depends on ZRAM_BACKEND_DEFLATE + endchoice config ZRAM_DEF_COMP @@ -72,6 +82,7 @@ config ZRAM_DEF_COMP default "lz4" if ZRAM_DEF_COMP_LZ4 default "lz4hc" if ZRAM_DEF_COMP_LZ4HC default "zstd" if ZRAM_DEF_COMP_ZSTD + default "deflate" if ZRAM_DEF_COMP_DEFLATE default "unset-value" config ZRAM_WRITEBACK diff --git a/drivers/block/zram/Makefile b/drivers/block/zram/Makefile index a2ca227e199c..266430548437 100644 --- a/drivers/block/zram/Makefile +++ b/drivers/block/zram/Makefile @@ -6,5 +6,6 @@ zram-$(CONFIG_ZRAM_BACKEND_LZO) += backend_lzorle.o backend_lzo.o zram-$(CONFIG_ZRAM_BACKEND_LZ4) += backend_lz4.o zram-$(CONFIG_ZRAM_BACKEND_LZ4HC) += backend_lz4hc.o zram-$(CONFIG_ZRAM_BACKEND_ZSTD) += backend_zstd.o +zram-$(CONFIG_ZRAM_BACKEND_DEFLATE) += backend_deflate.o obj-$(CONFIG_ZRAM) += zram.o diff --git a/drivers/block/zram/backend_deflate.c b/drivers/block/zram/backend_deflate.c new file mode 100644 index 000000000000..acefb86701b9 --- /dev/null +++ b/drivers/block/zram/backend_deflate.c @@ -0,0 +1,132 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include +#include + +#include "backend_deflate.h" + +/* Use the same value as crypto API */ +#define DEFLATE_DEF_WINBITS 11 +#define DEFLATE_DEF_MEMLEVEL MAX_MEM_LEVEL + +struct deflate_ctx { + struct z_stream_s cctx; + struct z_stream_s dctx; + s32 level; +}; + +static void deflate_destroy(void *ctx) +{ + struct deflate_ctx *zctx = ctx; + + if (zctx->cctx.workspace) { + zlib_deflateEnd(&zctx->cctx); + vfree(zctx->cctx.workspace); + } + if (zctx->dctx.workspace) { + zlib_inflateEnd(&zctx->dctx); + vfree(zctx->dctx.workspace); + } + kfree(zctx); +} + +static void *deflate_create(void) +{ + struct deflate_ctx *ctx; + size_t sz; + int ret; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + /* @FIXME: using a hardcoded Z_DEFAULT_COMPRESSION for now */ + ctx->level = Z_DEFAULT_COMPRESSION; + sz = zlib_deflate_workspacesize(-DEFLATE_DEF_WINBITS, MAX_MEM_LEVEL); + ctx->cctx.workspace = vzalloc(sz); + if (!ctx->cctx.workspace) + goto error; + + ret = zlib_deflateInit2(&ctx->cctx, ctx->level, Z_DEFLATED, + -DEFLATE_DEF_WINBITS, DEFLATE_DEF_MEMLEVEL, + Z_DEFAULT_STRATEGY); + if (ret != Z_OK) + goto error; + + sz = zlib_inflate_workspacesize(); + ctx->dctx.workspace = vzalloc(sz); + if (!ctx->dctx.workspace) + goto error; + + ret = zlib_inflateInit2(&ctx->dctx, 
-DEFLATE_DEF_WINBITS); + if (ret != Z_OK) + goto error; + + return ctx; + +error: + deflate_destroy(ctx); + return NULL; +} + +static int deflate_compress(void *ctx, const unsigned char *src, + size_t src_len, unsigned char *dst, + size_t *dst_len) +{ + struct deflate_ctx *zctx = ctx; + struct z_stream_s *deflate; + int ret; + + deflate = &zctx->cctx; + ret = zlib_deflateReset(deflate); + if (ret != Z_OK) + return -EINVAL; + + deflate->next_in = (u8 *)src; + deflate->avail_in = src_len; + deflate->next_out = (u8 *)dst; + deflate->avail_out = *dst_len; + + ret = zlib_deflate(deflate, Z_FINISH); + if (ret != Z_STREAM_END) + return -EINVAL; + + *dst_len = deflate->total_out; + return 0; +} + +static int deflate_decompress(void *ctx, const unsigned char *src, + size_t src_len, unsigned char *dst, + size_t dst_len) +{ + struct deflate_ctx *zctx = ctx; + struct z_stream_s *inflate; + int ret; + + inflate = &zctx->dctx; + + ret = zlib_inflateReset(inflate); + if (ret != Z_OK) + return -EINVAL; + + inflate->next_in = (u8 *)src; + inflate->avail_in = src_len; + inflate->next_out = (u8 *)dst; + inflate->avail_out = dst_len; + + ret = zlib_inflate(inflate, Z_SYNC_FLUSH); + if (ret != Z_STREAM_END) + return -EINVAL; + + return 0; +} + +const struct zcomp_ops backend_deflate = { + .compress = deflate_compress, + .decompress = deflate_decompress, + .create_ctx = deflate_create, + .destroy_ctx = deflate_destroy, + .name = "deflate", +}; diff --git a/drivers/block/zram/backend_deflate.h b/drivers/block/zram/backend_deflate.h new file mode 100644 index 000000000000..a39ac12b114c --- /dev/null +++ b/drivers/block/zram/backend_deflate.h @@ -0,0 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#ifndef __BACKEND_DEFLATE_H__ +#define __BACKEND_DEFLATE_H__ + +#include "zcomp.h" + +extern const struct zcomp_ops backend_deflate; + +#endif /* __BACKEND_DEFLATE_H__ */ diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index 07147659f4a2..d1396fe82b1d 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -17,6 +17,7 @@ #include "backend_lz4.h" #include "backend_lz4hc.h" #include "backend_zstd.h" +#include "backend_deflate.h" static const struct zcomp_ops *backends[] = { #if IS_ENABLED(CONFIG_ZRAM_BACKEND_LZO) @@ -31,6 +32,9 @@ static const struct zcomp_ops *backends[] = { #endif #if IS_ENABLED(CONFIG_ZRAM_BACKEND_ZSTD) &backend_zstd, +#endif +#if IS_ENABLED(CONFIG_ZRAM_BACKEND_DEFLATE) + &backend_deflate, #endif NULL }; From 1d3100cf148de1afb7b2282dbc5941fdc4722949 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:55:59 +0900 Subject: [PATCH 341/416] zram: add 842 compression backend support Add s/w 842 compression support. 
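For orientation, a minimal sketch of the callback set a software backend has to provide at this point in the series. The "nullcomp" backend below is hypothetical (a plain memcpy pass-through, used only to illustrate the zcomp_ops contract as it stands here) and is not part of this series:

#include <linux/slab.h>
#include <linux/string.h>

#include "zcomp.h"

static void *nullcomp_create(void)
{
	/* returning NULL is treated as an allocation failure by zcomp */
	return kzalloc(1, GFP_KERNEL);
}

static void nullcomp_destroy(void *ctx)
{
	kfree(ctx);
}

static int nullcomp_compress(void *ctx, const unsigned char *src,
			     size_t src_len, unsigned char *dst,
			     size_t *dst_len)
{
	/* "compression" is a plain copy; report the produced length */
	if (*dst_len < src_len)
		return -EINVAL;
	memcpy(dst, src, src_len);
	*dst_len = src_len;
	return 0;
}

static int nullcomp_decompress(void *ctx, const unsigned char *src,
			       size_t src_len, unsigned char *dst,
			       size_t dst_len)
{
	if (dst_len < src_len)
		return -EINVAL;
	memcpy(dst, src, src_len);
	return 0;
}

const struct zcomp_ops backend_nullcomp = {
	.compress	= nullcomp_compress,
	.decompress	= nullcomp_decompress,
	.create_ctx	= nullcomp_create,
	.destroy_ctx	= nullcomp_destroy,
	.name		= "nullcomp",
};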
Link: https://lkml.kernel.org/r/20240902105656.1383858-12-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/Kconfig | 11 ++++++ drivers/block/zram/Makefile | 1 + drivers/block/zram/backend_842.c | 68 ++++++++++++++++++++++++++++++++ drivers/block/zram/backend_842.h | 10 +++++ drivers/block/zram/zcomp.c | 4 ++ 5 files changed, 94 insertions(+) create mode 100644 drivers/block/zram/backend_842.c create mode 100644 drivers/block/zram/backend_842.h diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig index 9dedd2edfb28..1e0e7e5910b8 100644 --- a/drivers/block/zram/Kconfig +++ b/drivers/block/zram/Kconfig @@ -44,6 +44,12 @@ config ZRAM_BACKEND_DEFLATE select ZLIB_DEFLATE select ZLIB_INFLATE +config ZRAM_BACKEND_842 + bool "842 compression support" + depends on ZRAM + select 842_COMPRESS + select 842_DECOMPRESS + choice prompt "Default zram compressor" default ZRAM_DEF_COMP_LZORLE @@ -73,6 +79,10 @@ config ZRAM_DEF_COMP_DEFLATE bool "deflate" depends on ZRAM_BACKEND_DEFLATE +config ZRAM_DEF_COMP_842 + bool "842" + depends on ZRAM_BACKEND_842 + endchoice config ZRAM_DEF_COMP @@ -83,6 +93,7 @@ config ZRAM_DEF_COMP default "lz4hc" if ZRAM_DEF_COMP_LZ4HC default "zstd" if ZRAM_DEF_COMP_ZSTD default "deflate" if ZRAM_DEF_COMP_DEFLATE + default "842" if ZRAM_DEF_COMP_842 default "unset-value" config ZRAM_WRITEBACK diff --git a/drivers/block/zram/Makefile b/drivers/block/zram/Makefile index 266430548437..0fdefd576691 100644 --- a/drivers/block/zram/Makefile +++ b/drivers/block/zram/Makefile @@ -7,5 +7,6 @@ zram-$(CONFIG_ZRAM_BACKEND_LZ4) += backend_lz4.o zram-$(CONFIG_ZRAM_BACKEND_LZ4HC) += backend_lz4hc.o zram-$(CONFIG_ZRAM_BACKEND_ZSTD) += backend_zstd.o zram-$(CONFIG_ZRAM_BACKEND_DEFLATE) += backend_deflate.o +zram-$(CONFIG_ZRAM_BACKEND_842) += backend_842.o obj-$(CONFIG_ZRAM) += zram.o diff --git a/drivers/block/zram/backend_842.c b/drivers/block/zram/backend_842.c new file mode 100644 index 000000000000..005f3fcb9024 --- /dev/null +++ b/drivers/block/zram/backend_842.c @@ -0,0 +1,68 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include +#include + +#include "backend_842.h" + +struct sw842_ctx { + void *mem; +}; + +static void destroy_842(void *ctx) +{ + struct sw842_ctx *zctx = ctx; + + kfree(zctx->mem); + kfree(zctx); +} + +static void *create_842(void) +{ + struct sw842_ctx *ctx; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + ctx->mem = kmalloc(SW842_MEM_COMPRESS, GFP_KERNEL); + if (!ctx->mem) + goto error; + + return ctx; + +error: + destroy_842(ctx); + return NULL; +} + +static int compress_842(void *ctx, const unsigned char *src, size_t src_len, + unsigned char *dst, size_t *dst_len) +{ + struct sw842_ctx *zctx = ctx; + unsigned int dlen = *dst_len; + int ret; + + ret = sw842_compress(src, src_len, dst, &dlen, zctx->mem); + if (ret == 0) + *dst_len = dlen; + return ret; +} + +static int decompress_842(void *ctx, const unsigned char *src, size_t src_len, + unsigned char *dst, size_t dst_len) +{ + unsigned int dlen = dst_len; + + return sw842_decompress(src, src_len, dst, &dlen); +} + +const struct zcomp_ops backend_842 = { + .compress = compress_842, + .decompress = decompress_842, + .create_ctx = create_842, + .destroy_ctx = destroy_842, + .name = "842", +}; diff --git a/drivers/block/zram/backend_842.h b/drivers/block/zram/backend_842.h new file mode 100644 index 000000000000..4dc85c188799 --- /dev/null +++ 
b/drivers/block/zram/backend_842.h @@ -0,0 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#ifndef __BACKEND_842_H__ +#define __BACKEND_842_H__ + +#include "zcomp.h" + +extern const struct zcomp_ops backend_842; + +#endif /* __BACKEND_842_H__ */ diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index d1396fe82b1d..94e1c9503267 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -18,6 +18,7 @@ #include "backend_lz4hc.h" #include "backend_zstd.h" #include "backend_deflate.h" +#include "backend_842.h" static const struct zcomp_ops *backends[] = { #if IS_ENABLED(CONFIG_ZRAM_BACKEND_LZO) @@ -35,6 +36,9 @@ static const struct zcomp_ops *backends[] = { #endif #if IS_ENABLED(CONFIG_ZRAM_BACKEND_DEFLATE) &backend_deflate, +#endif +#if IS_ENABLED(CONFIG_ZRAM_BACKEND_842) + &backend_842, #endif NULL }; From 1a78390d8760c20c188f27feae74ddb44ae6f0bb Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:56:00 +0900 Subject: [PATCH 342/416] zram: check that backends array has at least one backend Make sure that the backends array has at least one backend apart from the sentinel NULL value. We also force-select ZRAM_BACKEND_LZO if no other backend was selected. Link: https://lkml.kernel.org/r/20240902105656.1383858-13-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/Kconfig | 19 +++++++++++++------ drivers/block/zram/zcomp.c | 8 ++++++++ 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig index 1e0e7e5910b8..6aea609b795c 100644 --- a/drivers/block/zram/Kconfig +++ b/drivers/block/zram/Kconfig @@ -14,12 +14,6 @@ config ZRAM See Documentation/admin-guide/blockdev/zram.rst for more information. -config ZRAM_BACKEND_LZO - bool "lzo and lzo-rle compression support" - depends on ZRAM - select LZO_COMPRESS - select LZO_DECOMPRESS - config ZRAM_BACKEND_LZ4 bool "lz4 compression support" depends on ZRAM @@ -50,6 +44,19 @@ config ZRAM_BACKEND_842 select 842_COMPRESS select 842_DECOMPRESS +config ZRAM_BACKEND_FORCE_LZO + depends on ZRAM + def_bool !ZRAM_BACKEND_LZ4 && !ZRAM_BACKEND_LZ4HC && \ + !ZRAM_BACKEND_ZSTD && !ZRAM_BACKEND_DEFLATE && \ + !ZRAM_BACKEND_842 + +config ZRAM_BACKEND_LZO + bool "lzo and lzo-rle compression support" if !ZRAM_BACKEND_FORCE_LZO + depends on ZRAM + default ZRAM_BACKEND_FORCE_LZO + select LZO_COMPRESS + select LZO_DECOMPRESS + choice prompt "Default zram compressor" default ZRAM_DEF_COMP_LZORLE diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index 94e1c9503267..76d21ee77067 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -192,6 +192,14 @@ struct zcomp *zcomp_create(const char *alg) struct zcomp *comp; int error; + /* + * The backends array has a sentinel NULL value, so the minimum + * size is 1. In order to be valid the array, apart from the + * sentinel NULL element, should have at least one compression + * backend selected. + */ + BUILD_BUG_ON(ARRAY_SIZE(backends) <= 1); + comp = kzalloc(sizeof(struct zcomp), GFP_KERNEL); if (!comp) return ERR_PTR(-ENOMEM); From f2bac7ad187d77e2a053d3cd04b158a06b683d26 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:56:01 +0900 Subject: [PATCH 343/416] zram: introduce zcomp_params structure We will store per-algorithm parameters there (compression level, dictionary, dictionary size, etc.).
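The per-backend consumption pattern this enables boils down to the following fragment (a sketch summarizing the hunks below; BACKEND_DEFAULT_LEVEL stands for whatever default a particular backend falls back to):

	if (params->level != ZCOMP_PARAM_NO_LEVEL)
		ctx->level = params->level;
	else
		ctx->level = BACKEND_DEFAULT_LEVEL;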
Link: https://lkml.kernel.org/r/20240902105656.1383858-14-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/backend_842.c | 2 +- drivers/block/zram/backend_deflate.c | 9 ++++++--- drivers/block/zram/backend_lz4.c | 9 ++++++--- drivers/block/zram/backend_lz4hc.c | 9 ++++++--- drivers/block/zram/backend_lzo.c | 2 +- drivers/block/zram/backend_lzorle.c | 2 +- drivers/block/zram/backend_zstd.c | 18 +++++++++++------- drivers/block/zram/zcomp.c | 5 +++-- drivers/block/zram/zcomp.h | 14 ++++++++++++-- drivers/block/zram/zram_drv.c | 20 +++++++++++++++++++- drivers/block/zram/zram_drv.h | 1 + 11 files changed, 67 insertions(+), 24 deletions(-) diff --git a/drivers/block/zram/backend_842.c b/drivers/block/zram/backend_842.c index 005f3fcb9024..e14dba30b0a2 100644 --- a/drivers/block/zram/backend_842.c +++ b/drivers/block/zram/backend_842.c @@ -19,7 +19,7 @@ static void destroy_842(void *ctx) kfree(zctx); } -static void *create_842(void) +static void *create_842(struct zcomp_params *params) { struct sw842_ctx *ctx; diff --git a/drivers/block/zram/backend_deflate.c b/drivers/block/zram/backend_deflate.c index acefb86701b9..ec662ce46897 100644 --- a/drivers/block/zram/backend_deflate.c +++ b/drivers/block/zram/backend_deflate.c @@ -32,7 +32,7 @@ static void deflate_destroy(void *ctx) kfree(zctx); } -static void *deflate_create(void) +static void *deflate_create(struct zcomp_params *params) { struct deflate_ctx *ctx; size_t sz; @@ -42,8 +42,11 @@ static void *deflate_create(void) if (!ctx) return NULL; - /* @FIXME: using a hardcoded Z_DEFAULT_COMPRESSION for now */ - ctx->level = Z_DEFAULT_COMPRESSION; + if (params->level != ZCOMP_PARAM_NO_LEVEL) + ctx->level = params->level; + else + ctx->level = Z_DEFAULT_COMPRESSION; + sz = zlib_deflate_workspacesize(-DEFLATE_DEF_WINBITS, MAX_MEM_LEVEL); ctx->cctx.workspace = vzalloc(sz); if (!ctx->cctx.workspace) diff --git a/drivers/block/zram/backend_lz4.c b/drivers/block/zram/backend_lz4.c index c1d19fed5af2..ec57b5acbd39 100644 --- a/drivers/block/zram/backend_lz4.c +++ b/drivers/block/zram/backend_lz4.c @@ -18,7 +18,7 @@ static void lz4_destroy(void *ctx) kfree(zctx); } -static void *lz4_create(void) +static void *lz4_create(struct zcomp_params *params) { struct lz4_ctx *ctx; @@ -26,8 +26,11 @@ static void *lz4_create(void) if (!ctx) return NULL; - /* @FIXME: using a hardcoded LZ4_ACCELERATION_DEFAULT for now */ - ctx->level = LZ4_ACCELERATION_DEFAULT; + if (params->level != ZCOMP_PARAM_NO_LEVEL) + ctx->level = params->level; + else + ctx->level = LZ4_ACCELERATION_DEFAULT; + ctx->mem = vmalloc(LZ4_MEM_COMPRESS); if (!ctx->mem) goto error; diff --git a/drivers/block/zram/backend_lz4hc.c b/drivers/block/zram/backend_lz4hc.c index 536f7a0073c4..80733fead595 100644 --- a/drivers/block/zram/backend_lz4hc.c +++ b/drivers/block/zram/backend_lz4hc.c @@ -18,7 +18,7 @@ static void lz4hc_destroy(void *ctx) kfree(zctx); } -static void *lz4hc_create(void) +static void *lz4hc_create(struct zcomp_params *params) { struct lz4hc_ctx *ctx; @@ -26,8 +26,11 @@ static void *lz4hc_create(void) if (!ctx) return NULL; - /* @FIXME: using a hardcoded LZ4HC_DEFAULT_CLEVEL for now */ - ctx->level = LZ4HC_DEFAULT_CLEVEL; + if (params->level != ZCOMP_PARAM_NO_LEVEL) + ctx->level = params->level; + else + ctx->level = LZ4HC_DEFAULT_CLEVEL; + ctx->mem = vmalloc(LZ4HC_MEM_COMPRESS); if (!ctx->mem) goto error; diff --git a/drivers/block/zram/backend_lzo.c b/drivers/block/zram/backend_lzo.c index 
4c9e611f6c03..8e4aabd04bf3 100644 --- a/drivers/block/zram/backend_lzo.c +++ b/drivers/block/zram/backend_lzo.c @@ -6,7 +6,7 @@ #include "backend_lzo.h" -static void *lzo_create(void) +static void *lzo_create(struct zcomp_params *params) { return kzalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL); } diff --git a/drivers/block/zram/backend_lzorle.c b/drivers/block/zram/backend_lzorle.c index 19a9ce132046..cb01eb8b04f4 100644 --- a/drivers/block/zram/backend_lzorle.c +++ b/drivers/block/zram/backend_lzorle.c @@ -6,7 +6,7 @@ #include "backend_lzorle.h" -static void *lzorle_create(void) +static void *lzorle_create(struct zcomp_params *params) { return kzalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL); } diff --git a/drivers/block/zram/backend_zstd.c b/drivers/block/zram/backend_zstd.c index 310d970078e2..c356c5e6e503 100644 --- a/drivers/block/zram/backend_zstd.c +++ b/drivers/block/zram/backend_zstd.c @@ -24,9 +24,9 @@ static void zstd_destroy(void *ctx) kfree(zctx); } -static void *zstd_create(void) +static void *zstd_create(struct zcomp_params *params) { - zstd_parameters params; + zstd_parameters prm; struct zstd_ctx *ctx; size_t sz; @@ -34,9 +34,13 @@ static void *zstd_create(void) if (!ctx) return NULL; - ctx->level = zstd_default_clevel(); - params = zstd_get_params(ctx->level, PAGE_SIZE); - sz = zstd_cctx_workspace_bound(¶ms.cParams); + if (params->level != ZCOMP_PARAM_NO_LEVEL) + ctx->level = params->level; + else + ctx->level = zstd_default_clevel(); + + prm = zstd_get_params(ctx->level, PAGE_SIZE); + sz = zstd_cctx_workspace_bound(&prm.cParams); ctx->cctx_mem = vzalloc(sz); if (!ctx->cctx_mem) goto error; @@ -65,11 +69,11 @@ static int zstd_compress(void *ctx, const unsigned char *src, size_t src_len, unsigned char *dst, size_t *dst_len) { struct zstd_ctx *zctx = ctx; - const zstd_parameters params = zstd_get_params(zctx->level, PAGE_SIZE); + const zstd_parameters prm = zstd_get_params(zctx->level, PAGE_SIZE); size_t ret; ret = zstd_compress_cctx(zctx->cctx, dst, *dst_len, - src, src_len, ¶ms); + src, src_len, &prm); if (zstd_is_error(ret)) return -EINVAL; *dst_len = ret; diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index 76d21ee77067..173ee2b79442 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -54,7 +54,7 @@ static void zcomp_strm_free(struct zcomp *comp, struct zcomp_strm *zstrm) static int zcomp_strm_init(struct zcomp *comp, struct zcomp_strm *zstrm) { - zstrm->ctx = comp->ops->create_ctx(); + zstrm->ctx = comp->ops->create_ctx(comp->params); /* * allocate 2 pages. 1 for compressed data, plus 1 extra for the @@ -187,7 +187,7 @@ void zcomp_destroy(struct zcomp *comp) kfree(comp); } -struct zcomp *zcomp_create(const char *alg) +struct zcomp *zcomp_create(const char *alg, struct zcomp_params *params) { struct zcomp *comp; int error; @@ -204,6 +204,7 @@ struct zcomp *zcomp_create(const char *alg) if (!comp) return ERR_PTR(-ENOMEM); + comp->params = params; comp->ops = lookup_backend_ops(alg); if (!comp->ops) { kfree(comp); diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h index e5eb5ec4c645..217a750fa908 100644 --- a/drivers/block/zram/zcomp.h +++ b/drivers/block/zram/zcomp.h @@ -2,8 +2,17 @@ #ifndef _ZCOMP_H_ #define _ZCOMP_H_ + #include +#define ZCOMP_PARAM_NO_LEVEL INT_MIN + +struct zcomp_params { + void *dict; + size_t dict_sz; + s32 level; +}; + struct zcomp_strm { /* The members ->buffer and ->tfm are protected by ->lock. 
*/ local_lock_t lock; @@ -19,7 +28,7 @@ struct zcomp_ops { int (*decompress)(void *ctx, const unsigned char *src, size_t src_len, unsigned char *dst, size_t dst_len); - void *(*create_ctx)(void); + void *(*create_ctx)(struct zcomp_params *params); void (*destroy_ctx)(void *ctx); const char *name; @@ -29,6 +38,7 @@ struct zcomp { struct zcomp_strm __percpu *stream; const struct zcomp_ops *ops; + struct zcomp_params *params; struct hlist_node node; }; @@ -37,7 +47,7 @@ int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node); ssize_t zcomp_available_show(const char *comp, char *buf); bool zcomp_available_algorithm(const char *comp); -struct zcomp *zcomp_create(const char *alg); +struct zcomp *zcomp_create(const char *alg, struct zcomp_params *params); void zcomp_destroy(struct zcomp *comp); struct zcomp_strm *zcomp_stream_get(struct zcomp *comp); diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 93042da8ccdf..ff6724bbdf91 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -1979,6 +1979,20 @@ static void zram_slot_free_notify(struct block_device *bdev, zram_slot_unlock(zram, index); } +static void zram_comp_params_reset(struct zram *zram) +{ + u32 prio; + + for (prio = ZRAM_PRIMARY_COMP; prio < ZRAM_MAX_COMPS; prio++) { + struct zcomp_params *params = &zram->params[prio]; + + vfree(params->dict); + params->level = ZCOMP_PARAM_NO_LEVEL; + params->dict_sz = 0; + params->dict = NULL; + } +} + static void zram_destroy_comps(struct zram *zram) { u32 prio; @@ -1992,6 +2006,8 @@ static void zram_destroy_comps(struct zram *zram) zcomp_destroy(comp); zram->num_active_comps--; } + + zram_comp_params_reset(zram); } static void zram_reset_device(struct zram *zram) @@ -2049,7 +2065,8 @@ static ssize_t disksize_store(struct device *dev, if (!zram->comp_algs[prio]) continue; - comp = zcomp_create(zram->comp_algs[prio]); + comp = zcomp_create(zram->comp_algs[prio], + &zram->params[prio]); if (IS_ERR(comp)) { pr_err("Cannot initialise %s compressing backend\n", zram->comp_algs[prio]); @@ -2254,6 +2271,7 @@ static int zram_add(void) if (ret) goto out_cleanup_disk; + zram_comp_params_reset(zram); comp_algorithm_set(zram, ZRAM_PRIMARY_COMP, default_compressor); zram_debugfs_register(zram); diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h index 35e322144629..b976824ead67 100644 --- a/drivers/block/zram/zram_drv.h +++ b/drivers/block/zram/zram_drv.h @@ -107,6 +107,7 @@ struct zram { struct zram_table_entry *table; struct zs_pool *mem_pool; struct zcomp *comps[ZRAM_MAX_COMPS]; + struct zcomp_params params[ZRAM_MAX_COMPS]; struct gendisk *disk; /* Prevent concurrent execution of device init */ struct rw_semaphore init_lock; From eb826a01909a2d39660ed5a486ebc43e831254bc Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:56:02 +0900 Subject: [PATCH 344/416] zram: recalculate zstd compression params once zstd compression params depend on the compression level, but are constant for a given instance of the zstd compression backend. Calculate them once (during ctx creation).
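A sketch of the resulting pattern (fragments of the zstd backend as modified below; both inputs to zstd_get_params() are fixed for the lifetime of the context, so the result can be cached at creation time and reused on every request):

	/* at context creation */
	zctx->cprm = zstd_get_params(zctx->level, PAGE_SIZE);

	/* per compression request: reuse the cached parameters */
	ret = zstd_compress_cctx(zctx->cctx, dst, *dst_len, src, src_len,
				 &zctx->cprm);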
Link: https://lkml.kernel.org/r/20240902105656.1383858-15-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/backend_zstd.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/block/zram/backend_zstd.c b/drivers/block/zram/backend_zstd.c index c356c5e6e503..7c6798f0c912 100644 --- a/drivers/block/zram/backend_zstd.c +++ b/drivers/block/zram/backend_zstd.c @@ -10,6 +10,7 @@ struct zstd_ctx { zstd_cctx *cctx; zstd_dctx *dctx; + zstd_parameters cprm; void *cctx_mem; void *dctx_mem; s32 level; @@ -40,6 +41,7 @@ static void *zstd_create(struct zcomp_params *params) ctx->level = zstd_default_clevel(); prm = zstd_get_params(ctx->level, PAGE_SIZE); + ctx->cprm = zstd_get_params(ctx->level, PAGE_SIZE); sz = zstd_cctx_workspace_bound(&prm.cParams); ctx->cctx_mem = vzalloc(sz); if (!ctx->cctx_mem) @@ -69,11 +71,10 @@ static int zstd_compress(void *ctx, const unsigned char *src, size_t src_len, unsigned char *dst, size_t *dst_len) { struct zstd_ctx *zctx = ctx; - const zstd_parameters prm = zstd_get_params(zctx->level, PAGE_SIZE); size_t ret; ret = zstd_compress_cctx(zctx->cctx, dst, *dst_len, - src, src_len, &prm); + src, src_len, &zctx->cprm); if (zstd_is_error(ret)) return -EINVAL; *dst_len = ret; From 4eac932103a5d8c3a1bdf6776dbc1a178c31d896 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:56:03 +0900 Subject: [PATCH 345/416] zram: introduce algorithm_params device attribute This attribute is used to setup compression algorithms' parameters, so we can tweak algorithms' characteristics. At this point only 'level' is supported (to be extended in the future). Each call sets up parameters for one particular algorithm, which should be specified either by the algorithm's priority or algo name. This is expected to be called after corresponding algorithm is selected via comp_algorithm or recomp_algorithm. echo "priority=0 level=1" > /sys/block/zram0/algorithm_params or echo "algo=zstd level=1" > /sys/block/zram0/algorithm_params Link: https://lkml.kernel.org/r/20240902105656.1383858-16-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- Documentation/ABI/testing/sysfs-block-zram | 7 +++ Documentation/admin-guide/blockdev/zram.rst | 1 + drivers/block/zram/zram_drv.c | 68 +++++++++++++++++++++ 3 files changed, 76 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram index 628a00fb20a9..1ef69e0271f9 100644 --- a/Documentation/ABI/testing/sysfs-block-zram +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -151,3 +151,10 @@ Contact: Sergey Senozhatsky Description: The recompress file is write-only and triggers re-compression with secondary compression algorithms. + +What: /sys/block/zram/algorithm_params +Date: August 2024 +Contact: Sergey Senozhatsky +Description: + The algorithm_params file is write-only and is used to setup + compression algorithm parameters. 
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 181d55d64326..96d81dc12528 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -198,6 +198,7 @@ writeback_limit_enable RW show and set writeback_limit feature max_comp_streams RW the number of possible concurrent compress operations comp_algorithm RW show and change the compression algorithm +algorithm_params WO setup compression algorithm parameters compact WO trigger memory compaction debug_stat RO this file is used for zram debugging purposes backing_dev RW set up backend storage for zram to write out diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index ff6724bbdf91..e29e952b99c3 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -998,6 +998,72 @@ static int __comp_algorithm_store(struct zram *zram, u32 prio, const char *buf) return 0; } +static int comp_params_store(struct zram *zram, u32 prio, s32 level) +{ + zram->params[prio].level = level; + return 0; +} + +static ssize_t algorithm_params_store(struct device *dev, + struct device_attribute *attr, + const char *buf, + size_t len) +{ + s32 prio = ZRAM_PRIMARY_COMP, level = ZCOMP_PARAM_NO_LEVEL; + char *args, *param, *val, *algo = NULL; + struct zram *zram = dev_to_zram(dev); + int ret; + + args = skip_spaces(buf); + while (*args) { + args = next_arg(args, ¶m, &val); + + if (!val || !*val) + return -EINVAL; + + if (!strcmp(param, "priority")) { + ret = kstrtoint(val, 10, &prio); + if (ret) + return ret; + continue; + } + + if (!strcmp(param, "level")) { + ret = kstrtoint(val, 10, &level); + if (ret) + return ret; + continue; + } + + if (!strcmp(param, "algo")) { + algo = val; + continue; + } + } + + /* Lookup priority by algorithm name */ + if (algo) { + s32 p; + + prio = -EINVAL; + for (p = ZRAM_PRIMARY_COMP; p < ZRAM_MAX_COMPS; p++) { + if (!zram->comp_algs[p]) + continue; + + if (!strcmp(zram->comp_algs[p], algo)) { + prio = p; + break; + } + } + } + + if (prio < ZRAM_PRIMARY_COMP || prio >= ZRAM_MAX_COMPS) + return -EINVAL; + + ret = comp_params_store(zram, prio, level); + return ret ? ret : len; +} + static ssize_t comp_algorithm_show(struct device *dev, struct device_attribute *attr, char *buf) @@ -2169,6 +2235,7 @@ static DEVICE_ATTR_RW(writeback_limit_enable); static DEVICE_ATTR_RW(recomp_algorithm); static DEVICE_ATTR_WO(recompress); #endif +static DEVICE_ATTR_WO(algorithm_params); static struct attribute *zram_disk_attrs[] = { &dev_attr_disksize.attr, @@ -2196,6 +2263,7 @@ static struct attribute *zram_disk_attrs[] = { &dev_attr_recomp_algorithm.attr, &dev_attr_recompress.attr, #endif + &dev_attr_algorithm_params.attr, NULL, }; From dea77d7aea9855af321b835e4fb2fff68cd94dac Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:56:04 +0900 Subject: [PATCH 346/416] zram: add support for dict comp config Handle dict=path algorithm param so that we can read a pre-trained compression algorithm dictionary which we then pass to the backend configuration. 
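For illustration only (hypothetical path and level, following the algorithm_params syntax introduced earlier in this series; for zstd such a dictionary could, for example, be produced with the zstd CLI's --train mode):

echo "algo=zstd level=8 dict=/etc/zram/zstd.dict" > /sys/block/zram0/algorithm_params

Since the compression backends are created when the disk size is set, this is expected to be written before disksize.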
Link: https://lkml.kernel.org/r/20240902105656.1383858-17-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/zram_drv.c | 45 ++++++++++++++++++++++++++++------- 1 file changed, 36 insertions(+), 9 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index e29e952b99c3..9b29f9e6cb50 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -33,6 +33,7 @@ #include #include #include +#include #include "zram_drv.h" @@ -998,8 +999,34 @@ static int __comp_algorithm_store(struct zram *zram, u32 prio, const char *buf) return 0; } -static int comp_params_store(struct zram *zram, u32 prio, s32 level) +static void comp_params_reset(struct zram *zram, u32 prio) { + struct zcomp_params *params = &zram->params[prio]; + + vfree(params->dict); + params->level = ZCOMP_PARAM_NO_LEVEL; + params->dict_sz = 0; + params->dict = NULL; +} + +static int comp_params_store(struct zram *zram, u32 prio, s32 level, + const char *dict_path) +{ + ssize_t sz = 0; + + comp_params_reset(zram, prio); + + if (dict_path) { + sz = kernel_read_file_from_path(dict_path, 0, + &zram->params[prio].dict, + INT_MAX, + NULL, + READING_POLICY); + if (sz < 0) + return -EINVAL; + } + + zram->params[prio].dict_sz = sz; zram->params[prio].level = level; return 0; } @@ -1010,7 +1037,7 @@ static ssize_t algorithm_params_store(struct device *dev, size_t len) { s32 prio = ZRAM_PRIMARY_COMP, level = ZCOMP_PARAM_NO_LEVEL; - char *args, *param, *val, *algo = NULL; + char *args, *param, *val, *algo = NULL, *dict_path = NULL; struct zram *zram = dev_to_zram(dev); int ret; @@ -1039,6 +1066,11 @@ static ssize_t algorithm_params_store(struct device *dev, algo = val; continue; } + + if (!strcmp(param, "dict")) { + dict_path = val; + continue; + } } /* Lookup priority by algorithm name */ @@ -1060,7 +1092,7 @@ static ssize_t algorithm_params_store(struct device *dev, if (prio < ZRAM_PRIMARY_COMP || prio >= ZRAM_MAX_COMPS) return -EINVAL; - ret = comp_params_store(zram, prio, level); + ret = comp_params_store(zram, prio, level, dict_path); return ret ? ret : len; } @@ -2050,12 +2082,7 @@ static void zram_comp_params_reset(struct zram *zram) u32 prio; for (prio = ZRAM_PRIMARY_COMP; prio < ZRAM_MAX_COMPS; prio++) { - struct zcomp_params *params = &zram->params[prio]; - - vfree(params->dict); - params->level = ZCOMP_PARAM_NO_LEVEL; - params->dict_sz = 0; - params->dict = NULL; + comp_params_reset(zram, prio); } } From 52c7b4e2ba508a924c991e681db534e66a851adf Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:56:05 +0900 Subject: [PATCH 347/416] zram: introduce zcomp_req structure Encapsulate compression/decompression data in zcomp_req structure. 
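A sketch of the calling convention this introduces (it mirrors zcomp_compress() in the hunks below and adds nothing new; note that dst_len is in/out for compression: buffer size on entry, compressed length on return):

	struct zcomp_req req = {
		.src = src,
		.dst = zstrm->buffer,
		.src_len = PAGE_SIZE,
		.dst_len = 2 * PAGE_SIZE,
	};
	int ret;

	ret = comp->ops->compress(zstrm->ctx, &req);
	if (!ret)
		*dst_len = req.dst_len;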
Link: https://lkml.kernel.org/r/20240902105656.1383858-18-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/backend_842.c | 17 ++++++++--------- drivers/block/zram/backend_deflate.c | 26 +++++++++++--------------- drivers/block/zram/backend_lz4.c | 15 +++++++-------- drivers/block/zram/backend_lz4hc.c | 13 ++++++------- drivers/block/zram/backend_lzo.c | 12 ++++++------ drivers/block/zram/backend_lzorle.c | 13 ++++++------- drivers/block/zram/backend_zstd.c | 15 +++++++-------- drivers/block/zram/zcomp.c | 23 ++++++++++++++++------- drivers/block/zram/zcomp.h | 15 ++++++++++----- 9 files changed, 77 insertions(+), 72 deletions(-) diff --git a/drivers/block/zram/backend_842.c b/drivers/block/zram/backend_842.c index e14dba30b0a2..1597dc4d4f1c 100644 --- a/drivers/block/zram/backend_842.c +++ b/drivers/block/zram/backend_842.c @@ -38,25 +38,24 @@ error: return NULL; } -static int compress_842(void *ctx, const unsigned char *src, size_t src_len, - unsigned char *dst, size_t *dst_len) +static int compress_842(void *ctx, struct zcomp_req *req) { struct sw842_ctx *zctx = ctx; - unsigned int dlen = *dst_len; + unsigned int dlen = req->dst_len; int ret; - ret = sw842_compress(src, src_len, dst, &dlen, zctx->mem); + ret = sw842_compress(req->src, req->src_len, req->dst, &dlen, + zctx->mem); if (ret == 0) - *dst_len = dlen; + req->dst_len = dlen; return ret; } -static int decompress_842(void *ctx, const unsigned char *src, size_t src_len, - unsigned char *dst, size_t dst_len) +static int decompress_842(void *ctx, struct zcomp_req *req) { - unsigned int dlen = dst_len; + unsigned int dlen = req->dst_len; - return sw842_decompress(src, src_len, dst, &dlen); + return sw842_decompress(req->src, req->src_len, req->dst, &dlen); } const struct zcomp_ops backend_842 = { diff --git a/drivers/block/zram/backend_deflate.c b/drivers/block/zram/backend_deflate.c index ec662ce46897..117852d45aa4 100644 --- a/drivers/block/zram/backend_deflate.c +++ b/drivers/block/zram/backend_deflate.c @@ -74,9 +74,7 @@ error: return NULL; } -static int deflate_compress(void *ctx, const unsigned char *src, - size_t src_len, unsigned char *dst, - size_t *dst_len) +static int deflate_compress(void *ctx, struct zcomp_req *req) { struct deflate_ctx *zctx = ctx; struct z_stream_s *deflate; @@ -87,22 +85,20 @@ static int deflate_compress(void *ctx, const unsigned char *src, if (ret != Z_OK) return -EINVAL; - deflate->next_in = (u8 *)src; - deflate->avail_in = src_len; - deflate->next_out = (u8 *)dst; - deflate->avail_out = *dst_len; + deflate->next_in = (u8 *)req->src; + deflate->avail_in = req->src_len; + deflate->next_out = (u8 *)req->dst; + deflate->avail_out = req->dst_len; ret = zlib_deflate(deflate, Z_FINISH); if (ret != Z_STREAM_END) return -EINVAL; - *dst_len = deflate->total_out; + req->dst_len = deflate->total_out; return 0; } -static int deflate_decompress(void *ctx, const unsigned char *src, - size_t src_len, unsigned char *dst, - size_t dst_len) +static int deflate_decompress(void *ctx, struct zcomp_req *req) { struct deflate_ctx *zctx = ctx; struct z_stream_s *inflate; @@ -114,10 +110,10 @@ static int deflate_decompress(void *ctx, const unsigned char *src, if (ret != Z_OK) return -EINVAL; - inflate->next_in = (u8 *)src; - inflate->avail_in = src_len; - inflate->next_out = (u8 *)dst; - inflate->avail_out = dst_len; + inflate->next_in = (u8 *)req->src; + inflate->avail_in = req->src_len; + inflate->next_out = (u8 *)req->dst; + 
inflate->avail_out = req->dst_len; ret = zlib_inflate(inflate, Z_SYNC_FLUSH); if (ret != Z_STREAM_END) diff --git a/drivers/block/zram/backend_lz4.c b/drivers/block/zram/backend_lz4.c index ec57b5acbd39..cc4ae4097870 100644 --- a/drivers/block/zram/backend_lz4.c +++ b/drivers/block/zram/backend_lz4.c @@ -41,26 +41,25 @@ error: return NULL; } -static int lz4_compress(void *ctx, const unsigned char *src, size_t src_len, - unsigned char *dst, size_t *dst_len) +static int lz4_compress(void *ctx, struct zcomp_req *req) { struct lz4_ctx *zctx = ctx; int ret; - ret = LZ4_compress_fast(src, dst, src_len, *dst_len, - zctx->level, zctx->mem); + ret = LZ4_compress_fast(req->src, req->dst, req->src_len, + req->dst_len, zctx->level, zctx->mem); if (!ret) return -EINVAL; - *dst_len = ret; + req->dst_len = ret; return 0; } -static int lz4_decompress(void *ctx, const unsigned char *src, - size_t src_len, unsigned char *dst, size_t dst_len) +static int lz4_decompress(void *ctx, struct zcomp_req *req) { int ret; - ret = LZ4_decompress_safe(src, dst, src_len, dst_len); + ret = LZ4_decompress_safe(req->src, req->dst, req->src_len, + req->dst_len); if (ret < 0) return -EINVAL; return 0; diff --git a/drivers/block/zram/backend_lz4hc.c b/drivers/block/zram/backend_lz4hc.c index 80733fead595..610dadc09751 100644 --- a/drivers/block/zram/backend_lz4hc.c +++ b/drivers/block/zram/backend_lz4hc.c @@ -41,26 +41,25 @@ error: return NULL; } -static int lz4hc_compress(void *ctx, const unsigned char *src, size_t src_len, - unsigned char *dst, size_t *dst_len) +static int lz4hc_compress(void *ctx, struct zcomp_req *req) { struct lz4hc_ctx *zctx = ctx; int ret; - ret = LZ4_compress_HC(src, dst, src_len, *dst_len, + ret = LZ4_compress_HC(req->src, req->dst, req->src_len, req->dst_len, zctx->level, zctx->mem); if (!ret) return -EINVAL; - *dst_len = ret; + req->dst_len = ret; return 0; } -static int lz4hc_decompress(void *ctx, const unsigned char *src, - size_t src_len, unsigned char *dst, size_t dst_len) +static int lz4hc_decompress(void *ctx, struct zcomp_req *req) { int ret; - ret = LZ4_decompress_safe(src, dst, src_len, dst_len); + ret = LZ4_decompress_safe(req->src, req->dst, req->src_len, + req->dst_len); if (ret < 0) return -EINVAL; return 0; diff --git a/drivers/block/zram/backend_lzo.c b/drivers/block/zram/backend_lzo.c index 8e4aabd04bf3..88f1aa5683f3 100644 --- a/drivers/block/zram/backend_lzo.c +++ b/drivers/block/zram/backend_lzo.c @@ -16,21 +16,21 @@ static void lzo_destroy(void *ctx) kfree(ctx); } -static int lzo_compress(void *ctx, const unsigned char *src, size_t src_len, - unsigned char *dst, size_t *dst_len) +static int lzo_compress(void *ctx, struct zcomp_req *req) { int ret; - ret = lzo1x_1_compress(src, src_len, dst, dst_len, ctx); + ret = lzo1x_1_compress(req->src, req->src_len, req->dst, + &req->dst_len, ctx); return ret == LZO_E_OK ? 0 : ret; } -static int lzo_decompress(void *ctx, const unsigned char *src, size_t src_len, - unsigned char *dst, size_t dst_len) +static int lzo_decompress(void *ctx, struct zcomp_req *req) { int ret; - ret = lzo1x_decompress_safe(src, src_len, dst, &dst_len); + ret = lzo1x_decompress_safe(req->src, req->src_len, + req->dst, &req->dst_len); return ret == LZO_E_OK ? 
0 : ret; } diff --git a/drivers/block/zram/backend_lzorle.c b/drivers/block/zram/backend_lzorle.c index cb01eb8b04f4..f05820929a74 100644 --- a/drivers/block/zram/backend_lzorle.c +++ b/drivers/block/zram/backend_lzorle.c @@ -16,22 +16,21 @@ static void lzorle_destroy(void *ctx) kfree(ctx); } -static int lzorle_compress(void *ctx, const unsigned char *src, size_t src_len, - unsigned char *dst, size_t *dst_len) +static int lzorle_compress(void *ctx, struct zcomp_req *req) { int ret; - ret = lzorle1x_1_compress(src, src_len, dst, dst_len, ctx); + ret = lzorle1x_1_compress(req->src, req->src_len, req->dst, + &req->dst_len, ctx); return ret == LZO_E_OK ? 0 : ret; } -static int lzorle_decompress(void *ctx, const unsigned char *src, - size_t src_len, unsigned char *dst, - size_t dst_len) +static int lzorle_decompress(void *ctx, struct zcomp_req *req) { int ret; - ret = lzo1x_decompress_safe(src, src_len, dst, &dst_len); + ret = lzo1x_decompress_safe(req->src, req->src_len, + req->dst, &req->dst_len); return ret == LZO_E_OK ? 0 : ret; } diff --git a/drivers/block/zram/backend_zstd.c b/drivers/block/zram/backend_zstd.c index 7c6798f0c912..19eba65f44c9 100644 --- a/drivers/block/zram/backend_zstd.c +++ b/drivers/block/zram/backend_zstd.c @@ -67,27 +67,26 @@ error: return NULL; } -static int zstd_compress(void *ctx, const unsigned char *src, size_t src_len, - unsigned char *dst, size_t *dst_len) +static int zstd_compress(void *ctx, struct zcomp_req *req) { struct zstd_ctx *zctx = ctx; size_t ret; - ret = zstd_compress_cctx(zctx->cctx, dst, *dst_len, - src, src_len, &zctx->cprm); + ret = zstd_compress_cctx(zctx->cctx, req->dst, req->dst_len, + req->src, req->src_len, &zctx->cprm); if (zstd_is_error(ret)) return -EINVAL; - *dst_len = ret; + req->dst_len = ret; return 0; } -static int zstd_decompress(void *ctx, const unsigned char *src, size_t src_len, - unsigned char *dst, size_t dst_len) +static int zstd_decompress(void *ctx, struct zcomp_req *req) { struct zstd_ctx *zctx = ctx; size_t ret; - ret = zstd_decompress_dctx(zctx->dctx, dst, dst_len, src, src_len); + ret = zstd_decompress_dctx(zctx->dctx, req->dst, req->dst_len, + req->src, req->src_len); if (zstd_is_error(ret)) return -EINVAL; return 0; diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index 173ee2b79442..20ad7b6fe62f 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -119,22 +119,31 @@ void zcomp_stream_put(struct zcomp *comp) int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm, const void *src, unsigned int *dst_len) { - /* The dst buffer should always be 2 * PAGE_SIZE */ - size_t dlen = 2 * PAGE_SIZE; + struct zcomp_req req = { + .src = src, + .dst = zstrm->buffer, + .src_len = PAGE_SIZE, + .dst_len = 2 * PAGE_SIZE, + }; int ret; - ret = comp->ops->compress(zstrm->ctx, src, PAGE_SIZE, - zstrm->buffer, &dlen); + ret = comp->ops->compress(zstrm->ctx, &req); if (!ret) - *dst_len = dlen; + *dst_len = req.dst_len; return ret; } int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm, const void *src, unsigned int src_len, void *dst) { - return comp->ops->decompress(zstrm->ctx, src, src_len, - dst, PAGE_SIZE); + struct zcomp_req req = { + .src = src, + .dst = dst, + .src_len = src_len, + .dst_len = PAGE_SIZE, + }; + + return comp->ops->decompress(zstrm->ctx, &req); } int zcomp_cpu_up_prepare(unsigned int cpu, struct hlist_node *node) diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h index 217a750fa908..bbc48094f826 100644 --- a/drivers/block/zram/zcomp.h +++ 
b/drivers/block/zram/zcomp.h @@ -21,12 +21,17 @@ struct zcomp_strm { void *ctx; }; -struct zcomp_ops { - int (*compress)(void *ctx, const unsigned char *src, size_t src_len, - unsigned char *dst, size_t *dst_len); +struct zcomp_req { + const unsigned char *src; + const size_t src_len; - int (*decompress)(void *ctx, const unsigned char *src, size_t src_len, - unsigned char *dst, size_t dst_len); + unsigned char *dst; + size_t dst_len; +}; + +struct zcomp_ops { + int (*compress)(void *ctx, struct zcomp_req *req); + int (*decompress)(void *ctx, struct zcomp_req *req); void *(*create_ctx)(struct zcomp_params *params); void (*destroy_ctx)(void *ctx); From 6a81bdfeb35094c3097650306a5fda9a990d8a97 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:56:06 +0900 Subject: [PATCH 348/416] zram: introduce zcomp_ctx structure Keep run-time driver data (scratch buffers, etc.) in the zcomp_ctx structure. This structure is allocated per-CPU because drivers (backends) need to modify its content during request execution. We will split mutable and immutable driver data; this is a preparation patch. Link: https://lkml.kernel.org/r/20240902105656.1383858-19-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/backend_842.c | 39 ++++++-------------- drivers/block/zram/backend_deflate.c | 46 +++++++++++++----------- drivers/block/zram/backend_lz4.c | 36 ++++++++++--------- drivers/block/zram/backend_lz4hc.c | 36 ++++++++++--------- drivers/block/zram/backend_lzo.c | 17 +++++---- drivers/block/zram/backend_lzorle.c | 17 +++++---- drivers/block/zram/backend_zstd.c | 54 +++++++++++++++------------- drivers/block/zram/zcomp.c | 16 +++++---- drivers/block/zram/zcomp.h | 23 ++++++++---- 9 files changed, 149 insertions(+), 135 deletions(-) diff --git a/drivers/block/zram/backend_842.c b/drivers/block/zram/backend_842.c index 1597dc4d4f1c..2f1202322264 100644 --- a/drivers/block/zram/backend_842.c +++ b/drivers/block/zram/backend_842.c @@ -7,51 +7,32 @@ #include "backend_842.h" -struct sw842_ctx { - void *mem; -}; - -static void destroy_842(void *ctx) +static void destroy_842(struct zcomp_ctx *ctx) { - struct sw842_ctx *zctx = ctx; - - kfree(zctx->mem); - kfree(zctx); + kfree(ctx->context); } -static void *create_842(struct zcomp_params *params) +static int create_842(struct zcomp_params *params, struct zcomp_ctx *ctx) { - struct sw842_ctx *ctx; - - ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); - if (!ctx) - return NULL; - - ctx->mem = kmalloc(SW842_MEM_COMPRESS, GFP_KERNEL); - if (!ctx->mem) - goto error; - - return ctx; - -error: - destroy_842(ctx); - return NULL; + ctx->context = kmalloc(SW842_MEM_COMPRESS, GFP_KERNEL); + if (!ctx->context) + return -ENOMEM; + return 0; } -static int compress_842(void *ctx, struct zcomp_req *req) +static int compress_842(struct zcomp_ctx *ctx, struct zcomp_req *req) { - struct sw842_ctx *zctx = ctx; unsigned int dlen = req->dst_len; int ret; ret = sw842_compress(req->src, req->src_len, req->dst, &dlen, - zctx->mem); + ctx->context); if (ret == 0) req->dst_len = dlen; return ret; } -static int decompress_842(void *ctx, struct zcomp_req *req) +static int decompress_842(struct zcomp_ctx *ctx, struct zcomp_req *req) { unsigned int dlen = req->dst_len; diff --git a/drivers/block/zram/backend_deflate.c b/drivers/block/zram/backend_deflate.c index 117852d45aa4..eae4ee35c1df 100644 --- a/drivers/block/zram/backend_deflate.c +++ b/drivers/block/zram/backend_deflate.c @@ -17,9 +17,12 @@ 
struct deflate_ctx { s32 level; }; -static void deflate_destroy(void *ctx) +static void deflate_destroy(struct zcomp_ctx *ctx) { - struct deflate_ctx *zctx = ctx; + struct deflate_ctx *zctx = ctx->context; + + if (!zctx) + return; if (zctx->cctx.workspace) { zlib_deflateEnd(&zctx->cctx); @@ -32,51 +35,52 @@ static void deflate_destroy(void *ctx) kfree(zctx); } -static void *deflate_create(struct zcomp_params *params) +static int deflate_create(struct zcomp_params *params, struct zcomp_ctx *ctx) { - struct deflate_ctx *ctx; + struct deflate_ctx *zctx; size_t sz; int ret; - ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); - if (!ctx) - return NULL; + zctx = kzalloc(sizeof(*zctx), GFP_KERNEL); + if (!zctx) + return -ENOMEM; + ctx->context = zctx; if (params->level != ZCOMP_PARAM_NO_LEVEL) - ctx->level = params->level; + zctx->level = params->level; else - ctx->level = Z_DEFAULT_COMPRESSION; + zctx->level = Z_DEFAULT_COMPRESSION; sz = zlib_deflate_workspacesize(-DEFLATE_DEF_WINBITS, MAX_MEM_LEVEL); - ctx->cctx.workspace = vzalloc(sz); - if (!ctx->cctx.workspace) + zctx->cctx.workspace = vzalloc(sz); + if (!zctx->cctx.workspace) goto error; - ret = zlib_deflateInit2(&ctx->cctx, ctx->level, Z_DEFLATED, + ret = zlib_deflateInit2(&zctx->cctx, zctx->level, Z_DEFLATED, -DEFLATE_DEF_WINBITS, DEFLATE_DEF_MEMLEVEL, Z_DEFAULT_STRATEGY); if (ret != Z_OK) goto error; sz = zlib_inflate_workspacesize(); - ctx->dctx.workspace = vzalloc(sz); - if (!ctx->dctx.workspace) + zctx->dctx.workspace = vzalloc(sz); + if (!zctx->dctx.workspace) goto error; - ret = zlib_inflateInit2(&ctx->dctx, -DEFLATE_DEF_WINBITS); + ret = zlib_inflateInit2(&zctx->dctx, -DEFLATE_DEF_WINBITS); if (ret != Z_OK) goto error; - return ctx; + return 0; error: deflate_destroy(ctx); - return NULL; + return -EINVAL; } -static int deflate_compress(void *ctx, struct zcomp_req *req) +static int deflate_compress(struct zcomp_ctx *ctx, struct zcomp_req *req) { - struct deflate_ctx *zctx = ctx; + struct deflate_ctx *zctx = ctx->context; struct z_stream_s *deflate; int ret; @@ -98,9 +102,9 @@ static int deflate_compress(void *ctx, struct zcomp_req *req) return 0; } -static int deflate_decompress(void *ctx, struct zcomp_req *req) +static int deflate_decompress(struct zcomp_ctx *ctx, struct zcomp_req *req) { - struct deflate_ctx *zctx = ctx; + struct deflate_ctx *zctx = ctx->context; struct z_stream_s *inflate; int ret; diff --git a/drivers/block/zram/backend_lz4.c b/drivers/block/zram/backend_lz4.c index cc4ae4097870..e2d951e62746 100644 --- a/drivers/block/zram/backend_lz4.c +++ b/drivers/block/zram/backend_lz4.c @@ -10,40 +10,44 @@ struct lz4_ctx { s32 level; }; -static void lz4_destroy(void *ctx) +static void lz4_destroy(struct zcomp_ctx *ctx) { - struct lz4_ctx *zctx = ctx; + struct lz4_ctx *zctx = ctx->context; + + if (!zctx) + return; vfree(zctx->mem); kfree(zctx); } -static void *lz4_create(struct zcomp_params *params) +static int lz4_create(struct zcomp_params *params, struct zcomp_ctx *ctx) { - struct lz4_ctx *ctx; + struct lz4_ctx *zctx; - ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); - if (!ctx) - return NULL; + zctx = kzalloc(sizeof(*zctx), GFP_KERNEL); + if (!zctx) + return -ENOMEM; + ctx->context = zctx; if (params->level != ZCOMP_PARAM_NO_LEVEL) - ctx->level = params->level; + zctx->level = params->level; else - ctx->level = LZ4_ACCELERATION_DEFAULT; + zctx->level = LZ4_ACCELERATION_DEFAULT; - ctx->mem = vmalloc(LZ4_MEM_COMPRESS); - if (!ctx->mem) + zctx->mem = vmalloc(LZ4_MEM_COMPRESS); + if (!zctx->mem) goto error; - return ctx; + return 0; error: 
lz4_destroy(ctx); - return NULL; + return -EINVAL; } -static int lz4_compress(void *ctx, struct zcomp_req *req) +static int lz4_compress(struct zcomp_ctx *ctx, struct zcomp_req *req) { - struct lz4_ctx *zctx = ctx; + struct lz4_ctx *zctx = ctx->context; int ret; ret = LZ4_compress_fast(req->src, req->dst, req->src_len, @@ -54,7 +58,7 @@ static int lz4_compress(void *ctx, struct zcomp_req *req) return 0; } -static int lz4_decompress(void *ctx, struct zcomp_req *req) +static int lz4_decompress(struct zcomp_ctx *ctx, struct zcomp_req *req) { int ret; diff --git a/drivers/block/zram/backend_lz4hc.c b/drivers/block/zram/backend_lz4hc.c index 610dadc09751..6da71ec5fc05 100644 --- a/drivers/block/zram/backend_lz4hc.c +++ b/drivers/block/zram/backend_lz4hc.c @@ -10,40 +10,44 @@ struct lz4hc_ctx { s32 level; }; -static void lz4hc_destroy(void *ctx) +static void lz4hc_destroy(struct zcomp_ctx *ctx) { - struct lz4hc_ctx *zctx = ctx; + struct lz4hc_ctx *zctx = ctx->context; + + if (!zctx) + return; vfree(zctx->mem); kfree(zctx); } -static void *lz4hc_create(struct zcomp_params *params) +static int lz4hc_create(struct zcomp_params *params, struct zcomp_ctx *ctx) { - struct lz4hc_ctx *ctx; + struct lz4hc_ctx *zctx; - ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); - if (!ctx) - return NULL; + zctx = kzalloc(sizeof(*zctx), GFP_KERNEL); + if (!zctx) + return -ENOMEM; + ctx->context = zctx; if (params->level != ZCOMP_PARAM_NO_LEVEL) - ctx->level = params->level; + zctx->level = params->level; else - ctx->level = LZ4HC_DEFAULT_CLEVEL; + zctx->level = LZ4HC_DEFAULT_CLEVEL; - ctx->mem = vmalloc(LZ4HC_MEM_COMPRESS); - if (!ctx->mem) + zctx->mem = vmalloc(LZ4HC_MEM_COMPRESS); + if (!zctx->mem) goto error; - return ctx; + return 0; error: lz4hc_destroy(ctx); - return NULL; + return -EINVAL; } -static int lz4hc_compress(void *ctx, struct zcomp_req *req) +static int lz4hc_compress(struct zcomp_ctx *ctx, struct zcomp_req *req) { - struct lz4hc_ctx *zctx = ctx; + struct lz4hc_ctx *zctx = ctx->context; int ret; ret = LZ4_compress_HC(req->src, req->dst, req->src_len, req->dst_len, @@ -54,7 +58,7 @@ static int lz4hc_compress(void *ctx, struct zcomp_req *req) return 0; } -static int lz4hc_decompress(void *ctx, struct zcomp_req *req) +static int lz4hc_decompress(struct zcomp_ctx *ctx, struct zcomp_req *req) { int ret; diff --git a/drivers/block/zram/backend_lzo.c b/drivers/block/zram/backend_lzo.c index 88f1aa5683f3..81fbad286092 100644 --- a/drivers/block/zram/backend_lzo.c +++ b/drivers/block/zram/backend_lzo.c @@ -6,26 +6,29 @@ #include "backend_lzo.h" -static void *lzo_create(struct zcomp_params *params) +static int lzo_create(struct zcomp_params *params, struct zcomp_ctx *ctx) { - return kzalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL); + ctx->context = kzalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL); + if (!ctx->context) + return -ENOMEM; + return 0; } -static void lzo_destroy(void *ctx) +static void lzo_destroy(struct zcomp_ctx *ctx) { - kfree(ctx); + kfree(ctx->context); } -static int lzo_compress(void *ctx, struct zcomp_req *req) +static int lzo_compress(struct zcomp_ctx *ctx, struct zcomp_req *req) { int ret; ret = lzo1x_1_compress(req->src, req->src_len, req->dst, - &req->dst_len, ctx); + &req->dst_len, ctx->context); return ret == LZO_E_OK ? 
0 : ret; } -static int lzo_decompress(void *ctx, struct zcomp_req *req) +static int lzo_decompress(struct zcomp_ctx *ctx, struct zcomp_req *req) { int ret; diff --git a/drivers/block/zram/backend_lzorle.c b/drivers/block/zram/backend_lzorle.c index f05820929a74..99c9da8f116e 100644 --- a/drivers/block/zram/backend_lzorle.c +++ b/drivers/block/zram/backend_lzorle.c @@ -6,26 +6,29 @@ #include "backend_lzorle.h" -static void *lzorle_create(struct zcomp_params *params) +static int lzorle_create(struct zcomp_params *params, struct zcomp_ctx *ctx) { - return kzalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL); + ctx->context = kzalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL); + if (!ctx->context) + return -ENOMEM; + return 0; } -static void lzorle_destroy(void *ctx) +static void lzorle_destroy(struct zcomp_ctx *ctx) { - kfree(ctx); + kfree(ctx->context); } -static int lzorle_compress(void *ctx, struct zcomp_req *req) +static int lzorle_compress(struct zcomp_ctx *ctx, struct zcomp_req *req) { int ret; ret = lzorle1x_1_compress(req->src, req->src_len, req->dst, - &req->dst_len, ctx); + &req->dst_len, ctx->context); return ret == LZO_E_OK ? 0 : ret; } -static int lzorle_decompress(void *ctx, struct zcomp_req *req) +static int lzorle_decompress(struct zcomp_ctx *ctx, struct zcomp_req *req) { int ret; diff --git a/drivers/block/zram/backend_zstd.c b/drivers/block/zram/backend_zstd.c index 19eba65f44c9..9f000dedf1ad 100644 --- a/drivers/block/zram/backend_zstd.c +++ b/drivers/block/zram/backend_zstd.c @@ -16,60 +16,64 @@ struct zstd_ctx { s32 level; }; -static void zstd_destroy(void *ctx) +static void zstd_destroy(struct zcomp_ctx *ctx) { - struct zstd_ctx *zctx = ctx; + struct zstd_ctx *zctx = ctx->context; + + if (!zctx) + return; vfree(zctx->cctx_mem); vfree(zctx->dctx_mem); kfree(zctx); } -static void *zstd_create(struct zcomp_params *params) +static int zstd_create(struct zcomp_params *params, struct zcomp_ctx *ctx) { + struct zstd_ctx *zctx; zstd_parameters prm; - struct zstd_ctx *ctx; size_t sz; - ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); - if (!ctx) - return NULL; + zctx = kzalloc(sizeof(*zctx), GFP_KERNEL); + if (!zctx) + return -ENOMEM; + ctx->context = zctx; if (params->level != ZCOMP_PARAM_NO_LEVEL) - ctx->level = params->level; + zctx->level = params->level; else - ctx->level = zstd_default_clevel(); + zctx->level = zstd_default_clevel(); - prm = zstd_get_params(ctx->level, PAGE_SIZE); - ctx->cprm = zstd_get_params(ctx->level, PAGE_SIZE); + prm = zstd_get_params(zctx->level, PAGE_SIZE); + zctx->cprm = zstd_get_params(zctx->level, PAGE_SIZE); sz = zstd_cctx_workspace_bound(&prm.cParams); - ctx->cctx_mem = vzalloc(sz); - if (!ctx->cctx_mem) + zctx->cctx_mem = vzalloc(sz); + if (!zctx->cctx_mem) goto error; - ctx->cctx = zstd_init_cctx(ctx->cctx_mem, sz); - if (!ctx->cctx) + zctx->cctx = zstd_init_cctx(zctx->cctx_mem, sz); + if (!zctx->cctx) goto error; sz = zstd_dctx_workspace_bound(); - ctx->dctx_mem = vzalloc(sz); - if (!ctx->dctx_mem) + zctx->dctx_mem = vzalloc(sz); + if (!zctx->dctx_mem) goto error; - ctx->dctx = zstd_init_dctx(ctx->dctx_mem, sz); - if (!ctx->dctx) + zctx->dctx = zstd_init_dctx(zctx->dctx_mem, sz); + if (!zctx->dctx) goto error; - return ctx; + return 0; error: zstd_destroy(ctx); - return NULL; + return -EINVAL; } -static int zstd_compress(void *ctx, struct zcomp_req *req) +static int zstd_compress(struct zcomp_ctx *ctx, struct zcomp_req *req) { - struct zstd_ctx *zctx = ctx; + struct zstd_ctx *zctx = ctx->context; size_t ret; ret = zstd_compress_cctx(zctx->cctx, req->dst, req->dst_len, @@ 
-80,9 +84,9 @@ static int zstd_compress(void *ctx, struct zcomp_req *req) return 0; } -static int zstd_decompress(void *ctx, struct zcomp_req *req) +static int zstd_decompress(struct zcomp_ctx *ctx, struct zcomp_req *req) { - struct zstd_ctx *zctx = ctx; + struct zstd_ctx *zctx = ctx->context; size_t ret; ret = zstd_decompress_dctx(zctx->dctx, req->dst, req->dst_len, diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index 20ad7b6fe62f..96f07287e571 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -45,23 +45,25 @@ static const struct zcomp_ops *backends[] = { static void zcomp_strm_free(struct zcomp *comp, struct zcomp_strm *zstrm) { - if (zstrm->ctx) - comp->ops->destroy_ctx(zstrm->ctx); + comp->ops->destroy_ctx(&zstrm->ctx); vfree(zstrm->buffer); - zstrm->ctx = NULL; zstrm->buffer = NULL; } static int zcomp_strm_init(struct zcomp *comp, struct zcomp_strm *zstrm) { - zstrm->ctx = comp->ops->create_ctx(comp->params); + int ret; + + ret = comp->ops->create_ctx(comp->params, &zstrm->ctx); + if (ret) + return ret; /* * allocate 2 pages. 1 for compressed data, plus 1 extra for the * case when compressed size is larger than the original one */ zstrm->buffer = vzalloc(2 * PAGE_SIZE); - if (!zstrm->ctx || !zstrm->buffer) { + if (!zstrm->buffer) { zcomp_strm_free(comp, zstrm); return -ENOMEM; } @@ -127,7 +129,7 @@ int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm, }; int ret; - ret = comp->ops->compress(zstrm->ctx, &req); + ret = comp->ops->compress(&zstrm->ctx, &req); if (!ret) *dst_len = req.dst_len; return ret; @@ -143,7 +145,7 @@ int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm, .dst_len = PAGE_SIZE, }; - return comp->ops->decompress(zstrm->ctx, &req); + return comp->ops->decompress(&zstrm->ctx, &req); } int zcomp_cpu_up_prepare(unsigned int cpu, struct hlist_node *node) diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h index bbc48094f826..1d8920c2d449 100644 --- a/drivers/block/zram/zcomp.h +++ b/drivers/block/zram/zcomp.h @@ -13,12 +13,20 @@ struct zcomp_params { s32 level; }; +/* + * Run-time driver context - scratch buffers, etc. It is modified during + * request execution (compression/decompression), cannot be shared, so + * it's in per-CPU area. + */ +struct zcomp_ctx { + void *context; +}; + struct zcomp_strm { - /* The members ->buffer and ->tfm are protected by ->lock. */ local_lock_t lock; - /* compression/decompression buffer */ + /* compression buffer */ void *buffer; - void *ctx; + struct zcomp_ctx ctx; }; struct zcomp_req { @@ -30,11 +38,12 @@ struct zcomp_req { }; struct zcomp_ops { - int (*compress)(void *ctx, struct zcomp_req *req); - int (*decompress)(void *ctx, struct zcomp_req *req); + int (*compress)(struct zcomp_ctx *ctx, struct zcomp_req *req); + int (*decompress)(struct zcomp_ctx *ctx, struct zcomp_req *req); - void *(*create_ctx)(struct zcomp_params *params); - void (*destroy_ctx)(void *ctx); + int (*create_ctx)(struct zcomp_params *params, + struct zcomp_ctx *ctx); + void (*destroy_ctx)(struct zcomp_ctx *ctx); const char *name; }; From b8f03cb703a160e14f87a467a4443adc5d940087 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:56:07 +0900 Subject: [PATCH 349/416] zram: move immutable comp params away from per-CPU context Immutable params never change once comp has been allocated and setup, so we don't need to store multiple copies of them in each per-CPU backend context. 
Move those to the per-comp zcomp_params and pass them to backend callbacks for request execution. Basically, this means the parameters are shared between different contexts. Also introduce two new backend callbacks: setup_params() and release_params(). First, we need to validate params in a driver-specific way; second, a driver may want to allocate its specific representation of the params, which is needed to execute requests. Link: https://lkml.kernel.org/r/20240902105656.1383858-20-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/backend_842.c | 17 +++++++-- drivers/block/zram/backend_deflate.c | 29 ++++++++++----- drivers/block/zram/backend_lz4.c | 53 +++++++++++----------------- drivers/block/zram/backend_lz4hc.c | 53 +++++++++++----------------- drivers/block/zram/backend_lzo.c | 17 +++++++-- drivers/block/zram/backend_lzorle.c | 17 +++++++-- drivers/block/zram/backend_zstd.c | 47 +++++++++++++++++------- drivers/block/zram/zcomp.c | 17 ++++++--- drivers/block/zram/zcomp.h | 20 ++++++++--- 9 files changed, 170 insertions(+), 100 deletions(-) diff --git a/drivers/block/zram/backend_842.c b/drivers/block/zram/backend_842.c index 2f1202322264..10d9d5c60f53 100644 --- a/drivers/block/zram/backend_842.c +++ b/drivers/block/zram/backend_842.c @@ -7,6 +7,15 @@ #include "backend_842.h" +static void release_params_842(struct zcomp_params *params) +{ +} + +static int setup_params_842(struct zcomp_params *params) +{ + return 0; +} + static void destroy_842(struct zcomp_ctx *ctx) { kfree(ctx->context); @@ -20,7 +29,8 @@ static int create_842(struct zcomp_params *params, struct zcomp_ctx *ctx) return 0; } -static int compress_842(struct zcomp_ctx *ctx, struct zcomp_req *req) +static int compress_842(struct zcomp_params *params, struct zcomp_ctx *ctx, + struct zcomp_req *req) { unsigned int dlen = req->dst_len; int ret; @@ -32,7 +42,8 @@ static int compress_842(struct zcomp_ctx *ctx, struct zcomp_req *req) return ret; } -static int decompress_842(struct zcomp_ctx *ctx, struct zcomp_req *req) +static int decompress_842(struct zcomp_params *params, struct zcomp_ctx *ctx, + struct zcomp_req *req) { unsigned int dlen = req->dst_len; @@ -44,5 +55,7 @@ const struct zcomp_ops backend_842 = { .decompress = decompress_842, .create_ctx = create_842, .destroy_ctx = destroy_842, + .setup_params = setup_params_842, + .release_params = release_params_842, .name = "842", }; diff --git a/drivers/block/zram/backend_deflate.c b/drivers/block/zram/backend_deflate.c index eae4ee35c1df..0f7f252c12f4 100644 --- a/drivers/block/zram/backend_deflate.c +++ b/drivers/block/zram/backend_deflate.c @@ -14,9 +14,20 @@ struct deflate_ctx { struct z_stream_s cctx; struct z_stream_s dctx; - s32 level; }; +static void deflate_release_params(struct zcomp_params *params) +{ +} + +static int deflate_setup_params(struct zcomp_params *params) +{ + if (params->level == ZCOMP_PARAM_NO_LEVEL) + params->level = Z_DEFAULT_COMPRESSION; + + return 0; +} + static void deflate_destroy(struct zcomp_ctx *ctx) { struct deflate_ctx *zctx = ctx->context; @@ -46,17 +57,12 @@ static int deflate_create(struct zcomp_params *params, struct zcomp_ctx *ctx) return -ENOMEM; ctx->context = zctx; - if (params->level != ZCOMP_PARAM_NO_LEVEL) - zctx->level = params->level; - else - zctx->level = Z_DEFAULT_COMPRESSION; - sz = zlib_deflate_workspacesize(-DEFLATE_DEF_WINBITS, MAX_MEM_LEVEL); zctx->cctx.workspace = vzalloc(sz); if (!zctx->cctx.workspace) goto error; - ret = 
zlib_deflateInit2(&zctx->cctx, zctx->level, Z_DEFLATED, + ret = zlib_deflateInit2(&zctx->cctx, params->level, Z_DEFLATED, -DEFLATE_DEF_WINBITS, DEFLATE_DEF_MEMLEVEL, Z_DEFAULT_STRATEGY); if (ret != Z_OK) @@ -78,7 +84,8 @@ error: return -EINVAL; } -static int deflate_compress(struct zcomp_ctx *ctx, struct zcomp_req *req) +static int deflate_compress(struct zcomp_params *params, struct zcomp_ctx *ctx, + struct zcomp_req *req) { struct deflate_ctx *zctx = ctx->context; struct z_stream_s *deflate; @@ -102,7 +109,9 @@ static int deflate_compress(struct zcomp_ctx *ctx, struct zcomp_req *req) return 0; } -static int deflate_decompress(struct zcomp_ctx *ctx, struct zcomp_req *req) +static int deflate_decompress(struct zcomp_params *params, + struct zcomp_ctx *ctx, + struct zcomp_req *req) { struct deflate_ctx *zctx = ctx->context; struct z_stream_s *inflate; @@ -131,5 +140,7 @@ const struct zcomp_ops backend_deflate = { .decompress = deflate_decompress, .create_ctx = deflate_create, .destroy_ctx = deflate_destroy, + .setup_params = deflate_setup_params, + .release_params = deflate_release_params, .name = "deflate", }; diff --git a/drivers/block/zram/backend_lz4.c b/drivers/block/zram/backend_lz4.c index e2d951e62746..cf3c029bd5ad 100644 --- a/drivers/block/zram/backend_lz4.c +++ b/drivers/block/zram/backend_lz4.c @@ -5,60 +5,47 @@ #include "backend_lz4.h" -struct lz4_ctx { - void *mem; - s32 level; -}; +static void lz4_release_params(struct zcomp_params *params) +{ +} + +static int lz4_setup_params(struct zcomp_params *params) +{ + if (params->level == ZCOMP_PARAM_NO_LEVEL) + params->level = LZ4_ACCELERATION_DEFAULT; + + return 0; +} static void lz4_destroy(struct zcomp_ctx *ctx) { - struct lz4_ctx *zctx = ctx->context; - - if (!zctx) - return; - - vfree(zctx->mem); - kfree(zctx); + vfree(ctx->context); } static int lz4_create(struct zcomp_params *params, struct zcomp_ctx *ctx) { - struct lz4_ctx *zctx; - - zctx = kzalloc(sizeof(*zctx), GFP_KERNEL); - if (!zctx) + ctx->context = vmalloc(LZ4_MEM_COMPRESS); + if (!ctx->context) return -ENOMEM; - ctx->context = zctx; - if (params->level != ZCOMP_PARAM_NO_LEVEL) - zctx->level = params->level; - else - zctx->level = LZ4_ACCELERATION_DEFAULT; - - zctx->mem = vmalloc(LZ4_MEM_COMPRESS); - if (!zctx->mem) - goto error; - return 0; -error: - lz4_destroy(ctx); - return -EINVAL; } -static int lz4_compress(struct zcomp_ctx *ctx, struct zcomp_req *req) +static int lz4_compress(struct zcomp_params *params, struct zcomp_ctx *ctx, + struct zcomp_req *req) { - struct lz4_ctx *zctx = ctx->context; int ret; ret = LZ4_compress_fast(req->src, req->dst, req->src_len, - req->dst_len, zctx->level, zctx->mem); + req->dst_len, params->level, ctx->context); if (!ret) return -EINVAL; req->dst_len = ret; return 0; } -static int lz4_decompress(struct zcomp_ctx *ctx, struct zcomp_req *req) +static int lz4_decompress(struct zcomp_params *params, struct zcomp_ctx *ctx, + struct zcomp_req *req) { int ret; @@ -74,5 +61,7 @@ const struct zcomp_ops backend_lz4 = { .decompress = lz4_decompress, .create_ctx = lz4_create, .destroy_ctx = lz4_destroy, + .setup_params = lz4_setup_params, + .release_params = lz4_release_params, .name = "lz4", }; diff --git a/drivers/block/zram/backend_lz4hc.c b/drivers/block/zram/backend_lz4hc.c index 6da71ec5fc05..928a6ea78668 100644 --- a/drivers/block/zram/backend_lz4hc.c +++ b/drivers/block/zram/backend_lz4hc.c @@ -5,60 +5,47 @@ #include "backend_lz4hc.h" -struct lz4hc_ctx { - void *mem; - s32 level; -}; +static void lz4hc_release_params(struct 
zcomp_params *params) +{ +} + +static int lz4hc_setup_params(struct zcomp_params *params) +{ + if (params->level == ZCOMP_PARAM_NO_LEVEL) + params->level = LZ4HC_DEFAULT_CLEVEL; + + return 0; +} static void lz4hc_destroy(struct zcomp_ctx *ctx) { - struct lz4hc_ctx *zctx = ctx->context; - - if (!zctx) - return; - - vfree(zctx->mem); - kfree(zctx); + vfree(ctx->context); } static int lz4hc_create(struct zcomp_params *params, struct zcomp_ctx *ctx) { - struct lz4hc_ctx *zctx; - - zctx = kzalloc(sizeof(*zctx), GFP_KERNEL); - if (!zctx) + ctx->context = vmalloc(LZ4HC_MEM_COMPRESS); + if (!ctx->context) return -ENOMEM; - ctx->context = zctx; - if (params->level != ZCOMP_PARAM_NO_LEVEL) - zctx->level = params->level; - else - zctx->level = LZ4HC_DEFAULT_CLEVEL; - - zctx->mem = vmalloc(LZ4HC_MEM_COMPRESS); - if (!zctx->mem) - goto error; - return 0; -error: - lz4hc_destroy(ctx); - return -EINVAL; } -static int lz4hc_compress(struct zcomp_ctx *ctx, struct zcomp_req *req) +static int lz4hc_compress(struct zcomp_params *params, struct zcomp_ctx *ctx, + struct zcomp_req *req) { - struct lz4hc_ctx *zctx = ctx->context; int ret; ret = LZ4_compress_HC(req->src, req->dst, req->src_len, req->dst_len, - zctx->level, zctx->mem); + params->level, ctx->context); if (!ret) return -EINVAL; req->dst_len = ret; return 0; } -static int lz4hc_decompress(struct zcomp_ctx *ctx, struct zcomp_req *req) +static int lz4hc_decompress(struct zcomp_params *params, struct zcomp_ctx *ctx, + struct zcomp_req *req) { int ret; @@ -74,5 +61,7 @@ const struct zcomp_ops backend_lz4hc = { .decompress = lz4hc_decompress, .create_ctx = lz4hc_create, .destroy_ctx = lz4hc_destroy, + .setup_params = lz4hc_setup_params, + .release_params = lz4hc_release_params, .name = "lz4hc", }; diff --git a/drivers/block/zram/backend_lzo.c b/drivers/block/zram/backend_lzo.c index 81fbad286092..4c906beaae6b 100644 --- a/drivers/block/zram/backend_lzo.c +++ b/drivers/block/zram/backend_lzo.c @@ -6,6 +6,15 @@ #include "backend_lzo.h" +static void lzo_release_params(struct zcomp_params *params) +{ +} + +static int lzo_setup_params(struct zcomp_params *params) +{ + return 0; +} + static int lzo_create(struct zcomp_params *params, struct zcomp_ctx *ctx) { ctx->context = kzalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL); @@ -19,7 +28,8 @@ static void lzo_destroy(struct zcomp_ctx *ctx) kfree(ctx->context); } -static int lzo_compress(struct zcomp_ctx *ctx, struct zcomp_req *req) +static int lzo_compress(struct zcomp_params *params, struct zcomp_ctx *ctx, + struct zcomp_req *req) { int ret; @@ -28,7 +38,8 @@ static int lzo_compress(struct zcomp_ctx *ctx, struct zcomp_req *req) return ret == LZO_E_OK ? 
0 : ret; } -static int lzo_decompress(struct zcomp_ctx *ctx, struct zcomp_req *req) +static int lzo_decompress(struct zcomp_params *params, struct zcomp_ctx *ctx, + struct zcomp_req *req) { int ret; @@ -42,5 +53,7 @@ const struct zcomp_ops backend_lzo = { .decompress = lzo_decompress, .create_ctx = lzo_create, .destroy_ctx = lzo_destroy, + .setup_params = lzo_setup_params, + .release_params = lzo_release_params, .name = "lzo", }; diff --git a/drivers/block/zram/backend_lzorle.c b/drivers/block/zram/backend_lzorle.c index 99c9da8f116e..10640c96cbfc 100644 --- a/drivers/block/zram/backend_lzorle.c +++ b/drivers/block/zram/backend_lzorle.c @@ -6,6 +6,15 @@ #include "backend_lzorle.h" +static void lzorle_release_params(struct zcomp_params *params) +{ +} + +static int lzorle_setup_params(struct zcomp_params *params) +{ + return 0; +} + static int lzorle_create(struct zcomp_params *params, struct zcomp_ctx *ctx) { ctx->context = kzalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL); @@ -19,7 +28,8 @@ static void lzorle_destroy(struct zcomp_ctx *ctx) kfree(ctx->context); } -static int lzorle_compress(struct zcomp_ctx *ctx, struct zcomp_req *req) +static int lzorle_compress(struct zcomp_params *params, struct zcomp_ctx *ctx, + struct zcomp_req *req) { int ret; @@ -28,7 +38,8 @@ static int lzorle_compress(struct zcomp_ctx *ctx, struct zcomp_req *req) return ret == LZO_E_OK ? 0 : ret; } -static int lzorle_decompress(struct zcomp_ctx *ctx, struct zcomp_req *req) +static int lzorle_decompress(struct zcomp_params *params, struct zcomp_ctx *ctx, + struct zcomp_req *req) { int ret; @@ -42,5 +53,7 @@ const struct zcomp_ops backend_lzorle = { .decompress = lzorle_decompress, .create_ctx = lzorle_create, .destroy_ctx = lzorle_destroy, + .setup_params = lzorle_setup_params, + .release_params = lzorle_release_params, .name = "lzo-rle", }; diff --git a/drivers/block/zram/backend_zstd.c b/drivers/block/zram/backend_zstd.c index 9f000dedf1ad..5b33daf4f645 100644 --- a/drivers/block/zram/backend_zstd.c +++ b/drivers/block/zram/backend_zstd.c @@ -10,12 +10,36 @@ struct zstd_ctx { zstd_cctx *cctx; zstd_dctx *dctx; - zstd_parameters cprm; void *cctx_mem; void *dctx_mem; - s32 level; }; +struct zstd_params { + zstd_parameters cprm; +}; + +static void zstd_release_params(struct zcomp_params *params) +{ + kfree(params->drv_data); +} + +static int zstd_setup_params(struct zcomp_params *params) +{ + struct zstd_params *zp; + + zp = kzalloc(sizeof(*zp), GFP_KERNEL); + if (!zp) + return -ENOMEM; + + if (params->level == ZCOMP_PARAM_NO_LEVEL) + params->level = zstd_default_clevel(); + + zp->cprm = zstd_get_params(params->level, PAGE_SIZE); + params->drv_data = zp; + + return 0; +} + static void zstd_destroy(struct zcomp_ctx *ctx) { struct zstd_ctx *zctx = ctx->context; @@ -39,13 +63,7 @@ static int zstd_create(struct zcomp_params *params, struct zcomp_ctx *ctx) return -ENOMEM; ctx->context = zctx; - if (params->level != ZCOMP_PARAM_NO_LEVEL) - zctx->level = params->level; - else - zctx->level = zstd_default_clevel(); - - prm = zstd_get_params(zctx->level, PAGE_SIZE); - zctx->cprm = zstd_get_params(zctx->level, PAGE_SIZE); + prm = zstd_get_params(params->level, PAGE_SIZE); sz = zstd_cctx_workspace_bound(&prm.cParams); zctx->cctx_mem = vzalloc(sz); if (!zctx->cctx_mem) @@ -71,20 +89,23 @@ error: return -EINVAL; } -static int zstd_compress(struct zcomp_ctx *ctx, struct zcomp_req *req) +static int zstd_compress(struct zcomp_params *params, struct zcomp_ctx *ctx, + struct zcomp_req *req) { + struct zstd_params *zp = params->drv_data; struct 
zstd_ctx *zctx = ctx->context; size_t ret; ret = zstd_compress_cctx(zctx->cctx, req->dst, req->dst_len, - req->src, req->src_len, &zctx->cprm); + req->src, req->src_len, &zp->cprm); if (zstd_is_error(ret)) return -EINVAL; req->dst_len = ret; return 0; } -static int zstd_decompress(struct zcomp_ctx *ctx, struct zcomp_req *req) +static int zstd_decompress(struct zcomp_params *params, struct zcomp_ctx *ctx, + struct zcomp_req *req) { struct zstd_ctx *zctx = ctx->context; size_t ret; @@ -101,5 +122,7 @@ const struct zcomp_ops backend_zstd = { .decompress = zstd_decompress, .create_ctx = zstd_create, .destroy_ctx = zstd_destroy, + .setup_params = zstd_setup_params, + .release_params = zstd_release_params, .name = "zstd", }; diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index 96f07287e571..bb514403e305 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -129,7 +129,7 @@ int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm, }; int ret; - ret = comp->ops->compress(&zstrm->ctx, &req); + ret = comp->ops->compress(comp->params, &zstrm->ctx, &req); if (!ret) *dst_len = req.dst_len; return ret; @@ -145,7 +145,7 @@ int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm, .dst_len = PAGE_SIZE, }; - return comp->ops->decompress(&zstrm->ctx, &req); + return comp->ops->decompress(comp->params, &zstrm->ctx, &req); } int zcomp_cpu_up_prepare(unsigned int cpu, struct hlist_node *node) @@ -173,7 +173,7 @@ int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node) return 0; } -static int zcomp_init(struct zcomp *comp) +static int zcomp_init(struct zcomp *comp, struct zcomp_params *params) { int ret; @@ -181,12 +181,19 @@ static int zcomp_init(struct zcomp *comp) if (!comp->stream) return -ENOMEM; + comp->params = params; + ret = comp->ops->setup_params(comp->params); + if (ret) + goto cleanup; + ret = cpuhp_state_add_instance(CPUHP_ZCOMP_PREPARE, &comp->node); if (ret < 0) goto cleanup; + return 0; cleanup: + comp->ops->release_params(comp->params); free_percpu(comp->stream); return ret; } @@ -194,6 +201,7 @@ cleanup: void zcomp_destroy(struct zcomp *comp) { cpuhp_state_remove_instance(CPUHP_ZCOMP_PREPARE, &comp->node); + comp->ops->release_params(comp->params); free_percpu(comp->stream); kfree(comp); } @@ -215,14 +223,13 @@ struct zcomp *zcomp_create(const char *alg, struct zcomp_params *params) if (!comp) return ERR_PTR(-ENOMEM); - comp->params = params; comp->ops = lookup_backend_ops(alg); if (!comp->ops) { kfree(comp); return ERR_PTR(-EINVAL); } - error = zcomp_init(comp); + error = zcomp_init(comp, params); if (error) { kfree(comp); return ERR_PTR(error); diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h index 1d8920c2d449..ad5762813842 100644 --- a/drivers/block/zram/zcomp.h +++ b/drivers/block/zram/zcomp.h @@ -7,10 +7,18 @@ #define ZCOMP_PARAM_NO_LEVEL INT_MIN +/* + * Immutable driver (backend) parameters. The driver may attach private + * data to it (e.g. driver representation of the dictionary, etc.). + * + * This data is kept per-comp and is shared among execution contexts. 
+ */ struct zcomp_params { void *dict; size_t dict_sz; s32 level; + + void *drv_data; }; /* @@ -38,13 +46,17 @@ struct zcomp_req { }; struct zcomp_ops { - int (*compress)(struct zcomp_ctx *ctx, struct zcomp_req *req); - int (*decompress)(struct zcomp_ctx *ctx, struct zcomp_req *req); + int (*compress)(struct zcomp_params *params, struct zcomp_ctx *ctx, + struct zcomp_req *req); + int (*decompress)(struct zcomp_params *params, struct zcomp_ctx *ctx, + struct zcomp_req *req); - int (*create_ctx)(struct zcomp_params *params, - struct zcomp_ctx *ctx); + int (*create_ctx)(struct zcomp_params *params, struct zcomp_ctx *ctx); void (*destroy_ctx)(struct zcomp_ctx *ctx); + int (*setup_params)(struct zcomp_params *params); + void (*release_params)(struct zcomp_params *params); + const char *name; }; From fb4f644ee8dac4e896269b84f041eb5dd3f6249d Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:56:08 +0900 Subject: [PATCH 350/416] zram: add dictionary support to lz4 Support pre-trained dictionary param. lz4 doesn't mandate specific format of the dictionary and even zstd --train can be used to train a dictionary for lz4, according to [1]. TEST ==== *** lz4 /sys/block/zram0/mm_stat 1750654976 664188565 676864000 0 676864000 1 0 34288 34288 *** lz4 dict=/etc/lz4-dict-amd64 /sys/block/zram0/mm_stat 1750638592 619891141 632053760 0 632053760 1 0 34278 34278 *** lz4 level=5 dict=/etc/lz4-dict-amd64 /sys/block/zram0/mm_stat 1750638592 727174243 740810752 0 740810752 1 0 34437 34437 [1] https://github.com/lz4/lz4/issues/557 Link: https://lkml.kernel.org/r/20240902105656.1383858-21-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/backend_lz4.c | 74 +++++++++++++++++++++++++++++--- 1 file changed, 67 insertions(+), 7 deletions(-) diff --git a/drivers/block/zram/backend_lz4.c b/drivers/block/zram/backend_lz4.c index cf3c029bd5ad..847f3334eb38 100644 --- a/drivers/block/zram/backend_lz4.c +++ b/drivers/block/zram/backend_lz4.c @@ -5,6 +5,13 @@ #include "backend_lz4.h" +struct lz4_ctx { + void *mem; + + LZ4_streamDecode_t *dstrm; + LZ4_stream_t *cstrm; +}; + static void lz4_release_params(struct zcomp_params *params) { } @@ -19,25 +26,66 @@ static int lz4_setup_params(struct zcomp_params *params) static void lz4_destroy(struct zcomp_ctx *ctx) { - vfree(ctx->context); + struct lz4_ctx *zctx = ctx->context; + + if (!zctx) + return; + + vfree(zctx->mem); + kfree(zctx->dstrm); + kfree(zctx->cstrm); + kfree(zctx); } static int lz4_create(struct zcomp_params *params, struct zcomp_ctx *ctx) { - ctx->context = vmalloc(LZ4_MEM_COMPRESS); - if (!ctx->context) + struct lz4_ctx *zctx; + + zctx = kzalloc(sizeof(*zctx), GFP_KERNEL); + if (!zctx) return -ENOMEM; + ctx->context = zctx; + if (params->dict_sz == 0) { + zctx->mem = vmalloc(LZ4_MEM_COMPRESS); + if (!zctx->mem) + goto error; + } else { + zctx->dstrm = kzalloc(sizeof(*zctx->dstrm), GFP_KERNEL); + if (!zctx->dstrm) + goto error; + + zctx->cstrm = kzalloc(sizeof(*zctx->cstrm), GFP_KERNEL); + if (!zctx->cstrm) + goto error; + } + return 0; + +error: + lz4_destroy(ctx); + return -ENOMEM; } static int lz4_compress(struct zcomp_params *params, struct zcomp_ctx *ctx, struct zcomp_req *req) { + struct lz4_ctx *zctx = ctx->context; int ret; - ret = LZ4_compress_fast(req->src, req->dst, req->src_len, - req->dst_len, params->level, ctx->context); + if (!zctx->cstrm) { + ret = LZ4_compress_fast(req->src, req->dst, req->src_len, + req->dst_len, params->level, + 
zctx->mem); + } else { + /* Cstrm needs to be reset */ + ret = LZ4_loadDict(zctx->cstrm, params->dict, params->dict_sz); + if (ret != params->dict_sz) + return -EINVAL; + ret = LZ4_compress_fast_continue(zctx->cstrm, req->src, + req->dst, req->src_len, + req->dst_len, params->level); + } if (!ret) return -EINVAL; req->dst_len = ret; @@ -47,10 +95,22 @@ static int lz4_compress(struct zcomp_params *params, struct zcomp_ctx *ctx, static int lz4_decompress(struct zcomp_params *params, struct zcomp_ctx *ctx, struct zcomp_req *req) { + struct lz4_ctx *zctx = ctx->context; int ret; - ret = LZ4_decompress_safe(req->src, req->dst, req->src_len, - req->dst_len); + if (!zctx->dstrm) { + ret = LZ4_decompress_safe(req->src, req->dst, req->src_len, + req->dst_len); + } else { + /* Dstrm needs to be reset */ + ret = LZ4_setStreamDecode(zctx->dstrm, params->dict, + params->dict_sz); + if (!ret) + return -EINVAL; + ret = LZ4_decompress_safe_continue(zctx->dstrm, req->src, + req->dst, req->src_len, + req->dst_len); + } if (ret < 0) return -EINVAL; return 0; From 1e673c8cf7f9c1156f615b7c00f224a8110070da Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:56:09 +0900 Subject: [PATCH 351/416] zram: add dictionary support to lz4hc Support pre-trained dictionary param. Just like lz4, lz4hc doesn't mandate specific format of the dictionary and zstd --train can be used to train a dictionary for lz4, according to [1]. TEST ==== *** lz4hc /sys/block/zram0/mm_stat 1750638592 608954620 621031424 0 621031424 1 0 34288 34288 *** lz4hc dict=/etc/lz4-dict-amd64 /sys/block/zram0/mm_stat 1750671360 505068582 514994176 0 514994176 1 0 34278 34278 [1] https://github.com/lz4/lz4/issues/557 Link: https://lkml.kernel.org/r/20240902105656.1383858-22-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/backend_lz4hc.c | 75 +++++++++++++++++++++++++++--- 1 file changed, 68 insertions(+), 7 deletions(-) diff --git a/drivers/block/zram/backend_lz4hc.c b/drivers/block/zram/backend_lz4hc.c index 928a6ea78668..5f37d5abcaeb 100644 --- a/drivers/block/zram/backend_lz4hc.c +++ b/drivers/block/zram/backend_lz4hc.c @@ -5,6 +5,13 @@ #include "backend_lz4hc.h" +struct lz4hc_ctx { + void *mem; + + LZ4_streamDecode_t *dstrm; + LZ4_streamHC_t *cstrm; +}; + static void lz4hc_release_params(struct zcomp_params *params) { } @@ -19,25 +26,67 @@ static int lz4hc_setup_params(struct zcomp_params *params) static void lz4hc_destroy(struct zcomp_ctx *ctx) { - vfree(ctx->context); + struct lz4hc_ctx *zctx = ctx->context; + + if (!zctx) + return; + + kfree(zctx->dstrm); + kfree(zctx->cstrm); + vfree(zctx->mem); + kfree(zctx); } static int lz4hc_create(struct zcomp_params *params, struct zcomp_ctx *ctx) { - ctx->context = vmalloc(LZ4HC_MEM_COMPRESS); - if (!ctx->context) + struct lz4hc_ctx *zctx; + + zctx = kzalloc(sizeof(*zctx), GFP_KERNEL); + if (!zctx) return -ENOMEM; + ctx->context = zctx; + if (params->dict_sz == 0) { + zctx->mem = vmalloc(LZ4HC_MEM_COMPRESS); + if (!zctx->mem) + goto error; + } else { + zctx->dstrm = kzalloc(sizeof(*zctx->dstrm), GFP_KERNEL); + if (!zctx->dstrm) + goto error; + + zctx->cstrm = kzalloc(sizeof(*zctx->cstrm), GFP_KERNEL); + if (!zctx->cstrm) + goto error; + } + return 0; + +error: + lz4hc_destroy(ctx); + return -EINVAL; } static int lz4hc_compress(struct zcomp_params *params, struct zcomp_ctx *ctx, struct zcomp_req *req) { + struct lz4hc_ctx *zctx = ctx->context; int ret; - ret = LZ4_compress_HC(req->src, 
req->dst, req->src_len, req->dst_len, - params->level, ctx->context); + if (!zctx->cstrm) { + ret = LZ4_compress_HC(req->src, req->dst, req->src_len, + req->dst_len, params->level, + zctx->mem); + } else { + /* Cstrm needs to be reset */ + LZ4_resetStreamHC(zctx->cstrm, params->level); + ret = LZ4_loadDictHC(zctx->cstrm, params->dict, + params->dict_sz); + if (ret != params->dict_sz) + return -EINVAL; + ret = LZ4_compress_HC_continue(zctx->cstrm, req->src, req->dst, + req->src_len, req->dst_len); + } if (!ret) return -EINVAL; req->dst_len = ret; @@ -47,10 +96,22 @@ static int lz4hc_compress(struct zcomp_params *params, struct zcomp_ctx *ctx, static int lz4hc_decompress(struct zcomp_params *params, struct zcomp_ctx *ctx, struct zcomp_req *req) { + struct lz4hc_ctx *zctx = ctx->context; int ret; - ret = LZ4_decompress_safe(req->src, req->dst, req->src_len, - req->dst_len); + if (!zctx->dstrm) { + ret = LZ4_decompress_safe(req->src, req->dst, req->src_len, + req->dst_len); + } else { + /* Dstrm needs to be reset */ + ret = LZ4_setStreamDecode(zctx->dstrm, params->dict, + params->dict_sz); + if (!ret) + return -EINVAL; + ret = LZ4_decompress_safe_continue(zctx->dstrm, req->src, + req->dst, req->src_len, + req->dst_len); + } if (ret < 0) return -EINVAL; return 0; From 6a559ecd6e7ec26b691add74eab8ee3eb5dd2a87 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:56:10 +0900 Subject: [PATCH 352/416] zram: add dictionary support to zstd backend This adds support for pre-trained zstd dictionaries [1] Dictionary is setup in params once (per-comp) and loaded to Cctx and Dctx by reference, so we don't allocate extra memory. TEST ==== *** zstd /sys/block/zram0/mm_stat 1750654976 504565092 514203648 0 514203648 1 0 34204 34204 *** zstd dict=/etc/zstd-dict-amd64 /sys/block/zram0/mm_stat 1750638592 465851259 475373568 0 475373568 1 0 34185 34185 *** zstd level=8 dict=/etc/zstd-dict-amd64 /sys/block/zram0/mm_stat 1750642688 430765171 439955456 0 439955456 1 0 34185 34185 [1] https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md#dictionary-builder Link: https://lkml.kernel.org/r/20240902105656.1383858-23-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- drivers/block/zram/backend_zstd.c | 144 +++++++++++++++++++++++++----- 1 file changed, 121 insertions(+), 23 deletions(-) diff --git a/drivers/block/zram/backend_zstd.c b/drivers/block/zram/backend_zstd.c index 5b33daf4f645..1184c0036f44 100644 --- a/drivers/block/zram/backend_zstd.c +++ b/drivers/block/zram/backend_zstd.c @@ -15,29 +15,87 @@ struct zstd_ctx { }; struct zstd_params { + zstd_custom_mem custom_mem; + zstd_cdict *cdict; + zstd_ddict *ddict; zstd_parameters cprm; }; +/* + * For C/D dictionaries we need to provide zstd with zstd_custom_mem, + * which zstd uses internally to allocate/free memory when needed. + * + * This means that allocator.customAlloc() can be called from zcomp_compress() + * under local-lock (per-CPU compression stream), in which case we must use + * GFP_ATOMIC. + * + * Another complication here is that we can be configured as a swap device. 
+ */ +static void *zstd_custom_alloc(void *opaque, size_t size) +{ + if (!preemptible()) + return kvzalloc(size, GFP_ATOMIC); + + return kvzalloc(size, __GFP_KSWAPD_RECLAIM | __GFP_NOWARN); +} + +static void zstd_custom_free(void *opaque, void *address) +{ + kvfree(address); +} + static void zstd_release_params(struct zcomp_params *params) { - kfree(params->drv_data); + struct zstd_params *zp = params->drv_data; + + params->drv_data = NULL; + if (!zp) + return; + + zstd_free_cdict(zp->cdict); + zstd_free_ddict(zp->ddict); + kfree(zp); } static int zstd_setup_params(struct zcomp_params *params) { + zstd_compression_parameters prm; struct zstd_params *zp; zp = kzalloc(sizeof(*zp), GFP_KERNEL); if (!zp) return -ENOMEM; + params->drv_data = zp; if (params->level == ZCOMP_PARAM_NO_LEVEL) params->level = zstd_default_clevel(); zp->cprm = zstd_get_params(params->level, PAGE_SIZE); - params->drv_data = zp; + + zp->custom_mem.customAlloc = zstd_custom_alloc; + zp->custom_mem.customFree = zstd_custom_free; + + prm = zstd_get_cparams(params->level, PAGE_SIZE, + params->dict_sz); + + zp->cdict = zstd_create_cdict_byreference(params->dict, + params->dict_sz, + prm, + zp->custom_mem); + if (!zp->cdict) + goto error; + + zp->ddict = zstd_create_ddict_byreference(params->dict, + params->dict_sz, + zp->custom_mem); + if (!zp->ddict) + goto error; return 0; + +error: + zstd_release_params(params); + return -EINVAL; } static void zstd_destroy(struct zcomp_ctx *ctx) @@ -47,8 +105,23 @@ static void zstd_destroy(struct zcomp_ctx *ctx) if (!zctx) return; - vfree(zctx->cctx_mem); - vfree(zctx->dctx_mem); + /* + * If ->cctx_mem and ->dctx_mem were allocated then we didn't use + * C/D dictionary and ->cctx / ->dctx were "embedded" into these + * buffers. + * + * If otherwise then we need to explicitly release ->cctx / ->dctx. 
+ */ + if (zctx->cctx_mem) + vfree(zctx->cctx_mem); + else + zstd_free_cctx(zctx->cctx); + + if (zctx->dctx_mem) + vfree(zctx->dctx_mem); + else + zstd_free_dctx(zctx->dctx); + kfree(zctx); } @@ -63,28 +136,41 @@ static int zstd_create(struct zcomp_params *params, struct zcomp_ctx *ctx) return -ENOMEM; ctx->context = zctx; - prm = zstd_get_params(params->level, PAGE_SIZE); - sz = zstd_cctx_workspace_bound(&prm.cParams); - zctx->cctx_mem = vzalloc(sz); - if (!zctx->cctx_mem) - goto error; + if (params->dict_sz == 0) { + prm = zstd_get_params(params->level, PAGE_SIZE); + sz = zstd_cctx_workspace_bound(&prm.cParams); + zctx->cctx_mem = vzalloc(sz); + if (!zctx->cctx_mem) + goto error; - zctx->cctx = zstd_init_cctx(zctx->cctx_mem, sz); - if (!zctx->cctx) - goto error; + zctx->cctx = zstd_init_cctx(zctx->cctx_mem, sz); + if (!zctx->cctx) + goto error; - sz = zstd_dctx_workspace_bound(); - zctx->dctx_mem = vzalloc(sz); - if (!zctx->dctx_mem) - goto error; + sz = zstd_dctx_workspace_bound(); + zctx->dctx_mem = vzalloc(sz); + if (!zctx->dctx_mem) + goto error; - zctx->dctx = zstd_init_dctx(zctx->dctx_mem, sz); - if (!zctx->dctx) - goto error; + zctx->dctx = zstd_init_dctx(zctx->dctx_mem, sz); + if (!zctx->dctx) + goto error; + } else { + struct zstd_params *zp = params->drv_data; + + zctx->cctx = zstd_create_cctx_advanced(zp->custom_mem); + if (!zctx->cctx) + goto error; + + zctx->dctx = zstd_create_dctx_advanced(zp->custom_mem); + if (!zctx->dctx) + goto error; + } return 0; error: + zstd_release_params(params); zstd_destroy(ctx); return -EINVAL; } @@ -96,8 +182,14 @@ static int zstd_compress(struct zcomp_params *params, struct zcomp_ctx *ctx, struct zstd_ctx *zctx = ctx->context; size_t ret; - ret = zstd_compress_cctx(zctx->cctx, req->dst, req->dst_len, - req->src, req->src_len, &zp->cprm); + if (params->dict_sz == 0) + ret = zstd_compress_cctx(zctx->cctx, req->dst, req->dst_len, + req->src, req->src_len, &zp->cprm); + else + ret = zstd_compress_using_cdict(zctx->cctx, req->dst, + req->dst_len, req->src, + req->src_len, + zp->cdict); if (zstd_is_error(ret)) return -EINVAL; req->dst_len = ret; @@ -107,11 +199,17 @@ static int zstd_compress(struct zcomp_params *params, struct zcomp_ctx *ctx, static int zstd_decompress(struct zcomp_params *params, struct zcomp_ctx *ctx, struct zcomp_req *req) { + struct zstd_params *zp = params->drv_data; struct zstd_ctx *zctx = ctx->context; size_t ret; - ret = zstd_decompress_dctx(zctx->dctx, req->dst, req->dst_len, - req->src, req->src_len); + if (params->dict_sz == 0) + ret = zstd_decompress_dctx(zctx->dctx, req->dst, req->dst_len, + req->src, req->src_len); + else + ret = zstd_decompress_using_ddict(zctx->dctx, req->dst, + req->dst_len, req->src, + req->src_len, zp->ddict); if (zstd_is_error(ret)) return -EINVAL; return 0; From 97ee4842f23837c80091d7ac04b01e58d56085a5 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:56:11 +0900 Subject: [PATCH 353/416] Documentation/zram: add documentation for algorithm parameters Document brief description of compression algorithms' parameters: compression level and pre-trained dictionary. 
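As a hedged illustration of what the documented knobs correspond to on the kernel side (the struct zcomp_params fields come from earlier in this series; the helper itself is invented for this example):

    /*
     * Sketch only: `level=` is parsed into ->level (or left as
     * ZCOMP_PARAM_NO_LEVEL so the backend picks its default), and
     * `dict=` is loaded into ->dict / ->dict_sz before the backend's
     * ->setup_params() callback runs.
     */
    static void example_describe_params(struct zcomp_params *params)
    {
            if (params->level == ZCOMP_PARAM_NO_LEVEL)
                    pr_info("zram: backend default compression level\n");
            if (params->dict_sz)
                    pr_info("zram: pre-trained dictionary, %zu bytes\n",
                            params->dict_sz);
    }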
[senozhatsky@chromium.org: trivial fixup] Link: https://lkml.kernel.org/r/20240903063722.1603592-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20240902105656.1383858-24-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- Documentation/admin-guide/blockdev/zram.rst | 47 +++++++++++++++++---- 1 file changed, 39 insertions(+), 8 deletions(-) diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 96d81dc12528..d3b676273528 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -105,7 +105,38 @@ Examples:: For the time being, the `comp_algorithm` content shows only compression algorithms that are supported by zram. -4) Set Disksize +4) Set compression algorithm parameters: Optional +================================================= + +Compression algorithms may support specific parameters which can be +tweaked for particular dataset. ZRAM has an `algorithm_params` device +attribute which provides a per-algorithm params configuration. + +For example, several compression algorithms support `level` parameter. +In addition, certain compression algorithms support pre-trained dictionaries, +which significantly change algorithms' characteristics. In order to configure +compression algorithm to use external pre-trained dictionary, pass full +path to the `dict` along with other parameters:: + + #pass path to pre-trained zstd dictionary + echo "algo=zstd dict=/etc/dictioary" > /sys/block/zram0/algorithm_params + + #same, but using algorithm priority + echo "priority=1 dict=/etc/dictioary" > \ + /sys/block/zram0/algorithm_params + + #pass path to pre-trained zstd dictionary and compression level + echo "algo=zstd level=8 dict=/etc/dictioary" > \ + /sys/block/zram0/algorithm_params + +Parameters are algorithm specific: not all algorithms support pre-trained +dictionaries, not all algorithms support `level`. Furthermore, for certain +algorithms `level` controls the compression level (the higher the value the +better the compression ratio, it even can take negatives values for some +algorithms), for other algorithms `level` is acceleration level (the higher +the value the lower the compression ratio). + +5) Set Disksize =============== Set disk size by writing the value to sysfs node 'disksize'. @@ -125,7 +156,7 @@ There is little point creating a zram of greater than twice the size of memory since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the size of the disk when not in use so a huge zram is wasteful. -5) Set memory limit: Optional +6) Set memory limit: Optional ============================= Set memory limit by writing the value to sysfs node 'mem_limit'. 
@@ -144,7 +175,7 @@ Examples:: # To disable memory limit echo 0 > /sys/block/zram0/mem_limit -6) Activate +7) Activate =========== :: @@ -155,7 +186,7 @@ Examples:: mkfs.ext4 /dev/zram1 mount /dev/zram1 /tmp -7) Add/remove zram devices +8) Add/remove zram devices ========================== zram provides a control interface, which enables dynamic (on-demand) device @@ -175,7 +206,7 @@ execute:: echo X > /sys/class/zram-control/hot_remove -8) Stats +9) Stats ======== Per-device statistics are exported as various nodes under /sys/block/zram/ @@ -277,15 +308,15 @@ a single line of text and contains the following stats separated by whitespace: Unit: 4K bytes ============== ============================================================= -9) Deactivate -============= +10) Deactivate +============== :: swapoff /dev/zram0 umount /dev/zram1 -10) Reset +11) Reset ========= Write any positive value to 'reset' sysfs node:: From e899007a5e10084649f558593f7e8f26088492fc Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Mon, 2 Sep 2024 19:56:12 +0900 Subject: [PATCH 354/416] zram: support priority parameter in recompression recompress device attribute supports alg=NAME parameter so that we can specify only one particular algorithm we want to perform recompression with. However, with algo params we now can have several exactly same secondary algorithms but each with its own params tuning (e.g. priority 1 configured to use more aggressive level, and priority 2 configured to use a pre-trained dictionary). Support priority=NUM parameter so that we can correctly determine which secondary algorithm we want to use. Link: https://lkml.kernel.org/r/20240902105656.1383858-25-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nick Terrell Signed-off-by: Andrew Morton --- Documentation/admin-guide/blockdev/zram.rst | 5 ++++- drivers/block/zram/zram_drv.c | 12 ++++++++++++ 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index d3b676273528..678d70d6e1c3 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -512,11 +512,14 @@ registered compression algorithms, increases our chances of finding the algorithm that successfully compresses a particular page. Sometimes, however, it is convenient (and sometimes even necessary) to limit recompression to only one particular algorithm so that it will not try any other algorithms. 
-This can be achieved by providing a algo=NAME parameter::: +This can be achieved by providing a `algo` or `priority` parameter::: #use zstd algorithm only (if registered) echo "type=huge algo=zstd" > /sys/block/zramX/recompress + #use zstd algorithm only (if zstd was registered under priority 1) + echo "type=huge priority=1" > /sys/block/zramX/recompress + memory tracking =============== diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 9b29f9e6cb50..1f1bf175a6c3 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -1855,6 +1855,18 @@ static ssize_t recompress_store(struct device *dev, algo = val; continue; } + + if (!strcmp(param, "priority")) { + ret = kstrtouint(val, 10, &prio); + if (ret) + return ret; + + if (prio == ZRAM_PRIMARY_COMP) + prio = ZRAM_SECONDARY_COMP; + + prio_max = min(prio + 1, ZRAM_MAX_COMPS); + continue; + } } if (threshold >= huge_class_size) From e1e4cfd01a6e75dd4c810aeac115340805cf63ff Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Tue, 3 Sep 2024 11:19:28 -0400 Subject: [PATCH 355/416] mm,tmpfs: consider end of file write in shmem_is_huge Take the end of a file write into consideration when deciding whether or not to use huge pages for tmpfs files when the tmpfs filesystem is mounted with huge=within_size This allows large writes that append to the end of a file to automatically use large pages. Doing 4MB sequential writes without fallocate to a 16GB tmpfs file with fio. The numbers without THP or with huge=always stay the same, but the performance with huge=within_size now matches that of huge=always. huge before after 4kB pages 1560 MB/s 1560 MB/s within_size 1560 MB/s 4720 MB/s always: 4720 MB/s 4720 MB/s [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20240903111928.7171e60c@imladris.surriel.com Signed-off-by: Rik van Riel Reviewed-by: Baolin Wang Tested-by: Baolin Wang Cc: Darrick J. Wong Cc: Hugh Dickins Cc: Matthew Wilcox Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- fs/xfs/scrub/xfile.c | 6 ++-- fs/xfs/xfs_buf_mem.c | 2 +- include/linux/shmem_fs.h | 8 +++--- mm/huge_memory.c | 2 +- mm/khugepaged.c | 2 +- mm/shmem.c | 59 +++++++++++++++++++++------------------- mm/userfaultfd.c | 2 +- 7 files changed, 42 insertions(+), 39 deletions(-) diff --git a/fs/xfs/scrub/xfile.c b/fs/xfs/scrub/xfile.c index 9b5d98fe1f8a..c753c79df203 100644 --- a/fs/xfs/scrub/xfile.c +++ b/fs/xfs/scrub/xfile.c @@ -126,7 +126,7 @@ xfile_load( unsigned int len; unsigned int offset; - if (shmem_get_folio(inode, pos >> PAGE_SHIFT, &folio, + if (shmem_get_folio(inode, pos >> PAGE_SHIFT, 0, &folio, SGP_READ) < 0) break; if (!folio) { @@ -196,7 +196,7 @@ xfile_store( unsigned int len; unsigned int offset; - if (shmem_get_folio(inode, pos >> PAGE_SHIFT, &folio, + if (shmem_get_folio(inode, pos >> PAGE_SHIFT, 0, &folio, SGP_CACHE) < 0) break; if (filemap_check_wb_err(inode->i_mapping, 0)) { @@ -267,7 +267,7 @@ xfile_get_folio( i_size_write(inode, pos + len); pflags = memalloc_nofs_save(); - error = shmem_get_folio(inode, pos >> PAGE_SHIFT, &folio, + error = shmem_get_folio(inode, pos >> PAGE_SHIFT, 0, &folio, (flags & XFILE_ALLOC) ? 
SGP_CACHE : SGP_READ); memalloc_nofs_restore(pflags); if (error) diff --git a/fs/xfs/xfs_buf_mem.c b/fs/xfs/xfs_buf_mem.c index 9bb2d24de709..07bebbfb16ee 100644 --- a/fs/xfs/xfs_buf_mem.c +++ b/fs/xfs/xfs_buf_mem.c @@ -149,7 +149,7 @@ xmbuf_map_page( return -ENOMEM; } - error = shmem_get_folio(inode, pos >> PAGE_SHIFT, &folio, SGP_CACHE); + error = shmem_get_folio(inode, pos >> PAGE_SHIFT, 0, &folio, SGP_CACHE); if (error) return error; diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 1564d7d3ca61..515a9a6a3c6f 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -113,11 +113,11 @@ int shmem_unuse(unsigned int type); #ifdef CONFIG_TRANSPARENT_HUGEPAGE unsigned long shmem_allowable_huge_orders(struct inode *inode, struct vm_area_struct *vma, pgoff_t index, - bool shmem_huge_force); + loff_t write_end, bool shmem_huge_force); #else static inline unsigned long shmem_allowable_huge_orders(struct inode *inode, struct vm_area_struct *vma, pgoff_t index, - bool shmem_huge_force) + loff_t write_end, bool shmem_huge_force) { return 0; } @@ -143,8 +143,8 @@ enum sgp_type { SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */ }; -int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop, - enum sgp_type sgp); +int shmem_get_folio(struct inode *inode, pgoff_t index, loff_t write_end, + struct folio **foliop, enum sgp_type sgp); struct folio *shmem_read_folio_gfp(struct address_space *mapping, pgoff_t index, gfp_t gfp); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 691702e39f85..77092581f90d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -164,7 +164,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, */ if (!in_pf && shmem_file(vma->vm_file)) return shmem_allowable_huge_orders(file_inode(vma->vm_file), - vma, vma->vm_pgoff, + vma, vma->vm_pgoff, 0, !enforce_sysfs); if (!vma_is_anonymous(vma)) { diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 32100041aef3..f9c39898eaff 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1870,7 +1870,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, if (xa_is_value(folio) || !folio_test_uptodate(folio)) { xas_unlock_irq(&xas); /* swap in or instantiate fallocated page */ - if (shmem_get_folio(mapping->host, index, + if (shmem_get_folio(mapping->host, index, 0, &folio, SGP_NOALLOC)) { result = SCAN_FAIL; goto xa_unlocked; diff --git a/mm/shmem.c b/mm/shmem.c index 553b99cb265e..74f093d88c78 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -549,7 +549,8 @@ static bool shmem_confirm_swap(struct address_space *mapping, static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER; static bool __shmem_huge_global_enabled(struct inode *inode, pgoff_t index, - bool shmem_huge_force, struct vm_area_struct *vma, + loff_t write_end, bool shmem_huge_force, + struct vm_area_struct *vma, unsigned long vm_flags) { struct mm_struct *mm = vma ? 
vma->vm_mm : NULL; @@ -569,7 +570,8 @@ static bool __shmem_huge_global_enabled(struct inode *inode, pgoff_t index, return true; case SHMEM_HUGE_WITHIN_SIZE: index = round_up(index + 1, HPAGE_PMD_NR); - i_size = round_up(i_size_read(inode), PAGE_SIZE); + i_size = max(write_end, i_size_read(inode)); + i_size = round_up(i_size, PAGE_SIZE); if (i_size >> PAGE_SHIFT >= index) return true; fallthrough; @@ -583,14 +585,14 @@ static bool __shmem_huge_global_enabled(struct inode *inode, pgoff_t index, } static bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index, - bool shmem_huge_force, struct vm_area_struct *vma, - unsigned long vm_flags) + loff_t write_end, bool shmem_huge_force, + struct vm_area_struct *vma, unsigned long vm_flags) { if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER) return false; - return __shmem_huge_global_enabled(inode, index, shmem_huge_force, - vma, vm_flags); + return __shmem_huge_global_enabled(inode, index, write_end, + shmem_huge_force, vma, vm_flags); } #if defined(CONFIG_SYSFS) @@ -770,8 +772,8 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo, } static bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index, - bool shmem_huge_force, struct vm_area_struct *vma, - unsigned long vm_flags) + loff_t write_end, bool shmem_huge_force, + struct vm_area_struct *vma, unsigned long vm_flags) { return false; } @@ -978,7 +980,7 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index) * (although in some cases this is just a waste of time). */ folio = NULL; - shmem_get_folio(inode, index, &folio, SGP_READ); + shmem_get_folio(inode, index, 0, &folio, SGP_READ); return folio; } @@ -1166,7 +1168,7 @@ static int shmem_getattr(struct mnt_idmap *idmap, STATX_ATTR_NODUMP); generic_fillattr(idmap, request_mask, inode, stat); - if (shmem_huge_global_enabled(inode, 0, false, NULL, 0)) + if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0)) stat->blksize = HPAGE_PMD_SIZE; if (request_mask & STATX_BTIME) { @@ -1653,7 +1655,7 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp) #ifdef CONFIG_TRANSPARENT_HUGEPAGE unsigned long shmem_allowable_huge_orders(struct inode *inode, struct vm_area_struct *vma, pgoff_t index, - bool shmem_huge_force) + loff_t write_end, bool shmem_huge_force) { unsigned long mask = READ_ONCE(huge_shmem_orders_always); unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size); @@ -1670,8 +1672,8 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode, if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED)) return 0; - global_huge = shmem_huge_global_enabled(inode, index, shmem_huge_force, - vma, vm_flags); + global_huge = shmem_huge_global_enabled(inode, index, write_end, + shmem_huge_force, vma, vm_flags); if (!vma || !vma_is_anon_shmem(vma)) { /* * For tmpfs, we now only support PMD sized THP if huge page @@ -2231,8 +2233,8 @@ unlock: * vmf and fault_type are only supplied by shmem_fault: otherwise they are NULL. */ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index, - struct folio **foliop, enum sgp_type sgp, gfp_t gfp, - struct vm_fault *vmf, vm_fault_t *fault_type) + loff_t write_end, struct folio **foliop, enum sgp_type sgp, + gfp_t gfp, struct vm_fault *vmf, vm_fault_t *fault_type) { struct vm_area_struct *vma = vmf ? vmf->vma : NULL; struct mm_struct *fault_mm; @@ -2312,7 +2314,7 @@ repeat: } /* Find hugepage orders that are allowed for anonymous shmem and tmpfs. 
*/ - orders = shmem_allowable_huge_orders(inode, vma, index, false); + orders = shmem_allowable_huge_orders(inode, vma, index, write_end, false); if (orders > 0) { gfp_t huge_gfp; @@ -2413,6 +2415,7 @@ unlock: * shmem_get_folio - find, and lock a shmem folio. * @inode: inode to search * @index: the page index. + * @write_end: end of a write, could extend inode size * @foliop: pointer to the folio if found * @sgp: SGP_* flags to control behavior * @@ -2432,10 +2435,10 @@ unlock: * Context: May sleep. * Return: 0 if successful, else a negative error code. */ -int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop, - enum sgp_type sgp) +int shmem_get_folio(struct inode *inode, pgoff_t index, loff_t write_end, + struct folio **foliop, enum sgp_type sgp) { - return shmem_get_folio_gfp(inode, index, foliop, sgp, + return shmem_get_folio_gfp(inode, index, write_end, foliop, sgp, mapping_gfp_mask(inode->i_mapping), NULL, NULL); } EXPORT_SYMBOL_GPL(shmem_get_folio); @@ -2530,7 +2533,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf) } WARN_ON_ONCE(vmf->page != NULL); - err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE, + err = shmem_get_folio_gfp(inode, vmf->pgoff, 0, &folio, SGP_CACHE, gfp, vmf, &ret); if (err) return vmf_error(err); @@ -3040,7 +3043,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping, return -EPERM; } - ret = shmem_get_folio(inode, index, &folio, SGP_WRITE); + ret = shmem_get_folio(inode, index, pos + len, &folio, SGP_WRITE); if (ret) return ret; @@ -3111,7 +3114,7 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) break; } - error = shmem_get_folio(inode, index, &folio, SGP_READ); + error = shmem_get_folio(inode, index, 0, &folio, SGP_READ); if (error) { if (error == -EINVAL) error = 0; @@ -3287,7 +3290,7 @@ static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos, if (*ppos >= i_size_read(inode)) break; - error = shmem_get_folio(inode, *ppos / PAGE_SIZE, &folio, + error = shmem_get_folio(inode, *ppos / PAGE_SIZE, 0, &folio, SGP_READ); if (error) { if (error == -EINVAL) @@ -3477,8 +3480,8 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced) error = -ENOMEM; else - error = shmem_get_folio(inode, index, &folio, - SGP_FALLOC); + error = shmem_get_folio(inode, index, offset + len, + &folio, SGP_FALLOC); if (error) { info->fallocend = undo_fallocend; /* Remove the !uptodate folios we added */ @@ -3829,7 +3832,7 @@ static int shmem_symlink(struct mnt_idmap *idmap, struct inode *dir, } else { inode_nohighmem(inode); inode->i_mapping->a_ops = &shmem_aops; - error = shmem_get_folio(inode, 0, &folio, SGP_WRITE); + error = shmem_get_folio(inode, 0, 0, &folio, SGP_WRITE); if (error) goto out_remove_offset; inode->i_op = &shmem_symlink_inode_operations; @@ -3875,7 +3878,7 @@ static const char *shmem_get_link(struct dentry *dentry, struct inode *inode, return ERR_PTR(-ECHILD); } } else { - error = shmem_get_folio(inode, 0, &folio, SGP_READ); + error = shmem_get_folio(inode, 0, 0, &folio, SGP_READ); if (error) return ERR_PTR(error); if (!folio) @@ -5343,7 +5346,7 @@ struct folio *shmem_read_folio_gfp(struct address_space *mapping, struct folio *folio; int error; - error = shmem_get_folio_gfp(inode, index, &folio, SGP_CACHE, + error = shmem_get_folio_gfp(inode, index, 0, &folio, SGP_CACHE, gfp, NULL, NULL); if (error) return ERR_PTR(error); diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 
966e6c81a685..a609b2927848 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -391,7 +391,7 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd, struct page *page; int ret; - ret = shmem_get_folio(inode, pgoff, &folio, SGP_NOALLOC); + ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC); /* Our caller expects us to return -EFAULT if we failed to find folio */ if (ret == -ENOENT) ret = -EFAULT; From fc1b43c422f3deee0cfc221b071c3863dc077646 Mon Sep 17 00:00:00 2001 From: Takaya Saeki Date: Tue, 3 Sep 2024 10:21:00 +0000 Subject: [PATCH 356/416] filemap: fix the last_index of mm_filemap_get_pages In commit b6273b55d885 ("filemap: add trace events for get_pages, map_pages, and fault"), mm_filemap_get_pages was added to trace page cache access. However, it tracks an extra page beyond the end of the accessed range. This patch fixes it by replacing last_index with last_index - 1. Link: https://lkml.kernel.org/r/20240903102100.70405-1-takayas@chromium.org Fixes: b6273b55d885 ("filemap: add trace events for get_pages, map_pages, and fault") Signed-off-by: Takaya Saeki Cc: Junichi Uekawa Cc: Masami Hiramatsu Cc: Mathieu Desnoyers Cc: Matthew Wilcox Cc: Steven Rostedt (Google) Signed-off-by: Andrew Morton --- mm/filemap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/filemap.c b/mm/filemap.c index 88a2ed008474..7eb4637ea199 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2567,7 +2567,7 @@ retry: goto err; } - trace_mm_filemap_get_pages(mapping, index, last_index); + trace_mm_filemap_get_pages(mapping, index, last_index - 1); return 0; err: if (err < 0) From 5ad7a998ba921e37d7e373a2d70d7da401b27f94 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Tue, 3 Sep 2024 13:00:22 +0900 Subject: [PATCH 357/416] mm: Kconfig: fixup zsmalloc configuration zsmalloc is not exclusive to zswap. Commit b3fbd58fcbb1 ("mm: Kconfig: simplify zswap configuration") made CONFIG_ZSMALLOC only visible when CONFIG_ZSWAP is selected, which makes it impossible to menuconfig zsmalloc-specific features (stats, chain-size, etc.) on systems that use ZRAM but don't have ZSWAP enabled. Make zsmalloc depend on both ZRAM and ZSWAP. Link: https://lkml.kernel.org/r/20240903040143.1580705-1-senozhatsky@chromium.org Fixes: b3fbd58fcbb1 ("mm: Kconfig: simplify zswap configuration") Signed-off-by: Sergey Senozhatsky Cc: Johannes Weiner Cc: Minchan Kim Signed-off-by: Andrew Morton --- mm/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/Kconfig b/mm/Kconfig index 8078a4b3c509..7c23e8a889d0 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -188,7 +188,7 @@ config Z3FOLD config ZSMALLOC tristate - prompt "N:1 compression allocator (zsmalloc)" if ZSWAP + prompt "N:1 compression allocator (zsmalloc)" if (ZSWAP || ZRAM) depends on MMU help zsmalloc is a slab-based memory allocator designed to store From f0679f9e6d88ae9e84fca35ada84f5859a19b227 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Wed, 4 Sep 2024 10:29:31 -0700 Subject: [PATCH 358/416] mm/damon/tests/vaddr-kunit: init maple tree without MT_FLAGS_LOCK_EXTERN damon_test_three_regions_in_vmas() initializes a maple tree with MM_MT_FLAGS. The flags contains MT_FLAGS_LOCK_EXTERN, which means mt_lock of the maple tree will not be used. And therefore the maple tree initialization code skips initialization of the mt_lock. However, __link_vmas(), which adds vmas for test to the maple tree, uses the mt_lock. In other words, the uninitialized spinlock is used. 
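For context, a simplified sketch of why the external-lock flag leaves the spinlock untouched (modelled on mt_init_flags(); the exact body here is an illustrative assumption, not a quote of the header):

    static inline void mt_init_flags_sketch(struct maple_tree *mt,
                                            unsigned int flags)
    {
            mt->ma_flags = flags;
            /* MT_FLAGS_LOCK_EXTERN means the caller provides locking, */
            /* so the internal ma_lock is deliberately left uninitialized */
            if (!(flags & MT_FLAGS_LOCK_EXTERN))
                    spin_lock_init(&mt->ma_lock);
            rcu_assign_pointer(mt->ma_root, NULL);
    }

Any later path that still takes mt->ma_lock, such as the test helper above, then operates on an uninitialized lock.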
The problem becomes clear when spinlock debugging is turned on, since it reports spinlock bad magic bug. Fix the issue by excluding MT_FLAGS_LOCK_EXTERN from the maple tree initialization flags. Note that we don't use empty flags to make it further similar to the usage of mm maple tree, and to be prepared for possible future changes, as suggested by Liam. Link: https://lkml.kernel.org/r/20240904172931.1284-1-sj@kernel.org Fixes: d0cf3dd47f0d ("damon: convert __damon_va_three_regions to use the VMA iterator") Signed-off-by: SeongJae Park Reported-by: Guenter Roeck Closes: https://lore.kernel.org/1453b2b2-6119-4082-ad9e-f3c5239bf87e@roeck-us.net Suggested-by: Liam R. Howlett Tested-by: Guenter Roeck Signed-off-by: Andrew Morton --- mm/damon/tests/vaddr-kunit.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/damon/tests/vaddr-kunit.h b/mm/damon/tests/vaddr-kunit.h index 83626483f82b..a339d117150f 100644 --- a/mm/damon/tests/vaddr-kunit.h +++ b/mm/damon/tests/vaddr-kunit.h @@ -77,7 +77,7 @@ static void damon_test_three_regions_in_vmas(struct kunit *test) (struct vm_area_struct) {.vm_start = 307, .vm_end = 330}, }; - mt_init_flags(&mm.mm_mt, MM_MT_FLAGS); + mt_init_flags(&mm.mm_mt, MT_FLAGS_ALLOC_RANGE | MT_FLAGS_USE_RCU); if (__link_vmas(&mm.mm_mt, vmas, ARRAY_SIZE(vmas))) kunit_skip(test, "Failed to create VMA tree"); From 25d4054cc97484f2555709ac233f955f674e026a Mon Sep 17 00:00:00 2001 From: Mark Brown Date: Wed, 4 Sep 2024 17:57:59 +0100 Subject: [PATCH 359/416] mm: make arch_get_unmapped_area() take vm_flags by default Patch series "mm: Care about shadow stack guard gap when getting an unmapped area", v2. As covered in the commit log for c44357c2e76b ("x86/mm: care about shadow stack guard gap during placement") our current mmap() implementation does not take care to ensure that a new mapping isn't placed with existing mappings inside it's own guard gaps. This is particularly important for shadow stacks since if two shadow stacks end up getting placed adjacent to each other then they can overflow into each other which weakens the protection offered by the feature. On x86 there is a custom arch_get_unmapped_area() which was updated by the above commit to cover this case by specifying a start_gap for allocations with VM_SHADOW_STACK. Both arm64 and RISC-V have equivalent features and use the generic implementation of arch_get_unmapped_area() so let's make the equivalent change there so they also don't get shadow stack pages placed without guard pages. The arm64 and RISC-V shadow stack implementations are currently on the list: https://lore.kernel.org/r/20240829-arm64-gcs-v12-0-42fec94743 https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/ Given the addition of the use of vm_flags in the generic implementation we also simplify the set of possibilities that have to be dealt with in the core code by making arch_get_unmapped_area() take vm_flags as standard. This is a bit invasive since the prototype change touches quite a few architectures but since the parameter is ignored the change is straightforward, the simplification for the generic code seems worth it. This patch (of 3): When we introduced arch_get_unmapped_area_vmflags() in 961148704acd ("mm: introduce arch_get_unmapped_area_vmflags()") we did so as part of properly supporting guard pages for shadow stacks on x86_64, which uses a custom arch_get_unmapped_area(). 
Equivalent features are also present on both arm64 and RISC-V, both of which use the generic implementation of arch_get_unmapped_area() and will require equivalent modification there. Rather than continue to deal with having two versions of the functions let's bite the bullet and have all implementations of arch_get_unmapped_area() take vm_flags as a parameter. The new parameter is currently ignored by all implementations other than x86. The only caller that doesn't have a vm_flags available is mm_get_unmapped_area(), as for the x86 implementation and the wrapper used on other architectures this is modified to supply no flags. No functional changes. Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-0-a46b8b6dc0ed@kernel.org Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-1-a46b8b6dc0ed@kernel.org Signed-off-by: Mark Brown Acked-by: Lorenzo Stoakes Reviewed-by: Liam R. Howlett Acked-by: Helge Deller [parisc] Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Borislav Petkov Cc: Christian Borntraeger Cc: Christophe Leroy Cc: Chris Zankel Cc: Dave Hansen Cc: David S. Miller Cc: "Edgecombe, Rick P" Cc: Gerald Schaefer Cc: Guo Ren Cc: Heiko Carstens Cc: "H. Peter Anvin" Cc: Huacai Chen Cc: Ingo Molnar Cc: Ivan Kokshaysky Cc: James Bottomley Cc: John Paul Adrian Glaubitz Cc: Matt Turner Cc: Max Filippov Cc: Michael Ellerman Cc: Naveen N Rao Cc: Nicholas Piggin Cc: Richard Henderson Cc: Rich Felker Cc: Russell King Cc: Sven Schnelle Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vineet Gupta Cc: Vlastimil Babka Cc: WANG Xuerui Cc: Yoshinori Sato Signed-off-by: Andrew Morton --- arch/alpha/kernel/osf_sys.c | 2 +- arch/arc/mm/mmap.c | 3 ++- arch/arm/mm/mmap.c | 7 ++++--- arch/csky/abiv1/mmap.c | 3 ++- arch/loongarch/mm/mmap.c | 5 +++-- arch/mips/mm/mmap.c | 5 +++-- arch/parisc/kernel/sys_parisc.c | 5 +++-- arch/parisc/mm/hugetlbpage.c | 2 +- arch/powerpc/mm/book3s64/slice.c | 6 ++++-- arch/s390/mm/mmap.c | 4 ++-- arch/sh/mm/mmap.c | 5 +++-- arch/sparc/kernel/sys_sparc_32.c | 2 +- arch/sparc/kernel/sys_sparc_64.c | 4 ++-- arch/x86/include/asm/pgtable_64.h | 1 - arch/x86/kernel/sys_x86_64.c | 21 +++------------------ arch/xtensa/kernel/syscall.c | 3 ++- include/linux/sched/mm.h | 23 ++++++++--------------- mm/mmap.c | 31 +++++++------------------------ 18 files changed, 51 insertions(+), 81 deletions(-) diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c index e5f881bc8288..8886ab539273 100644 --- a/arch/alpha/kernel/osf_sys.c +++ b/arch/alpha/kernel/osf_sys.c @@ -1229,7 +1229,7 @@ arch_get_unmapped_area_1(unsigned long addr, unsigned long len, unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, - unsigned long flags) + unsigned long flags, vm_flags_t vm_flags) { unsigned long limit; diff --git a/arch/arc/mm/mmap.c b/arch/arc/mm/mmap.c index 69a915297155..2185afe8d59f 100644 --- a/arch/arc/mm/mmap.c +++ b/arch/arc/mm/mmap.c @@ -23,7 +23,8 @@ */ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, - unsigned long len, unsigned long pgoff, unsigned long flags) + unsigned long len, unsigned long pgoff, + unsigned long flags, vm_flags_t vm_flags) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; diff --git a/arch/arm/mm/mmap.c b/arch/arm/mm/mmap.c index d65d0e6ed10a..3dbb383c26d5 100644 --- a/arch/arm/mm/mmap.c +++ b/arch/arm/mm/mmap.c @@ -28,7 +28,8 @@ */ unsigned long arch_get_unmapped_area(struct file *filp, unsigned 
long addr, - unsigned long len, unsigned long pgoff, unsigned long flags) + unsigned long len, unsigned long pgoff, + unsigned long flags, vm_flags_t vm_flags) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; @@ -78,8 +79,8 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, - const unsigned long len, const unsigned long pgoff, - const unsigned long flags) + const unsigned long len, const unsigned long pgoff, + const unsigned long flags, vm_flags_t vm_flags) { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; diff --git a/arch/csky/abiv1/mmap.c b/arch/csky/abiv1/mmap.c index 7f826331d409..1047865e82a9 100644 --- a/arch/csky/abiv1/mmap.c +++ b/arch/csky/abiv1/mmap.c @@ -23,7 +23,8 @@ */ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, - unsigned long len, unsigned long pgoff, unsigned long flags) + unsigned long len, unsigned long pgoff, + unsigned long flags, vm_flags_t vm_flags) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; diff --git a/arch/loongarch/mm/mmap.c b/arch/loongarch/mm/mmap.c index 889030985135..914e82ff3f65 100644 --- a/arch/loongarch/mm/mmap.c +++ b/arch/loongarch/mm/mmap.c @@ -89,7 +89,8 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, } unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr0, - unsigned long len, unsigned long pgoff, unsigned long flags) + unsigned long len, unsigned long pgoff, unsigned long flags, + vm_flags_t vm_flags) { return arch_get_unmapped_area_common(filp, addr0, len, pgoff, flags, UP); @@ -101,7 +102,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr0, */ unsigned long arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr0, unsigned long len, unsigned long pgoff, - unsigned long flags) + unsigned long flags, vm_flags_t vm_flags) { return arch_get_unmapped_area_common(filp, addr0, len, pgoff, flags, DOWN); diff --git a/arch/mips/mm/mmap.c b/arch/mips/mm/mmap.c index 7e11d7b58761..5d2a1225785b 100644 --- a/arch/mips/mm/mmap.c +++ b/arch/mips/mm/mmap.c @@ -98,7 +98,8 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, } unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr0, - unsigned long len, unsigned long pgoff, unsigned long flags) + unsigned long len, unsigned long pgoff, unsigned long flags, + vm_flags_t vm_flags) { return arch_get_unmapped_area_common(filp, addr0, len, pgoff, flags, UP); @@ -110,7 +111,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr0, */ unsigned long arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr0, unsigned long len, unsigned long pgoff, - unsigned long flags) + unsigned long flags, vm_flags_t vm_flags) { return arch_get_unmapped_area_common(filp, addr0, len, pgoff, flags, DOWN); diff --git a/arch/parisc/kernel/sys_parisc.c b/arch/parisc/kernel/sys_parisc.c index f7722451276e..f852fe274abe 100644 --- a/arch/parisc/kernel/sys_parisc.c +++ b/arch/parisc/kernel/sys_parisc.c @@ -167,7 +167,8 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, } unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, - unsigned long len, unsigned long pgoff, unsigned long flags) + unsigned long len, unsigned long pgoff, unsigned long flags, + vm_flags_t vm_flags) { return arch_get_unmapped_area_common(filp, addr, len, pgoff, flags, UP); @@ 
-175,7 +176,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, - unsigned long flags) + unsigned long flags, vm_flags_t vm_flags) { return arch_get_unmapped_area_common(filp, addr, len, pgoff, flags, DOWN); diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c index 0356199bd9e7..aa664f7ddb63 100644 --- a/arch/parisc/mm/hugetlbpage.c +++ b/arch/parisc/mm/hugetlbpage.c @@ -40,7 +40,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr, addr = ALIGN(addr, huge_page_size(h)); /* we need to make sure the colouring is OK */ - return arch_get_unmapped_area(file, addr, len, pgoff, flags); + return arch_get_unmapped_area(file, addr, len, pgoff, flags, 0); } diff --git a/arch/powerpc/mm/book3s64/slice.c b/arch/powerpc/mm/book3s64/slice.c index ef3ce37f1bb3..ada6bf896ef8 100644 --- a/arch/powerpc/mm/book3s64/slice.c +++ b/arch/powerpc/mm/book3s64/slice.c @@ -637,7 +637,8 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, - unsigned long flags) + unsigned long flags, + vm_flags_t vm_flags) { if (radix_enabled()) return generic_get_unmapped_area(filp, addr, len, pgoff, flags); @@ -650,7 +651,8 @@ unsigned long arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, const unsigned long len, const unsigned long pgoff, - const unsigned long flags) + const unsigned long flags, + vm_flags_t vm_flags) { if (radix_enabled()) return generic_get_unmapped_area_topdown(filp, addr0, len, pgoff, flags); diff --git a/arch/s390/mm/mmap.c b/arch/s390/mm/mmap.c index 206756946589..96efa061ce01 100644 --- a/arch/s390/mm/mmap.c +++ b/arch/s390/mm/mmap.c @@ -82,7 +82,7 @@ static int get_align_mask(struct file *filp, unsigned long flags) unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, - unsigned long flags) + unsigned long flags, vm_flags_t vm_flags) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; @@ -117,7 +117,7 @@ check_asce_limit: unsigned long arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, - unsigned long flags) + unsigned long flags, vm_flags_t vm_flags) { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; diff --git a/arch/sh/mm/mmap.c b/arch/sh/mm/mmap.c index bee329d4149a..c442734d9b0c 100644 --- a/arch/sh/mm/mmap.c +++ b/arch/sh/mm/mmap.c @@ -52,7 +52,8 @@ static inline unsigned long COLOUR_ALIGN(unsigned long addr, } unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, - unsigned long len, unsigned long pgoff, unsigned long flags) + unsigned long len, unsigned long pgoff, unsigned long flags, + vm_flags_t vm_flags) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma; @@ -99,7 +100,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, const unsigned long len, const unsigned long pgoff, - const unsigned long flags) + const unsigned long flags, vm_flags_t vm_flags) { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; diff --git a/arch/sparc/kernel/sys_sparc_32.c b/arch/sparc/kernel/sys_sparc_32.c index 08a19727795c..80822f922e76 100644 --- a/arch/sparc/kernel/sys_sparc_32.c +++ b/arch/sparc/kernel/sys_sparc_32.c @@ 
-39,7 +39,7 @@ SYSCALL_DEFINE0(getpagesize) return PAGE_SIZE; /* Possibly older binaries want 8192 on sun4's? */ } -unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) +unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags, vm_flags_t vm_flags) { struct vm_unmapped_area_info info = {}; diff --git a/arch/sparc/kernel/sys_sparc_64.c b/arch/sparc/kernel/sys_sparc_64.c index d9c3b34ca744..acade309dc2f 100644 --- a/arch/sparc/kernel/sys_sparc_64.c +++ b/arch/sparc/kernel/sys_sparc_64.c @@ -87,7 +87,7 @@ static inline unsigned long COLOR_ALIGN(unsigned long addr, return base + off; } -unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) +unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags, vm_flags_t vm_flags) { struct mm_struct *mm = current->mm; struct vm_area_struct * vma; @@ -146,7 +146,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsi unsigned long arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, const unsigned long len, const unsigned long pgoff, - const unsigned long flags) + const unsigned long flags, vm_flags_t vm_flags) { struct vm_area_struct *vma; struct mm_struct *mm = current->mm; diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 3c4407271d08..7e9db77231ac 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -245,7 +245,6 @@ extern void cleanup_highmap(void); #define HAVE_ARCH_UNMAPPED_AREA #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN -#define HAVE_ARCH_UNMAPPED_AREA_VMFLAGS #define PAGE_AGP PAGE_KERNEL_NOCACHE #define HAVE_PAGE_AGP 1 diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c index 01d7cd85ef97..87f8c9a71c49 100644 --- a/arch/x86/kernel/sys_x86_64.c +++ b/arch/x86/kernel/sys_x86_64.c @@ -121,7 +121,7 @@ static inline unsigned long stack_guard_placement(vm_flags_t vm_flags) } unsigned long -arch_get_unmapped_area_vmflags(struct file *filp, unsigned long addr, unsigned long len, +arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags, vm_flags_t vm_flags) { struct mm_struct *mm = current->mm; @@ -158,7 +158,7 @@ arch_get_unmapped_area_vmflags(struct file *filp, unsigned long addr, unsigned l } unsigned long -arch_get_unmapped_area_topdown_vmflags(struct file *filp, unsigned long addr0, +arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr0, unsigned long len, unsigned long pgoff, unsigned long flags, vm_flags_t vm_flags) { @@ -228,20 +228,5 @@ bottomup: * can happen with large stack limits and large mmap() * allocations. 
*/ - return arch_get_unmapped_area(filp, addr0, len, pgoff, flags); -} - -unsigned long -arch_get_unmapped_area(struct file *filp, unsigned long addr, - unsigned long len, unsigned long pgoff, unsigned long flags) -{ - return arch_get_unmapped_area_vmflags(filp, addr, len, pgoff, flags, 0); -} - -unsigned long -arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr, - const unsigned long len, const unsigned long pgoff, - const unsigned long flags) -{ - return arch_get_unmapped_area_topdown_vmflags(filp, addr, len, pgoff, flags, 0); + return arch_get_unmapped_area(filp, addr0, len, pgoff, flags, 0); } diff --git a/arch/xtensa/kernel/syscall.c b/arch/xtensa/kernel/syscall.c index b3c2450d6f23..dc54f854c2f5 100644 --- a/arch/xtensa/kernel/syscall.c +++ b/arch/xtensa/kernel/syscall.c @@ -55,7 +55,8 @@ asmlinkage long xtensa_fadvise64_64(int fd, int advice, #ifdef CONFIG_MMU unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, - unsigned long len, unsigned long pgoff, unsigned long flags) + unsigned long len, unsigned long pgoff, unsigned long flags, + vm_flags_t vm_flags) { struct vm_area_struct *vmm; struct vma_iterator vmi; diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 91546493c43d..c4d34abc45d4 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -179,27 +179,20 @@ static inline void mm_update_next_owner(struct mm_struct *mm) extern void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack); -extern unsigned long -arch_get_unmapped_area(struct file *, unsigned long, unsigned long, - unsigned long, unsigned long); -extern unsigned long + +unsigned long +arch_get_unmapped_area(struct file *filp, unsigned long addr, + unsigned long len, unsigned long pgoff, + unsigned long flags, vm_flags_t vm_flags); +unsigned long arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr, - unsigned long len, unsigned long pgoff, - unsigned long flags); + unsigned long len, unsigned long pgoff, + unsigned long flags, vm_flags_t); unsigned long mm_get_unmapped_area(struct mm_struct *mm, struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags); -unsigned long -arch_get_unmapped_area_vmflags(struct file *filp, unsigned long addr, - unsigned long len, unsigned long pgoff, - unsigned long flags, vm_flags_t vm_flags); -unsigned long -arch_get_unmapped_area_topdown_vmflags(struct file *filp, unsigned long addr, - unsigned long len, unsigned long pgoff, - unsigned long flags, vm_flags_t); - unsigned long mm_get_unmapped_area_vmflags(struct mm_struct *mm, struct file *filp, unsigned long addr, diff --git a/mm/mmap.c b/mm/mmap.c index 02f7b45c3076..98249266196e 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -770,7 +770,7 @@ generic_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, - unsigned long flags) + unsigned long flags, vm_flags_t vm_flags) { return generic_get_unmapped_area(filp, addr, len, pgoff, flags); } @@ -834,38 +834,21 @@ generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr, unsigned long arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, - unsigned long flags) + unsigned long flags, vm_flags_t vm_flags) { return generic_get_unmapped_area_topdown(filp, addr, len, pgoff, flags); } #endif -#ifndef HAVE_ARCH_UNMAPPED_AREA_VMFLAGS -unsigned long 
-arch_get_unmapped_area_vmflags(struct file *filp, unsigned long addr, unsigned long len, - unsigned long pgoff, unsigned long flags, vm_flags_t vm_flags) -{ - return arch_get_unmapped_area(filp, addr, len, pgoff, flags); -} - -unsigned long -arch_get_unmapped_area_topdown_vmflags(struct file *filp, unsigned long addr, - unsigned long len, unsigned long pgoff, - unsigned long flags, vm_flags_t vm_flags) -{ - return arch_get_unmapped_area_topdown(filp, addr, len, pgoff, flags); -} -#endif - unsigned long mm_get_unmapped_area_vmflags(struct mm_struct *mm, struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags, vm_flags_t vm_flags) { if (test_bit(MMF_TOPDOWN, &mm->flags)) - return arch_get_unmapped_area_topdown_vmflags(filp, addr, len, pgoff, - flags, vm_flags); - return arch_get_unmapped_area_vmflags(filp, addr, len, pgoff, flags, vm_flags); + return arch_get_unmapped_area_topdown(filp, addr, len, pgoff, + flags, vm_flags); + return arch_get_unmapped_area(filp, addr, len, pgoff, flags, vm_flags); } unsigned long @@ -927,8 +910,8 @@ mm_get_unmapped_area(struct mm_struct *mm, struct file *file, unsigned long pgoff, unsigned long flags) { if (test_bit(MMF_TOPDOWN, &mm->flags)) - return arch_get_unmapped_area_topdown(file, addr, len, pgoff, flags); - return arch_get_unmapped_area(file, addr, len, pgoff, flags); + return arch_get_unmapped_area_topdown(file, addr, len, pgoff, flags, 0); + return arch_get_unmapped_area(file, addr, len, pgoff, flags, 0); } EXPORT_SYMBOL(mm_get_unmapped_area); From 540e00a729dfbefe0aa0d64b6dac1d1b620a1787 Mon Sep 17 00:00:00 2001 From: Mark Brown Date: Wed, 4 Sep 2024 17:58:00 +0100 Subject: [PATCH 360/416] mm: pass vm_flags to generic_get_unmapped_area() In preparation for using vm_flags to ensure guard pages for shadow stacks supply them as an argument to generic_get_unmapped_area(). The only user outside of the core code is the PowerPC book3s64 implementation which is trivially wrapping the generic implementation in the radix_enabled() case. No functional changes. Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-2-a46b8b6dc0ed@kernel.org Signed-off-by: Mark Brown Acked-by: Lorenzo Stoakes Reviewed-by: Liam R. Howlett Acked-by: Michael Ellerman Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Borislav Petkov Cc: Christian Borntraeger Cc: Christophe Leroy Cc: Chris Zankel Cc: Dave Hansen Cc: David S. Miller Cc: "Edgecombe, Rick P" Cc: Gerald Schaefer Cc: Guo Ren Cc: Heiko Carstens Cc: Helge Deller Cc: "H. 
Peter Anvin" Cc: Huacai Chen Cc: Ingo Molnar Cc: Ivan Kokshaysky Cc: James Bottomley Cc: John Paul Adrian Glaubitz Cc: Matt Turner Cc: Max Filippov Cc: Naveen N Rao Cc: Nicholas Piggin Cc: Richard Henderson Cc: Rich Felker Cc: Russell King Cc: Sven Schnelle Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vineet Gupta Cc: Vlastimil Babka Cc: WANG Xuerui Cc: Yoshinori Sato Signed-off-by: Andrew Morton --- arch/powerpc/mm/book3s64/slice.c | 4 ++-- include/linux/sched/mm.h | 4 ++-- mm/mmap.c | 10 ++++++---- 3 files changed, 10 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/mm/book3s64/slice.c b/arch/powerpc/mm/book3s64/slice.c index ada6bf896ef8..87307d0fc3b8 100644 --- a/arch/powerpc/mm/book3s64/slice.c +++ b/arch/powerpc/mm/book3s64/slice.c @@ -641,7 +641,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, vm_flags_t vm_flags) { if (radix_enabled()) - return generic_get_unmapped_area(filp, addr, len, pgoff, flags); + return generic_get_unmapped_area(filp, addr, len, pgoff, flags, vm_flags); return slice_get_unmapped_area(addr, len, flags, mm_ctx_user_psize(¤t->mm->context), 0); @@ -655,7 +655,7 @@ unsigned long arch_get_unmapped_area_topdown(struct file *filp, vm_flags_t vm_flags) { if (radix_enabled()) - return generic_get_unmapped_area_topdown(filp, addr0, len, pgoff, flags); + return generic_get_unmapped_area_topdown(filp, addr0, len, pgoff, flags, vm_flags); return slice_get_unmapped_area(addr0, len, flags, mm_ctx_user_psize(¤t->mm->context), 1); diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index c4d34abc45d4..07bb8d4181d7 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -204,11 +204,11 @@ unsigned long mm_get_unmapped_area_vmflags(struct mm_struct *mm, unsigned long generic_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, - unsigned long flags); + unsigned long flags, vm_flags_t vm_flags); unsigned long generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, - unsigned long flags); + unsigned long flags, vm_flags_t vm_flags); #else static inline void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack) {} diff --git a/mm/mmap.c b/mm/mmap.c index 98249266196e..14d36be65ce7 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -738,7 +738,7 @@ unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info) unsigned long generic_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, - unsigned long flags) + unsigned long flags, vm_flags_t vm_flags) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma, *prev; @@ -772,7 +772,8 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags, vm_flags_t vm_flags) { - return generic_get_unmapped_area(filp, addr, len, pgoff, flags); + return generic_get_unmapped_area(filp, addr, len, pgoff, flags, + vm_flags); } #endif @@ -783,7 +784,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, - unsigned long flags) + unsigned long flags, vm_flags_t vm_flags) { struct vm_area_struct *vma, *prev; struct mm_struct *mm = current->mm; @@ -836,7 +837,8 @@ arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags, vm_flags_t vm_flags) { - return 
generic_get_unmapped_area_topdown(filp, addr, len, pgoff, flags); + return generic_get_unmapped_area_topdown(filp, addr, len, pgoff, flags, + vm_flags); } #endif From df7e1286b1dc3d6cff952cf3ef4f0c36831e2fbb Mon Sep 17 00:00:00 2001 From: Mark Brown Date: Wed, 4 Sep 2024 17:58:01 +0100 Subject: [PATCH 361/416] mm: care about shadow stack guard gap when getting an unmapped area As covered in the commit log for c44357c2e76b ("x86/mm: care about shadow stack guard gap during placement") our current mmap() implementation does not take care to ensure that a new mapping isn't placed with existing mappings inside it's own guard gaps. This is particularly important for shadow stacks since if two shadow stacks end up getting placed adjacent to each other then they can overflow into each other which weakens the protection offered by the feature. On x86 there is a custom arch_get_unmapped_area() which was updated by the above commit to cover this case by specifying a start_gap for allocations with VM_SHADOW_STACK. Both arm64 and RISC-V have equivalent features and use the generic implementation of arch_get_unmapped_area() so let's make the equivalent change there so they also don't get shadow stack pages placed without guard pages. x86 uses a single page guard, this is also sufficient for arm64 where we either do single word pops and pushes or unconstrained writes. Architectures which do not have this feature will define VM_SHADOW_STACK to VM_NONE and hence be unaffected. Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-3-a46b8b6dc0ed@kernel.org Signed-off-by: Mark Brown Suggested-by: Rick Edgecombe Acked-by: Lorenzo Stoakes Cc: Alexander Gordeev Cc: Andreas Larsson Cc: Borislav Petkov Cc: Christian Borntraeger Cc: Christophe Leroy Cc: Chris Zankel Cc: Dave Hansen Cc: David S. Miller Cc: Gerald Schaefer Cc: Guo Ren Cc: Heiko Carstens Cc: Helge Deller Cc: "H. Peter Anvin" Cc: Huacai Chen Cc: Ingo Molnar Cc: Ivan Kokshaysky Cc: James Bottomley Cc: John Paul Adrian Glaubitz Cc: Liam R. Howlett Cc: Matt Turner Cc: Max Filippov Cc: Michael Ellerman Cc: Naveen N Rao Cc: Nicholas Piggin Cc: Richard Henderson Cc: Rich Felker Cc: Russell King Cc: Sven Schnelle Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Vineet Gupta Cc: Vlastimil Babka Cc: WANG Xuerui Cc: Yoshinori Sato Signed-off-by: Andrew Morton --- mm/mmap.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/mm/mmap.c b/mm/mmap.c index 14d36be65ce7..270bd8d504b8 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -702,6 +702,18 @@ retry: return gap; } +/* + * Determine if the allocation needs to ensure that there is no + * existing mapping within it's guard gaps, for use as start_gap. + */ +static inline unsigned long stack_guard_placement(vm_flags_t vm_flags) +{ + if (vm_flags & VM_SHADOW_STACK) + return PAGE_SIZE; + + return 0; +} + /* * Search for an unmapped address range. 
* @@ -763,6 +775,7 @@ generic_get_unmapped_area(struct file *filp, unsigned long addr, info.length = len; info.low_limit = mm->mmap_base; info.high_limit = mmap_end; + info.start_gap = stack_guard_placement(vm_flags); return vm_unmapped_area(&info); } @@ -812,6 +825,7 @@ generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr, info.length = len; info.low_limit = PAGE_SIZE; info.high_limit = arch_get_mmap_base(addr, mm->mmap_base); + info.start_gap = stack_guard_placement(vm_flags); addr = vm_unmapped_area(&info); /* From ec867977fed07b599fd9f24c6950628f88cbc29a Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 4 Sep 2024 20:54:19 +0000 Subject: [PATCH 362/416] mm: page_alloc: fix missed updates of PGFREE in free_unref_{page/folios} PGFREE is currently updated in two code paths: - __free_pages_ok(): for pages freed to the buddy allocator. - free_unref_page_commit(): for pages freed to the pcplists. Before commit df1acc856923 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock"), free_unref_page_commit() used to fallback to freeing isolated pages directly to the buddy allocator through free_one_page(). This was done _after_ updating PGFREE, so the counter was correctly updated. However, that commit moved the fallback logic to its callers (now called free_unref_page() and free_unref_folios()), so PGFREE was no longer updated in this fallback case. Now that the code has developed, there are more cases in free_unref_page() and free_unref_folios() where we fallback to calling free_one_page() (e.g. !pcp_allowed_order(), pcp_spin_trylock() fails). These cases also miss updating PGFREE. To make sure PGFREE is updated in all cases where pages are freed to the buddy allocator, move the update down the stack to free_one_page(). This was noticed through code inspection, although it should be noticeable at runtime (at least with some workloads). 
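For example (an illustration of the accounting, not part of the patch): with the counter living in free_one_page(), freeing a single order-3 page increments PGFREE by 1 << 3 = 8 regardless of which path reached the buddy allocator, whether that was __free_pages_ok() or one of the free_unref_page()/free_unref_folios() fallbacks.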
Link: https://lkml.kernel.org/r/20240904205419.821776-1-yosryahmed@google.com Fixes: df1acc856923 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock") Signed-off-by: Yosry Ahmed Cc: Brendan Jackman Cc: Mel Gorman Cc: Peter Zijlstra Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/page_alloc.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ee377bf5c033..a6adc8f4b7c0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1231,6 +1231,8 @@ static void free_one_page(struct zone *zone, struct page *page, spin_lock_irqsave(&zone->lock, flags); split_large_buddy(zone, page, pfn, order, fpi_flags); spin_unlock_irqrestore(&zone->lock, flags); + + __count_vm_events(PGFREE, 1 << order); } static void __free_pages_ok(struct page *page, unsigned int order, @@ -1239,12 +1241,8 @@ static void __free_pages_ok(struct page *page, unsigned int order, unsigned long pfn = page_to_pfn(page); struct zone *zone = page_zone(page); - if (!free_pages_prepare(page, order)) - return; - - free_one_page(zone, page, pfn, order, fpi_flags); - - __count_vm_events(PGFREE, 1 << order); + if (free_pages_prepare(page, order)) + free_one_page(zone, page, pfn, order, fpi_flags); } void __meminit __free_pages_core(struct page *page, unsigned int order, From 08e28de1160a712724268fd33d77b32f1bc84d1c Mon Sep 17 00:00:00 2001 From: Sven Schnelle Date: Tue, 3 Sep 2024 09:36:28 +0200 Subject: [PATCH 363/416] uprobes: use vm_special_mapping close() functionality The following KASAN splat was shown: [ 44.505448] ================================================================== 20:37:27 [3421/145075] [ 44.505455] BUG: KASAN: slab-use-after-free in special_mapping_close+0x9c/0xc8 [ 44.505471] Read of size 8 at addr 00000000868dac48 by task sh/1384 [ 44.505479] [ 44.505486] CPU: 51 UID: 0 PID: 1384 Comm: sh Not tainted 6.11.0-rc6-next-20240902-dirty #1496 [ 44.505503] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0) [ 44.505508] Call Trace: [ 44.505511] [<000b0324d2f78080>] dump_stack_lvl+0xd0/0x108 [ 44.505521] [<000b0324d2f5435c>] print_address_description.constprop.0+0x34/0x2e0 [ 44.505529] [<000b0324d2f5464c>] print_report+0x44/0x138 [ 44.505536] [<000b0324d1383192>] kasan_report+0xc2/0x140 [ 44.505543] [<000b0324d2f52904>] special_mapping_close+0x9c/0xc8 [ 44.505550] [<000b0324d12c7978>] remove_vma+0x78/0x120 [ 44.505557] [<000b0324d128a2c6>] exit_mmap+0x326/0x750 [ 44.505563] [<000b0324d0ba655a>] __mmput+0x9a/0x370 [ 44.505570] [<000b0324d0bbfbe0>] exit_mm+0x240/0x340 [ 44.505575] [<000b0324d0bc0228>] do_exit+0x548/0xd70 [ 44.505580] [<000b0324d0bc1102>] do_group_exit+0x132/0x390 [ 44.505586] [<000b0324d0bc13b6>] __s390x_sys_exit_group+0x56/0x60 [ 44.505592] [<000b0324d0adcbd6>] do_syscall+0x2f6/0x430 [ 44.505599] [<000b0324d2f78434>] __do_syscall+0xa4/0x170 [ 44.505606] [<000b0324d2f9454c>] system_call+0x74/0x98 [ 44.505614] [ 44.505616] Allocated by task 1384: [ 44.505621] kasan_save_stack+0x40/0x70 [ 44.505630] kasan_save_track+0x28/0x40 [ 44.505636] __kasan_kmalloc+0xa0/0xc0 [ 44.505642] __create_xol_area+0xfa/0x410 [ 44.505648] get_xol_area+0xb0/0xf0 [ 44.505652] uprobe_notify_resume+0x27a/0x470 [ 44.505657] irqentry_exit_to_user_mode+0x15e/0x1d0 [ 44.505664] pgm_check_handler+0x122/0x170 [ 44.505670] [ 44.505672] Freed by task 1384: [ 44.505676] kasan_save_stack+0x40/0x70 [ 44.505682] kasan_save_track+0x28/0x40 [ 44.505687] kasan_save_free_info+0x4a/0x70 [ 44.505693] __kasan_slab_free+0x5a/0x70 [ 44.505698] kfree+0xe8/0x3f0 [ 44.505704] 
__mmput+0x20/0x370 [ 44.505709] exit_mm+0x240/0x340 [ 44.505713] do_exit+0x548/0xd70 [ 44.505718] do_group_exit+0x132/0x390 [ 44.505722] __s390x_sys_exit_group+0x56/0x60 [ 44.505727] do_syscall+0x2f6/0x430 [ 44.505732] __do_syscall+0xa4/0x170 [ 44.505738] system_call+0x74/0x98 The problem is that uprobe_clear_state() kfrees struct xol_area, which contains struct vm_special_mapping *xol_mapping. This one is passed to _install_special_mapping() in xol_add_vma(). __mmput() reads: static inline void __mmput(struct mm_struct *mm) { VM_BUG_ON(atomic_read(&mm->mm_users)); uprobe_clear_state(mm); exit_aio(mm); ksm_exit(mm); khugepaged_exit(mm); /* must run before exit_mmap */ exit_mmap(mm); ... } So uprobe_clear_state() at the beginning frees the memory area containing the vm_special_mapping data, but exit_mmap() uses this address later via vma->vm_private_data (which was set in _install_special_mapping()). Fix this by moving uprobe_clear_state() to uprobes.c and using it as the close() callback. [usama.anjum@collabora.com: remove unneeded condition] Link: https://lkml.kernel.org/r/20240906101825.177490-1-usama.anjum@collabora.com Link: https://lkml.kernel.org/r/20240903073629.2442754-1-svens@linux.ibm.com Fixes: 223febc6e557 ("mm: add optional close() to struct vm_special_mapping") Signed-off-by: Sven Schnelle Suggested-by: Linus Torvalds Cc: Adrian Hunter Cc: Alexander Shishkin Cc: Arnaldo Carvalho de Melo Cc: Ian Rogers Cc: Ingo Molnar Cc: Jiri Olsa Cc: Kan Liang Cc: Mark Rutland Cc: Masami Hiramatsu Cc: Michael Ellerman Cc: Namhyung Kim Cc: Oleg Nesterov Cc: Peter Zijlstra Signed-off-by: Andrew Morton --- include/linux/uprobes.h | 1 - kernel/events/uprobes.c | 36 +++++++++++++++++------------------- kernel/fork.c | 1 - 3 files changed, 17 insertions(+), 21 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index b503fafb7fb3..493dc95d912c 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -126,7 +126,6 @@ extern int uprobe_pre_sstep_notifier(struct pt_regs *regs); extern void uprobe_notify_resume(struct pt_regs *regs); extern bool uprobe_deny_signal(void); extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs); -extern void uprobe_clear_state(struct mm_struct *mm); extern int arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr); extern int arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs); extern int arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs); diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 73cc47708679..c2d6b2d64de2 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1482,6 +1482,22 @@ void * __weak arch_uprobe_trampoline(unsigned long *psize) return &insn; } +/* + * uprobe_clear_state - Free the area allocated for slots.
+ */ +static void uprobe_clear_state(const struct vm_special_mapping *sm, struct vm_area_struct *vma) +{ + struct xol_area *area = container_of(vma->vm_private_data, struct xol_area, xol_mapping); + + mutex_lock(&delayed_uprobe_lock); + delayed_uprobe_remove(NULL, vma->vm_mm); + mutex_unlock(&delayed_uprobe_lock); + + put_page(area->pages[0]); + kfree(area->bitmap); + kfree(area); +} + static struct xol_area *__create_xol_area(unsigned long vaddr) { struct mm_struct *mm = current->mm; @@ -1500,6 +1516,7 @@ static struct xol_area *__create_xol_area(unsigned long vaddr) area->xol_mapping.name = "[uprobes]"; area->xol_mapping.fault = NULL; + area->xol_mapping.close = uprobe_clear_state; area->xol_mapping.pages = area->pages; area->pages[0] = alloc_page(GFP_HIGHUSER); if (!area->pages[0]) @@ -1545,25 +1562,6 @@ static struct xol_area *get_xol_area(void) return area; } -/* - * uprobe_clear_state - Free the area allocated for slots. - */ -void uprobe_clear_state(struct mm_struct *mm) -{ - struct xol_area *area = mm->uprobes_state.xol_area; - - mutex_lock(&delayed_uprobe_lock); - delayed_uprobe_remove(NULL, mm); - mutex_unlock(&delayed_uprobe_lock); - - if (!area) - return; - - put_page(area->pages[0]); - kfree(area->bitmap); - kfree(area); -} - void uprobe_start_dup_mmap(void) { percpu_down_read(&dup_mmap_sem); diff --git a/kernel/fork.c b/kernel/fork.c index 3d590e51ce84..61070248a7d3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1338,7 +1338,6 @@ static inline void __mmput(struct mm_struct *mm) { VM_BUG_ON(atomic_read(&mm->mm_users)); - uprobe_clear_state(mm); exit_aio(mm); ksm_exit(mm); khugepaged_exit(mm); /* must run before exit_mmap */ From 25e8acbcf19c56fbf0d80615c8f409586aabbf86 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Thu, 5 Sep 2024 09:24:23 -0700 Subject: [PATCH 364/416] mm/damon/tests/core-kunit: skip damon_test_nr_accesses_to_accesses_bp() if aggr_interval is zero The aggregation interval of test purpose damon_attrs for damon_test_nr_accesses_to_accesses_bp() becomes zero on 32 bit architecture, since size of int and long types are same. As a result, damon_nr_accesses_to_accesses_bp() call with the test data triggers divide-by-zero exception. damon_nr_accesses_to_accesses_bp() shouldn't be called with such data, and the non-test code avoids that by checking the case on damon_update_monitoring_results(). Skip the test code in the case, and add an explicit caution of the case on the comment for the test target function. Link: https://lkml.kernel.org/r/20240905162423.74053-1-sj@kernel.org Fixes: 5e06ad590096 ("mm/damon/core-test: test max_nr_accesses overflow caused divide-by-zero") Signed-off-by: SeongJae Park Reported-by: Guenter Roeck Closes: https://lore.kernel.org/c771b962-a58f-435b-89e4-1211a9323181@roeck-us.net Cc: Brendan Higgins Cc: David Gow Cc: Guenter Roeck Signed-off-by: Andrew Morton --- mm/damon/core.c | 8 +++++++- mm/damon/tests/core-kunit.h | 12 ++++++++++++ 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/mm/damon/core.c b/mm/damon/core.c index 8b99c5a99c38..a83f3b736d51 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -552,7 +552,13 @@ static unsigned int damon_accesses_bp_to_nr_accesses( return accesses_bp * damon_max_nr_accesses(attrs) / 10000; } -/* convert nr_accesses to access ratio in bp (per 10,000) */ +/* + * Convert nr_accesses to access ratio in bp (per 10,000). + * + * Callers should ensure attrs.aggr_interval is not zero, like + * damon_update_monitoring_results() does . Otherwise, divide-by-zero would + * happen. 
+ */ static unsigned int damon_nr_accesses_to_accesses_bp( unsigned int nr_accesses, struct damon_attrs *attrs) { diff --git a/mm/damon/tests/core-kunit.h b/mm/damon/tests/core-kunit.h index ae03df71737e..cf22e09a3507 100644 --- a/mm/damon/tests/core-kunit.h +++ b/mm/damon/tests/core-kunit.h @@ -320,6 +320,18 @@ static void damon_test_nr_accesses_to_accesses_bp(struct kunit *test) .aggr_interval = ((unsigned long)UINT_MAX + 1) * 10 }; + /* + * In some cases such as 32bit architectures where UINT_MAX is + * ULONG_MAX, attrs.aggr_interval becomes zero. Calling + * damon_nr_accesses_to_accesses_bp() in the case will cause + * divide-by-zero. Such case is prohibited in normal execution since + * the caution is documented on the comment for the function, and + * damon_update_monitoring_results() does the check. Skip the test in + * the case. + */ + if (!attrs.aggr_interval) + kunit_skip(test, "aggr_interval is zero."); + KUNIT_EXPECT_EQ(test, damon_nr_accesses_to_accesses_bp(123, &attrs), 0); } From 46dcc7c92e63879a49cfbd99949858df0335a122 Mon Sep 17 00:00:00 2001 From: Nanyong Sun Date: Thu, 5 Sep 2024 23:31:18 +0800 Subject: [PATCH 365/416] mm: migrate: simplify find_mm_struct() Use find_get_task_by_vpid() to replace the open-coded task_struct lookup in find_mm_struct(). Note that this patch moves the ptrace_may_access() call out of the rcu_read_lock() scope; this is fine because the call does not need RCU protection: find_get_task_by_vpid() already obtains the pid and task safely, so ptrace_may_access() can use the task safely, similar to what sched_core_share_pid() does. Link: https://lkml.kernel.org/r/20240905153118.1205173-1-sunnanyong@huawei.com Signed-off-by: Nanyong Sun Cc: Kefeng Wang Signed-off-by: Andrew Morton --- mm/migrate.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 35cc9d35064b..6a68d9923820 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2505,25 +2505,19 @@ static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes) return current->mm; } - /* Find the mm_struct */ - rcu_read_lock(); - task = find_task_by_vpid(pid); + task = find_get_task_by_vpid(pid); if (!task) { - rcu_read_unlock(); return ERR_PTR(-ESRCH); } - get_task_struct(task); /* * Check if this process has the right to modify the specified * process. Use the regular "ptrace_may_access()" checks. */ if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) { - rcu_read_unlock(); mm = ERR_PTR(-EPERM); goto out; } - rcu_read_unlock(); mm = ERR_PTR(security_task_movememory(task)); if (IS_ERR(mm)) From e4bfc678579eb32360cdd09503c8fe62f943bc3f Mon Sep 17 00:00:00 2001 From: Nanyong Sun Date: Thu, 5 Sep 2024 23:30:28 +0800 Subject: [PATCH 366/416] mm: thp: simplify split_huge_pages_pid() The helper find_get_task_by_vpid() can completely replace the task_struct lookup logic in split_huge_pages_pid(), so use it to simplify the code. Also delete the now-needless comments, since the helper function's name already explains what it does.
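For context, the open-coded lookup that this and the previous patch replace looks roughly like the following (a simplified sketch, not a literal excerpt from the kernel sources):

	rcu_read_lock();
	task = find_task_by_vpid(pid);
	if (task)
		get_task_struct(task);
	rcu_read_unlock();

find_get_task_by_vpid() bundles the RCU-protected lookup and the reference grab into a single helper that returns either a referenced task_struct or NULL, which is why the explicit rcu_read_lock()/rcu_read_unlock() and get_task_struct() calls can simply be dropped at the call sites.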
Link: https://lkml.kernel.org/r/20240905153028.1205128-1-sunnanyong@huawei.com Signed-off-by: Nanyong Sun Cc: Kefeng Wang Signed-off-by: Andrew Morton --- mm/huge_memory.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 77092581f90d..f15f7faf2a63 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3819,16 +3819,11 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start, vaddr_start &= PAGE_MASK; vaddr_end &= PAGE_MASK; - /* Find the task_struct from pid */ - rcu_read_lock(); - task = find_task_by_vpid(pid); + task = find_get_task_by_vpid(pid); if (!task) { - rcu_read_unlock(); ret = -ESRCH; goto out; } - get_task_struct(task); - rcu_read_unlock(); /* Find the mm_struct */ mm = get_task_mm(task); From cfc8193898ca4511e6c3cc20101ef3554eda8efb Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Thu, 5 Sep 2024 23:24:32 +0800 Subject: [PATCH 367/416] mm: migrate: remove unused includes random.h is not needed since commit 6c542ab75714 ("mm/demotion: build demotion targets based on explicit memory tiers"), as all those functions were moved into memory-tiers. nsproxy.h is not needed since commit 228ebcbe634a ("Uninline find_task_by_xxx set of functions"); nothing from nsproxy is used and we only call find_task_by_vpid() now. hugetlb_cgroup.h is not needed since commit ab5ac90aecf5 ("mm, hugetlb: do not rely on overcommit limit during migration"); move_hugetlb_state() is called instead, and it belongs to hugetlb.h, which is already included. balloon_compaction.h can be dropped since we now have the more general movable_operations for non-lru movable page migration. memremap.h, userfaultfd_k.h and oom.h were introduced for zone device page migration, but all those functions have since moved into migrate_device.c, so they are no longer needed either. Link: https://lkml.kernel.org/r/20240905152432.626877-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Signed-off-by: Andrew Morton --- mm/migrate.c | 7 ------- 1 file changed, 7 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 6a68d9923820..0f6b78fd73aa 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -20,7 +20,6 @@ #include #include #include -#include #include #include #include @@ -35,19 +34,13 @@ #include #include #include -#include #include #include -#include -#include -#include #include #include #include #include -#include #include -#include #include #include #include From 6e94da943ba3ebd8c603872010ff6efcb682fe9c Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Thu, 5 Sep 2024 14:22:20 -0700 Subject: [PATCH 368/416] mm/page_alloc: fix build with CONFIG_UNACCEPTED_MEMORY=n When has_unaccepted_memory() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: mm/page_alloc.c:7036:20: error: unused function 'has_unaccepted_memory' [-Werror,-Wunused-function] 7036 | static inline bool has_unaccepted_memory(void) | ^~~~~~~~~~~~~~~~~~~~~ Fix it by removing the CONFIG_UNACCEPTED_MEMORY=n stub.
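To illustrate the warning class (a minimal sketch, not the kernel code): a static function left without any caller in its translation unit, e.g.

	static inline bool has_unaccepted_memory(void)
	{
		return false;	/* stub for the CONFIG_UNACCEPTED_MEMORY=n case */
	}

trips clang's -Wunused-function under the kernel's W=1 extra warnings, and CONFIG_WERROR=y turns that warning into the build error quoted above. Removing the now-unused stub avoids the problem without adding annotations.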
Link: https://lkml.kernel.org/r/20240905142220.49d93337a0abce5690e515d9@linux-foundation.org Reported-by: Andy Shevchenko Closes: https://lkml.kernel.org/r/20240905171553.275054-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andrew Morton --- mm/page_alloc.c | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a6adc8f4b7c0..74f13f676985 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -287,7 +287,6 @@ EXPORT_SYMBOL(nr_online_nodes); static bool page_contains_unaccepted(struct page *page, unsigned int order); static bool cond_accept_memory(struct zone *zone, unsigned int order); -static inline bool has_unaccepted_memory(void); static bool __free_unaccepted(struct page *page); int page_group_by_mobility_disabled __read_mostly; @@ -7061,6 +7060,11 @@ static bool try_to_accept_memory_one(struct zone *zone) return true; } +static inline bool has_unaccepted_memory(void) +{ + return static_branch_unlikely(&zones_with_unaccepted_pages); +} + static bool cond_accept_memory(struct zone *zone, unsigned int order) { long to_accept; @@ -7088,11 +7092,6 @@ static bool cond_accept_memory(struct zone *zone, unsigned int order) return ret; } -static inline bool has_unaccepted_memory(void) -{ - return static_branch_unlikely(&zones_with_unaccepted_pages); -} - static bool __free_unaccepted(struct page *page) { struct zone *zone = page_zone(page); @@ -7128,11 +7127,6 @@ static bool cond_accept_memory(struct zone *zone, unsigned int order) return false; } -static inline bool has_unaccepted_memory(void) -{ - return false; -} - static bool __free_unaccepted(struct page *page) { BUILD_BUG(); From 0e40cf2a8b2c847950e025d5aa594bd545118d26 Mon Sep 17 00:00:00 2001 From: Kinsey Ho Date: Thu, 5 Sep 2024 00:30:50 +0000 Subject: [PATCH 369/416] cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "Improve mem_cgroup_iter()", v4. Incremental cgroup iteration is being used again [1]. This patchset improves the reliability of mem_cgroup_iter(). It also improves simplicity and code readability. [1] https://lore.kernel.org/20240514202641.2821494-1-hannes@cmpxchg.org/ This patch (of 5): Explicitly document that css sibling/descendant linkage is protected by cgroup_mutex or RCU. Also, document in css_next_descendant_pre() and similar functions that it isn't necessary to hold a ref on @pos. The following changes in this patchset rely on this clarification for simplification in memcg iteration code. Link: https://lkml.kernel.org/r/20240905003058.1859929-1-kinseyho@google.com Link: https://lkml.kernel.org/r/20240905003058.1859929-2-kinseyho@google.com Suggested-by: Yosry Ahmed Reviewed-by: Michal Koutný Signed-off-by: Kinsey Ho Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Shakeel Butt Cc: Tejun Heo Cc: Zefan Li Cc: Hugh Dickins Cc: T.J. 
Mercier Signed-off-by: Andrew Morton --- include/linux/cgroup-defs.h | 6 +++++- kernel/cgroup/cgroup.c | 16 +++++++++------- 2 files changed, 14 insertions(+), 8 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 7fc2d0195f56..ca7e912b8355 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -172,7 +172,11 @@ struct cgroup_subsys_state { /* reference count - access via css_[try]get() and css_put() */ struct percpu_ref refcnt; - /* siblings list anchored at the parent's ->children */ + /* + * siblings list anchored at the parent's ->children + * + * linkage is protected by cgroup_mutex or RCU + */ struct list_head sibling; struct list_head children; diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 0a97cb2ef124..ece2316e2bca 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -4602,8 +4602,9 @@ struct cgroup_subsys_state *css_next_child(struct cgroup_subsys_state *pos, * * While this function requires cgroup_mutex or RCU read locking, it * doesn't require the whole traversal to be contained in a single critical - * section. This function will return the correct next descendant as long - * as both @pos and @root are accessible and @pos is a descendant of @root. + * section. Additionally, it isn't necessary to hold onto a reference to @pos. + * This function will return the correct next descendant as long as both @pos + * and @root are accessible and @pos is a descendant of @root. * * If a subsystem synchronizes ->css_online() and the start of iteration, a * css which finished ->css_online() is guaranteed to be visible in the @@ -4651,8 +4652,9 @@ EXPORT_SYMBOL_GPL(css_next_descendant_pre); * * While this function requires cgroup_mutex or RCU read locking, it * doesn't require the whole traversal to be contained in a single critical - * section. This function will return the correct rightmost descendant as - * long as @pos is accessible. + * section. Additionally, it isn't necessary to hold onto a reference to @pos. + * This function will return the correct rightmost descendant as long as @pos + * is accessible. */ struct cgroup_subsys_state * css_rightmost_descendant(struct cgroup_subsys_state *pos) @@ -4696,9 +4698,9 @@ css_leftmost_descendant(struct cgroup_subsys_state *pos) * * While this function requires cgroup_mutex or RCU read locking, it * doesn't require the whole traversal to be contained in a single critical - * section. This function will return the correct next descendant as long - * as both @pos and @cgroup are accessible and @pos is a descendant of - * @cgroup. + * section. Additionally, it isn't necessary to hold onto a reference to @pos. + * This function will return the correct next descendant as long as both @pos + * and @cgroup are accessible and @pos is a descendant of @cgroup. * * If a subsystem synchronizes ->css_online() and the start of iteration, a * css which finished ->css_online() is guaranteed to be visible in the From 4a2698b0133b5c477515ccbedff169fec722d0e7 Mon Sep 17 00:00:00 2001 From: Kinsey Ho Date: Thu, 5 Sep 2024 00:30:51 +0000 Subject: [PATCH 370/416] mm: don't hold css->refcnt during traversal MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit To obtain the pointer to the next memcg position, mem_cgroup_iter() currently holds css->refcnt during memcg traversal only to put css->refcnt at the end of the routine. This isn't necessary as an rcu_read_lock is already held throughout the function. 
The use of the RCU read lock with css_next_descendant_pre() guarantees that sibling linkage is safe without holding a ref on the passed-in @css. Remove css->refcnt usage during traversal by leveraging RCU. Link: https://lkml.kernel.org/r/20240905003058.1859929-3-kinseyho@google.com Signed-off-by: Kinsey Ho Reviewed-by: T.J. Mercier Cc: Johannes Weiner Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Roman Gushchin Cc: Shakeel Butt Cc: Tejun Heo Cc: Yosry Ahmed Cc: Zefan Li Cc: Hugh Dickins Signed-off-by: Andrew Morton --- mm/memcontrol.c | 18 +----------------- 1 file changed, 1 insertion(+), 17 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5abff3b0cae5..8f6cf651cc79 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1013,20 +1013,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, else if (reclaim->generation != iter->generation) goto out_unlock; - while (1) { - pos = READ_ONCE(iter->position); - if (!pos || css_tryget(&pos->css)) - break; - /* - * css reference reached zero, so iter->position will - * be cleared by ->css_released. However, we should not - * rely on this happening soon, because ->css_released - * is called from a work queue, and by busy-waiting we - * might block it. So we clear iter->position right - * away. - */ - (void)cmpxchg(&iter->position, pos, NULL); - } + pos = READ_ONCE(iter->position); } else if (prev) { pos = prev; } @@ -1067,9 +1054,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, */ (void)cmpxchg(&iter->position, pos, memcg); - if (pos) - css_put(&pos->css); - if (!memcg) iter->generation++; } From 3d150e31a1f62f0b39bdd8b968f563f8f3915d7a Mon Sep 17 00:00:00 2001 From: Kinsey Ho Date: Thu, 5 Sep 2024 00:30:52 +0000 Subject: [PATCH 371/416] mm: increment gen # before restarting traversal MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The generation number in struct mem_cgroup_reclaim_iter should be incremented on every round-trip. Currently, it is possible for a concurrent reclaimer to jump in at the end of the hierarchy, causing a traversal restart (resetting the iteration position) without incrementing the generation number. By resetting the position without incrementing the generation, it's possible for another ongoing mem_cgroup_iter() thread to walk the tree twice. Move the traversal restart such that the generation number is incremented before the restart. Link: https://lkml.kernel.org/r/20240905003058.1859929-4-kinseyho@google.com Signed-off-by: Kinsey Ho Reviewed-by: T.J. Mercier Cc: Johannes Weiner Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Roman Gushchin Cc: Shakeel Butt Cc: Tejun Heo Cc: Yosry Ahmed Cc: Zefan Li Cc: Hugh Dickins Signed-off-by: Andrew Morton --- mm/memcontrol.c | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8f6cf651cc79..38e3251ed482 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -997,7 +997,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, root = root_mem_cgroup; rcu_read_lock(); - +restart: if (reclaim) { struct mem_cgroup_per_node *mz; @@ -1024,14 +1024,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, for (;;) { css = css_next_descendant_pre(css, &root->css); if (!css) { - /* - * Reclaimers share the hierarchy walk, and a - * new one might jump in right at the end of - * the hierarchy - make sure they see at least - * one group and restart from the beginning. 
- */ - if (!prev) - continue; break; } @@ -1054,8 +1046,18 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, */ (void)cmpxchg(&iter->position, pos, memcg); - if (!memcg) + if (!memcg) { iter->generation++; + + /* + * Reclaimers share the hierarchy walk, and a + * new one might jump in right at the end of + * the hierarchy - make sure they see at least + * one group and restart from the beginning. + */ + if (!prev) + goto restart; + } } out_unlock: From ec0db74b4b1f249ffca4df450f54c17573114045 Mon Sep 17 00:00:00 2001 From: Kinsey Ho Date: Thu, 5 Sep 2024 00:30:53 +0000 Subject: [PATCH 372/416] mm: restart if multiple traversals raced MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, if multiple reclaimers raced on the same position, the reclaimers which detect the race will still reclaim from the same memcg. Instead, the reclaimers which detect the race should move on to the next memcg in the hierarchy. So, in the case where multiple traversals race, jump back to the start of the mem_cgroup_iter() function to find the next memcg in the hierarchy to reclaim from. Link: https://lkml.kernel.org/r/20240905003058.1859929-5-kinseyho@google.com Reported-by: syzbot+e099d407346c45275ce9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/000000000000817cf10620e20d33@google.com/ Signed-off-by: Kinsey Ho Reviewed-by: T.J. Mercier Cc: Johannes Weiner Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Roman Gushchin Cc: Shakeel Butt Cc: Tejun Heo Cc: Yosry Ahmed Cc: Zefan Li Cc: Hugh Dickins Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 4 ++-- mm/memcontrol.c | 26 +++++++++++++++++--------- 2 files changed, 19 insertions(+), 11 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index fe05fdb92779..2ef94c74847d 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -57,7 +57,7 @@ enum memcg_memory_event { struct mem_cgroup_reclaim_cookie { pg_data_t *pgdat; - unsigned int generation; + int generation; }; #ifdef CONFIG_MEMCG @@ -78,7 +78,7 @@ struct lruvec_stats; struct mem_cgroup_reclaim_iter { struct mem_cgroup *position; /* scan generation, increased every round-trip */ - unsigned int generation; + atomic_t generation; }; /* diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 38e3251ed482..36cbc7730745 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -986,8 +986,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup_reclaim_cookie *reclaim) { struct mem_cgroup_reclaim_iter *iter; - struct cgroup_subsys_state *css = NULL; - struct mem_cgroup *memcg = NULL; + struct cgroup_subsys_state *css; + struct mem_cgroup *memcg; struct mem_cgroup *pos = NULL; if (mem_cgroup_disabled()) @@ -998,19 +998,23 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, rcu_read_lock(); restart: + memcg = NULL; + if (reclaim) { + int gen; struct mem_cgroup_per_node *mz; mz = root->nodeinfo[reclaim->pgdat->node_id]; iter = &mz->iter; + gen = atomic_read(&iter->generation); /* * On start, join the current reclaim iteration cycle. * Exit when a concurrent walker completes it. */ if (!prev) - reclaim->generation = iter->generation; - else if (reclaim->generation != iter->generation) + reclaim->generation = gen; + else if (reclaim->generation != gen) goto out_unlock; pos = READ_ONCE(iter->position); @@ -1018,8 +1022,7 @@ restart: pos = prev; } - if (pos) - css = &pos->css; + css = pos ? 
&pos->css : NULL; for (;;) { css = css_next_descendant_pre(css, &root->css); @@ -1033,21 +1036,26 @@ restart: * and kicking, and don't take an extra reference. */ if (css == &root->css || css_tryget(css)) { - memcg = mem_cgroup_from_css(css); break; } } + memcg = mem_cgroup_from_css(css); + if (reclaim) { /* * The position could have already been updated by a competing * thread, so check that the value hasn't changed since we read * it to avoid reclaiming from the same cgroup twice. */ - (void)cmpxchg(&iter->position, pos, memcg); + if (cmpxchg(&iter->position, pos, memcg) != pos) { + if (css && css != &root->css) + css_put(css); + goto restart; + } if (!memcg) { - iter->generation++; + atomic_inc(&iter->generation); /* * Reclaimers share the hierarchy walk, and a From aa50b501c0529677c27211c79c5b81de60af20a1 Mon Sep 17 00:00:00 2001 From: Kinsey Ho Date: Thu, 5 Sep 2024 00:30:54 +0000 Subject: [PATCH 373/416] mm: clean up mem_cgroup_iter() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A clean up to make variable names more clear and to improve code readability. No functional change. Link: https://lkml.kernel.org/r/20240905003058.1859929-6-kinseyho@google.com Signed-off-by: Kinsey Ho Reviewed-by: T.J. Mercier Cc: Johannes Weiner Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Roman Gushchin Cc: Shakeel Butt Cc: Tejun Heo Cc: Yosry Ahmed Cc: Zefan Li Cc: Hugh Dickins Signed-off-by: Andrew Morton --- mm/memcontrol.c | 32 ++++++++++++-------------------- 1 file changed, 12 insertions(+), 20 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 36cbc7730745..d632650c6a2b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -987,8 +987,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, { struct mem_cgroup_reclaim_iter *iter; struct cgroup_subsys_state *css; - struct mem_cgroup *memcg; - struct mem_cgroup *pos = NULL; + struct mem_cgroup *pos; + struct mem_cgroup *next; if (mem_cgroup_disabled()) return NULL; @@ -998,14 +998,13 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, rcu_read_lock(); restart: - memcg = NULL; + next = NULL; if (reclaim) { int gen; - struct mem_cgroup_per_node *mz; + int nid = reclaim->pgdat->node_id; - mz = root->nodeinfo[reclaim->pgdat->node_id]; - iter = &mz->iter; + iter = &root->nodeinfo[nid]->iter; gen = atomic_read(&iter->generation); /* @@ -1018,29 +1017,22 @@ restart: goto out_unlock; pos = READ_ONCE(iter->position); - } else if (prev) { + } else pos = prev; - } css = pos ? &pos->css : NULL; - for (;;) { - css = css_next_descendant_pre(css, &root->css); - if (!css) { - break; - } - + while ((css = css_next_descendant_pre(css, &root->css))) { /* * Verify the css and acquire a reference. The root * is provided by the caller, so we know it's alive * and kicking, and don't take an extra reference. */ - if (css == &root->css || css_tryget(css)) { + if (css == &root->css || css_tryget(css)) break; - } } - memcg = mem_cgroup_from_css(css); + next = mem_cgroup_from_css(css); if (reclaim) { /* @@ -1048,13 +1040,13 @@ restart: * thread, so check that the value hasn't changed since we read * it to avoid reclaiming from the same cgroup twice. 
*/ - if (cmpxchg(&iter->position, pos, memcg) != pos) { + if (cmpxchg(&iter->position, pos, next) != pos) { if (css && css != &root->css) css_put(css); goto restart; } - if (!memcg) { + if (!next) { atomic_inc(&iter->generation); /* @@ -1073,7 +1065,7 @@ out_unlock: if (prev && prev != root) css_put(&prev->css); - return memcg; + return next; } /** From 1930c6ad93ad01f82bb7965bbc04eb5a763f856d Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Fri, 6 Sep 2024 22:15:06 -0400 Subject: [PATCH 374/416] maple_tree: mark three functions as __maybe_unused People keep trying to remove three functions that are going to be used in a feature that is being developed. Dropping the functions entirely may end up with people trying to use the bit for other uses, as people have tried in the past. Adding __maybe_unused stops compilers complaining about the unused functions so they can be silently optimised out of the compiled code and people won't try to claim the bit for another use. Link: https://lore.kernel.org/all/20230726080916.17454-2-zhangpeng.00@bytedance.com/ Link: https://lore.kernel.org/all/202408310728.S7EE59BN-lkp@intel.com/ Link: https://lkml.kernel.org/r/20240907021506.4018676-1-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Lorenzo Stoakes Reviewed-by: Kuan-Wei Chiu Signed-off-by: Andrew Morton --- lib/maple_tree.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 2058452e5ff3..2a57e489b206 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -348,17 +348,17 @@ static inline void *mte_safe_root(const struct maple_enode *node) return (void *)((unsigned long)node & ~MAPLE_ROOT_NODE); } -static inline void *mte_set_full(const struct maple_enode *node) +static inline void __maybe_unused *mte_set_full(const struct maple_enode *node) { return (void *)((unsigned long)node & ~MAPLE_ENODE_NULL); } -static inline void *mte_clear_full(const struct maple_enode *node) +static inline void __maybe_unused *mte_clear_full(const struct maple_enode *node) { return (void *)((unsigned long)node | MAPLE_ENODE_NULL); } -static inline bool mte_has_null(const struct maple_enode *node) +static inline bool __maybe_unused mte_has_null(const struct maple_enode *node) { return (unsigned long)node & MAPLE_ENODE_NULL; } From 354a595a4a4d9dfc0d3e5703c6c5520e6c2f52d8 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Fri, 6 Sep 2024 16:05:12 -0700 Subject: [PATCH 375/416] mm: replace xa_get_order with xas_get_order where appropriate The tracing of invalidation and truncation operations on large files showed that xa_get_order() is among the top functions where kernel spends a lot of CPUs. xa_get_order() needs to traverse the tree to reach the right node for a given index and then extract the order of the entry. However it seems like at many places it is being called within an already happening tree traversal where there is no need to do another traversal. Just use xas_get_order() at those places. Link: https://lkml.kernel.org/r/20240906230512.124643-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt Reviewed-by: Liam R. 
Howlett Cc: Baolin Wang Cc: Hugh Dickins Cc: Matthew Wilcox Cc: Nhat Pham Signed-off-by: Andrew Morton --- mm/filemap.c | 6 +++--- mm/shmem.c | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 7eb4637ea199..6f14d80feb37 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2112,7 +2112,7 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start, VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index), folio); } else { - nr = 1 << xa_get_order(&mapping->i_pages, xas.xa_index); + nr = 1 << xas_get_order(&xas); base = xas.xa_index & ~(nr - 1); /* Omit order>0 value which begins before the start */ if (base < *start) @@ -3001,7 +3001,7 @@ unlock: static inline size_t seek_folio_size(struct xa_state *xas, struct folio *folio) { if (xa_is_value(folio)) - return PAGE_SIZE << xa_get_order(xas->xa, xas->xa_index); + return PAGE_SIZE << xas_get_order(xas); return folio_size(folio); } @@ -4297,7 +4297,7 @@ static void filemap_cachestat(struct address_space *mapping, if (xas_retry(&xas, folio)) continue; - order = xa_get_order(xas.xa, xas.xa_index); + order = xas_get_order(&xas); nr_pages = 1 << order; folio_first_index = round_down(xas.xa_index, 1 << order); folio_last_index = folio_first_index + nr_pages - 1; diff --git a/mm/shmem.c b/mm/shmem.c index 74f093d88c78..361affdf3990 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -890,7 +890,7 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping, if (xas_retry(&xas, page)) continue; if (xa_is_value(page)) - swapped += 1 << xa_get_order(xas.xa, xas.xa_index); + swapped += 1 << xas_get_order(&xas); if (xas.xa_index == max) break; if (need_resched()) { From b44f71e3fa35e1d8076b73316abff155ef1aa26c Mon Sep 17 00:00:00 2001 From: ZhangPeng Date: Fri, 6 Sep 2024 18:25:39 +0800 Subject: [PATCH 376/416] mm/vmalloc.c: use helper function va_size() Use helper function va_size() to improve code readability. No functional modification involved. 
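For reference, a minimal sketch of the helper being adopted here; va_size() is assumed to be the existing vmalloc helper that returns the span of a vmap_area, i.e. exactly the open-coded expression it replaces at every call site below (illustrative, not part of this diff):

static inline unsigned long va_size(struct vmap_area *va)
{
	return va->va_end - va->va_start;
}
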
Link: https://lkml.kernel.org/r/20240906102539.3537207-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng Reviewed-by: Uladzislau Rezki (Sony) Cc: Christoph Hellwig Signed-off-by: Andrew Morton --- mm/vmalloc.c | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index efba551d81aa..cf2af80157bc 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1940,7 +1940,7 @@ static inline void setup_vmalloc_vm(struct vm_struct *vm, { vm->flags = flags; vm->addr = (void *)va->va_start; - vm->size = va->va_end - va->va_start; + vm->size = va_size(va); vm->caller = caller; va->vm = vm; } @@ -2018,7 +2018,7 @@ retry: if (vm) { vm->addr = (void *)va->va_start; - vm->size = va->va_end - va->va_start; + vm->size = va_size(va); va->vm = vm; } @@ -2192,7 +2192,7 @@ static void purge_vmap_node(struct work_struct *work) vn->nr_purged = 0; list_for_each_entry_safe(va, n_va, &vn->purge_list, list) { - unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT; + unsigned long nr = va_size(va) >> PAGE_SHIFT; unsigned long orig_start = va->va_start; unsigned long orig_end = va->va_end; unsigned int vn_id = decode_vn_id(va->flags); @@ -2336,8 +2336,8 @@ static void free_vmap_area_noflush(struct vmap_area *va) if (WARN_ON_ONCE(!list_empty(&va->list))) return; - nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >> - PAGE_SHIFT, &vmap_lazy_nr); + nr_lazy = atomic_long_add_return(va_size(va) >> PAGE_SHIFT, + &vmap_lazy_nr); /* * If it was request by a certain node we would like to @@ -2933,8 +2933,7 @@ void vm_unmap_ram(const void *mem, unsigned int count) if (WARN_ON_ONCE(!va)) return; - debug_check_no_locks_freed((void *)va->va_start, - (va->va_end - va->va_start)); + debug_check_no_locks_freed((void *)va->va_start, va_size(va)); free_unmap_vmap_area(va); } EXPORT_SYMBOL(vm_unmap_ram); @@ -4932,7 +4931,7 @@ static void show_purge_info(struct seq_file *m) list_for_each_entry(va, &vn->lazy.head, list) { seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n", (void *)va->va_start, (void *)va->va_end, - va->va_end - va->va_start); + va_size(va)); } spin_unlock(&vn->lazy.lock); } @@ -4954,7 +4953,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p) if (va->flags & VMAP_RAM) seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n", (void *)va->va_start, (void *)va->va_end, - va->va_end - va->va_start); + va_size(va)); continue; } From 6004fe001d6c88ad8c51e1779989e69694cc9ced Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Fri, 6 Sep 2024 11:50:49 +0200 Subject: [PATCH 377/416] mm/vmalloc.c: use "high-order" in description non 0-order pages In many places, in the comments, we use both "higher-order" and "high-order" to describe the non 0-order pages. That is confusing, because a "higher-order" statement does not reflect what it is compared with. Link: https://lkml.kernel.org/r/20240906095049.3486-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) Suggested-by: Baoquan He Reviewed-by: Baoquan He Cc: Christoph Hellwig Cc: Michal Hocko Cc: Oleksiy Avramchenko Signed-off-by: Andrew Morton --- mm/vmalloc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index cf2af80157bc..166e12c90ff6 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3570,7 +3570,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid, break; /* - * Higher order allocations must be able to be treated as + * High-order allocations must be able to be treated as * independent small pages by callers (as they can with * small-page vmallocs). 
Some drivers do their own refcounting * on vmalloc_to_page() pages, some use page->mapping, @@ -3633,7 +3633,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, page_order = vm_area_page_order(area); /* - * Higher order nofail allocations are really expensive and + * High-order nofail allocations are really expensive and * potentially dangerous (pre-mature OOM, disruptive reclaim * and compaction etc. * From eebc0f48468e68b93393d11f75ad17385a9acca8 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Thu, 5 Sep 2024 22:21:06 -0600 Subject: [PATCH 378/416] mm/codetag: fix a typo Link: https://lkml.kernel.org/r/20240906042108.1150526-1-yuzhao@google.com Fixes: 22d407b164ff ("lib: add allocation tagging support for memory allocation profiling") Signed-off-by: Yu Zhao Acked-by: Suren Baghdasaryan Acked-by: Muchun Song Cc: Kent Overstreet Signed-off-by: Andrew Morton --- include/linux/alloc_tag.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h index 8c61ccd161ba..896491d9ebe8 100644 --- a/include/linux/alloc_tag.h +++ b/include/linux/alloc_tag.h @@ -70,7 +70,7 @@ static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct) /* * When percpu variables are required to be defined as weak, static percpu * variables can't be used inside a function (see comments for DECLARE_PER_CPU_SECTION). - * Instead we will accound all module allocations to a single counter. + * Instead we will account all module allocations to a single counter. */ DECLARE_PER_CPU(struct alloc_tag_counters, _shared_alloc_tag); From 95599ef684d01136a8b77c16a7c853496786e173 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Thu, 5 Sep 2024 22:21:07 -0600 Subject: [PATCH 379/416] mm/codetag: fix pgalloc_tag_split() The current assumption is that a large folio can only be split into order-0 folios. That is not the case for hugeTLB demotion, nor for THP split: see commit c010d47f107f ("mm: thp: split huge page to any lower order pages"). When a large folio is split into ones of a lower non-zero order, only the new head pages should be tagged. Tagging tail pages can cause imbalanced "calls" counters, since only head pages are untagged by pgalloc_tag_sub() and the "calls" counts on tail pages are leaked, e.g., # echo 2048kB >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote_size # echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages # time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote # echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # grep alloc_gigantic_folio /proc/allocinfo Before this patch: 0 549427200 mm/hugetlb.c:1549 func:alloc_gigantic_folio real 0m2.057s user 0m0.000s sys 0m2.051s After this patch: 0 0 mm/hugetlb.c:1549 func:alloc_gigantic_folio real 0m1.711s user 0m0.000s sys 0m1.704s Not tagging tail pages also improves the splitting time, e.g., by about 15% when demoting 1GB hugeTLB folios to 2MB ones, as shown above. 
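A quick back-of-the-envelope check of the accounting change (illustrative plain userspace C, not kernel code; assumes 4 KiB base pages, so a 1 GiB folio is order 18 and a 2 MiB folio is order 9): with only the new head pages tagged, each demoted 1 GiB folio gains 511 new references instead of the 262143 tail-page references touched before the fix.

#include <stdio.h>

int main(void)
{
	int old_order = 18, new_order = 9;	/* 1 GiB -> 2 MiB folios */
	long new_heads = (1L << (old_order - new_order)) - 1;
	long old_tails = (1L << old_order) - 1;

	printf("head pages tagged after the fix:   %ld\n", new_heads);
	printf("tail pages touched before the fix: %ld\n", old_tails);
	return 0;
}
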
Link: https://lkml.kernel.org/r/20240906042108.1150526-2-yuzhao@google.com Fixes: be25d1d4e822 ("mm: create new codetag references during page splitting") Signed-off-by: Yu Zhao Acked-by: Suren Baghdasaryan Cc: Kent Overstreet Cc: Muchun Song Cc: Signed-off-by: Andrew Morton --- include/linux/mm.h | 30 ++++++++++++++++++++++++++++++ include/linux/pgalloc_tag.h | 31 ------------------------------- mm/huge_memory.c | 2 +- mm/hugetlb.c | 2 +- mm/page_alloc.c | 4 ++-- 5 files changed, 34 insertions(+), 35 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index b0ff06d18c71..6bb778cbaabf 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4084,4 +4084,34 @@ void vma_pgtable_walk_end(struct vm_area_struct *vma); int reserve_mem_find_by_name(const char *name, phys_addr_t *start, phys_addr_t *size); +#ifdef CONFIG_MEM_ALLOC_PROFILING +static inline void pgalloc_tag_split(struct folio *folio, int old_order, int new_order) +{ + int i; + struct alloc_tag *tag; + unsigned int nr_pages = 1 << new_order; + + if (!mem_alloc_profiling_enabled()) + return; + + tag = pgalloc_tag_get(&folio->page); + if (!tag) + return; + + for (i = nr_pages; i < (1 << old_order); i += nr_pages) { + union codetag_ref *ref = get_page_tag_ref(folio_page(folio, i)); + + if (ref) { + /* Set new reference to point to the original tag */ + alloc_tag_ref_set(ref, tag); + put_page_tag_ref(ref); + } + } +} +#else /* !CONFIG_MEM_ALLOC_PROFILING */ +static inline void pgalloc_tag_split(struct folio *folio, int old_order, int new_order) +{ +} +#endif /* CONFIG_MEM_ALLOC_PROFILING */ + #endif /* _LINUX_MM_H */ diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h index 207f0c83c8e9..59a3deb792a8 100644 --- a/include/linux/pgalloc_tag.h +++ b/include/linux/pgalloc_tag.h @@ -80,36 +80,6 @@ static inline void pgalloc_tag_sub(struct page *page, unsigned int nr) } } -static inline void pgalloc_tag_split(struct page *page, unsigned int nr) -{ - int i; - struct page_ext *first_page_ext; - struct page_ext *page_ext; - union codetag_ref *ref; - struct alloc_tag *tag; - - if (!mem_alloc_profiling_enabled()) - return; - - first_page_ext = page_ext = page_ext_get(page); - if (unlikely(!page_ext)) - return; - - ref = codetag_ref_from_page_ext(page_ext); - if (!ref->ct) - goto out; - - tag = ct_to_alloc_tag(ref->ct); - page_ext = page_ext_next(page_ext); - for (i = 1; i < nr; i++) { - /* Set new reference to point to the original tag */ - alloc_tag_ref_set(codetag_ref_from_page_ext(page_ext), tag); - page_ext = page_ext_next(page_ext); - } -out: - page_ext_put(first_page_ext); -} - static inline struct alloc_tag *pgalloc_tag_get(struct page *page) { struct alloc_tag *tag = NULL; @@ -142,7 +112,6 @@ static inline void clear_page_tag_ref(struct page *page) {} static inline void pgalloc_tag_add(struct page *page, struct task_struct *task, unsigned int nr) {} static inline void pgalloc_tag_sub(struct page *page, unsigned int nr) {} -static inline void pgalloc_tag_split(struct page *page, unsigned int nr) {} static inline struct alloc_tag *pgalloc_tag_get(struct page *page) { return NULL; } static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr) {} diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f15f7faf2a63..cc2872f12030 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3226,7 +3226,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, /* Caller disabled irqs, so they are still disabled here */ split_page_owner(head, order, new_order); - 
pgalloc_tag_split(head, 1 << order); + pgalloc_tag_split(folio, order, new_order); /* See comment in __split_huge_page_tail() */ if (folio_test_anon(folio)) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 3faf5aad142d..a8624c07d8bf 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3778,7 +3778,7 @@ static long demote_free_hugetlb_folios(struct hstate *src, struct hstate *dst, list_del(&folio->lru); split_page_owner(&folio->page, huge_page_order(src), huge_page_order(dst)); - pgalloc_tag_split(&folio->page, 1 << huge_page_order(src)); + pgalloc_tag_split(folio, huge_page_order(src), huge_page_order(dst)); for (i = 0; i < pages_per_huge_page(src); i += pages_per_huge_page(dst)) { struct page *page = folio_page(folio, i); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 74f13f676985..874e006f3d1c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2776,7 +2776,7 @@ void split_page(struct page *page, unsigned int order) for (i = 1; i < (1 << order); i++) set_page_refcounted(page + i); split_page_owner(page, order, 0); - pgalloc_tag_split(page, 1 << order); + pgalloc_tag_split(page_folio(page), order, 0); split_page_memcg(page, order, 0); } EXPORT_SYMBOL_GPL(split_page); @@ -4974,7 +4974,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order, struct page *last = page + nr; split_page_owner(page, order, 0); - pgalloc_tag_split(page, 1 << order); + pgalloc_tag_split(page_folio(page), order, 0); split_page_memcg(page, order, 0); while (page < --last) set_page_refcounted(last); From e0a955bf7f61cb034d228736d81c1ab3a47a3dca Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Thu, 5 Sep 2024 22:21:08 -0600 Subject: [PATCH 380/416] mm/codetag: add pgalloc_tag_copy() Add pgalloc_tag_copy() to transfer the codetag from the old folio to the new one during migration. This makes original allocation sites persist cross migration rather than lump into the get_new_folio callbacks passed into migrate_pages(), e.g., compaction_alloc(): # echo 1 >/proc/sys/vm/compact_memory # grep compaction_alloc /proc/allocinfo Before this patch: 132968448 32463 mm/compaction.c:1880 func:compaction_alloc After this patch: 0 0 mm/compaction.c:1880 func:compaction_alloc Link: https://lkml.kernel.org/r/20240906042108.1150526-3-yuzhao@google.com Fixes: dcfe378c81f7 ("lib: introduce support for page allocation tagging") Signed-off-by: Yu Zhao Acked-by: Suren Baghdasaryan Cc: Kent Overstreet Cc: Muchun Song Cc: Signed-off-by: Andrew Morton --- include/linux/alloc_tag.h | 24 ++++++++++-------------- include/linux/mm.h | 27 +++++++++++++++++++++++++++ mm/migrate.c | 1 + 3 files changed, 38 insertions(+), 14 deletions(-) diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h index 896491d9ebe8..1f0a9ff23a2c 100644 --- a/include/linux/alloc_tag.h +++ b/include/linux/alloc_tag.h @@ -137,7 +137,16 @@ static inline void alloc_tag_sub_check(union codetag_ref *ref) {} /* Caller should verify both ref and tag to be valid */ static inline void __alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag) { + alloc_tag_add_check(ref, tag); + if (!ref || !tag) + return; + ref->ct = &tag->ct; +} + +static inline void alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag) +{ + __alloc_tag_ref_set(ref, tag); /* * We need in increment the call counter every time we have a new * allocation or when we split a large allocation into smaller ones. 
@@ -147,22 +156,9 @@ static inline void __alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag this_cpu_inc(tag->counters->calls); } -static inline void alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag) -{ - alloc_tag_add_check(ref, tag); - if (!ref || !tag) - return; - - __alloc_tag_ref_set(ref, tag); -} - static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes) { - alloc_tag_add_check(ref, tag); - if (!ref || !tag) - return; - - __alloc_tag_ref_set(ref, tag); + alloc_tag_ref_set(ref, tag); this_cpu_add(tag->counters->bytes, bytes); } diff --git a/include/linux/mm.h b/include/linux/mm.h index 6bb778cbaabf..39f6faace590 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4108,10 +4108,37 @@ static inline void pgalloc_tag_split(struct folio *folio, int old_order, int new } } } + +static inline void pgalloc_tag_copy(struct folio *new, struct folio *old) +{ + struct alloc_tag *tag; + union codetag_ref *ref; + + tag = pgalloc_tag_get(&old->page); + if (!tag) + return; + + ref = get_page_tag_ref(&new->page); + if (!ref) + return; + + /* Clear the old ref to the original allocation tag. */ + clear_page_tag_ref(&old->page); + /* Decrement the counters of the tag on get_new_folio. */ + alloc_tag_sub(ref, folio_nr_pages(new)); + + __alloc_tag_ref_set(ref, tag); + + put_page_tag_ref(ref); +} #else /* !CONFIG_MEM_ALLOC_PROFILING */ static inline void pgalloc_tag_split(struct folio *folio, int old_order, int new_order) { } + +static inline void pgalloc_tag_copy(struct folio *new, struct folio *old) +{ +} #endif /* CONFIG_MEM_ALLOC_PROFILING */ #endif /* _LINUX_MM_H */ diff --git a/mm/migrate.c b/mm/migrate.c index 0f6b78fd73aa..dfdb3a136bf8 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -743,6 +743,7 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio) folio_set_readahead(newfolio); folio_copy_owner(newfolio, folio); + pgalloc_tag_copy(newfolio, folio); mem_cgroup_migrate(folio, newfolio); } From 6857be5fecaebd9773ff27b6d29b6fff3b1abbce Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:35 -0400 Subject: [PATCH 381/416] mm: introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud Patch series "mm: Support huge pfnmaps", v2. Overview ======== This series implements huge pfnmaps support for mm in general. Huge pfnmap allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow as large as 8GB or even bigger. Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. The last patch (from Alex Williamson) will be the first user of huge pfnmap, so as to enable vfio-pci driver to fault in huge pfn mappings. Implementation ============== In reality, it's relatively simple to add such support comparing to many other types of mappings, because of PFNMAP's specialties when there's no vmemmap backing it, so that most of the kernel routines on huge mappings should simply already fail for them, like GUPs or old-school follow_page() (which is recently rewritten to be folio_walk* APIs by David). One trick here is that we're still unmature on PUDs in generic paths here and there, as DAX is so far the only user. This patchset will add the 2nd user of it. Hugetlb can be a 3rd user if the hugetlb unification work can go on smoothly, but to be discussed later. 
The other trick is how to allow gup-fast working for such huge mappings even if there's no direct sign of knowing whether it's a normal page or MMIO mapping. This series chose to keep the pte_special solution, so that it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that gup-fast will be able to identify them and fail properly. Along the way, we'll also notice that the major pgtable pfn walker, aka, follow_pte(), will need to retire soon due to the fact that it only works with ptes. A new set of simple API is introduced (follow_pfnmap* API) to be able to do whatever follow_pte() can already do, plus that it can also process huge pfnmaps now. Half of this series is about that and converting all existing pfnmap walkers to use the new API properly. Hopefully the new API also looks better to avoid exposing e.g. pgtable lock details into the callers, so that it can be used in an even more straightforward way. Here, three more options will be introduced and involved in huge pfnmap: - ARCH_SUPPORTS_HUGE_PFNMAP Arch developers will need to select this option when huge pfnmap is supported in arch's Kconfig. After this patchset applied, both x86_64 and arm64 will start to enable it by default. - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP These options are for driver developers to identify whether current arch / config supports huge pfnmaps, making decision on whether it can use the huge pfnmap APIs to inject them. One can refer to the last vfio-pci patch from Alex on the use of them properly in a device driver. So after the whole set applied, and if one would enable some dynamic debug lines in vfio-pci core files, we should observe things like: vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100 vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100 vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100 In this specific case, it says that vfio-pci faults in PMDs properly for a few BAR0 offsets. Patch Layout ============ Patch 1: Introduce the new options mentioned above for huge PFNMAPs Patch 2: A tiny cleanup Patch 3-8: Preparation patches for huge pfnmap (include introduce special bit for pmd/pud) Patch 9-16: Introduce follow_pfnmap*() API, use it everywhere, and then drop follow_pte() API Patch 17: Add huge pfnmap support for x86_64 Patch 18: Add huge pfnmap support for arm64 Patch 19: Add vfio-pci support for all kinds of huge pfnmaps (Alex) TODO ==== More architectures / More page sizes ------------------------------------ Currently only x86_64 (2M+1G) and arm64 (2M) are supported. There seems to have plan to support arm64 1G later on top of this series [2]. Any arch will need to first support THP / THP_1G, then provide a special bit in pmds/puds to support huge pfnmaps. remap_pfn_range() support ------------------------- Currently, remap_pfn_range() still only maps PTEs. With the new option, remap_pfn_range() can logically start to inject either PMDs or PUDs when the alignment requirements match on the VAs. When the support is there, it should be able to silently benefit all drivers that is using remap_pfn_range() in its mmap() handler on better TLB hit rate and overall faster MMIO accesses similar to processor on hugepages. More driver support ------------------- VFIO is so far the only consumer for the huge pfnmaps after this series applied. 
Besides above remap_pfn_range() generic optimization, device driver can also try to optimize its mmap() on a better VA alignment for either PMD/PUD sizes. This may, iiuc, normally require userspace changes, as the driver doesn't normally decide the VA to map a bar. But I don't think I know all the drivers to know the full picture. Credits all go to Alex on help testing the GPU/NIC use cases above. [0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local [1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com [2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com This patch (of 19): This patch introduces the option to introduce special pte bit into pmd/puds. Archs can start to define pmd_special / pud_special when supported by selecting the new option. Per-arch support will be added later. Before that, create fallbacks for these helpers so that they are always available. Link: https://lkml.kernel.org/r/20240826204353.2228736-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20240826204353.2228736-2-peterx@redhat.com Signed-off-by: Peter Xu Cc: Alexander Gordeev Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Dave Hansen Cc: David Hildenbrand Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/mm.h | 24 ++++++++++++++++++++++++ mm/Kconfig | 13 +++++++++++++ 2 files changed, 37 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 39f6faace590..c2307a2d275d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2643,6 +2643,30 @@ static inline pte_t pte_mkspecial(pte_t pte) } #endif +#ifndef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP +static inline bool pmd_special(pmd_t pmd) +{ + return false; +} + +static inline pmd_t pmd_mkspecial(pmd_t pmd) +{ + return pmd; +} +#endif /* CONFIG_ARCH_SUPPORTS_PMD_PFNMAP */ + +#ifndef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP +static inline bool pud_special(pud_t pud) +{ + return false; +} + +static inline pud_t pud_mkspecial(pud_t pud) +{ + return pud; +} +#endif /* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */ + #ifndef CONFIG_ARCH_HAS_PTE_DEVMAP static inline int pte_devmap(pte_t pte) { diff --git a/mm/Kconfig b/mm/Kconfig index 7c23e8a889d0..1aa282e35dc7 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -875,6 +875,19 @@ endif # TRANSPARENT_HUGEPAGE config PGTABLE_HAS_HUGE_LEAVES def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE +# TODO: Allow to be enabled without THP +config ARCH_SUPPORTS_HUGE_PFNMAP + def_bool n + depends on TRANSPARENT_HUGEPAGE + +config ARCH_SUPPORTS_PMD_PFNMAP + def_bool y + depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE + +config ARCH_SUPPORTS_PUD_PFNMAP + def_bool y + depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD + # # UP and nommu archs use km based percpu allocator # From ef713ec3a566d3e5e011c5d6201eb661ebf94c1f Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:36 -0400 Subject: [PATCH 382/416] mm: drop is_huge_zero_pud() It constantly returns false since 2017. One assertion is added in 2019 but it should never have triggered, IOW it means what is checked should be asserted instead. If it didn't exist for 7 years maybe it's good idea to remove it and only add it when it comes. 
Link: https://lkml.kernel.org/r/20240826204353.2228736-3-peterx@redhat.com Signed-off-by: Peter Xu Reviewed-by: Jason Gunthorpe Acked-by: David Hildenbrand Cc: Matthew Wilcox Cc: Aneesh Kumar K.V Cc: Alexander Gordeev Cc: Alex Williamson Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Dave Hansen Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/huge_mm.h | 10 ---------- mm/huge_memory.c | 13 +------------ 2 files changed, 1 insertion(+), 22 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index e3896ebf0f94..ffca706bac81 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -433,11 +433,6 @@ static inline bool is_huge_zero_pmd(pmd_t pmd) return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd); } -static inline bool is_huge_zero_pud(pud_t pud) -{ - return false; -} - struct folio *mm_get_huge_zero_folio(struct mm_struct *mm); void mm_put_huge_zero_folio(struct mm_struct *mm); @@ -578,11 +573,6 @@ static inline bool is_huge_zero_pmd(pmd_t pmd) return false; } -static inline bool is_huge_zero_pud(pud_t pud) -{ - return false; -} - static inline void mm_put_huge_zero_folio(struct mm_struct *mm) { return; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index cc2872f12030..a4a14b81e013 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1453,10 +1453,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr, ptl = pud_lock(mm, pud); if (!pud_none(*pud)) { if (write) { - if (pud_pfn(*pud) != pfn_t_to_pfn(pfn)) { - WARN_ON_ONCE(!is_huge_zero_pud(*pud)); + if (WARN_ON_ONCE(pud_pfn(*pud) != pfn_t_to_pfn(pfn))) goto out_unlock; - } entry = pud_mkyoung(*pud); entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma); if (pudp_set_access_flags(vma, addr, pud, entry, 1)) @@ -1704,15 +1702,6 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm, if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud))) goto out_unlock; - /* - * When page table lock is held, the huge zero pud should not be - * under splitting since we don't split the page itself, only pud to - * a page table. - */ - if (is_huge_zero_pud(pud)) { - /* No huge zero pud yet */ - } - /* * TODO: once we support anonymous pages, use * folio_try_dup_anon_rmap_*() and split if duplicating fails. From 3c8e44c9b369b3d422516b3f2bf47a6e3c61d1ea Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:37 -0400 Subject: [PATCH 383/416] mm: mark special bits for huge pfn mappings when inject We need these special bits to be around on pfnmaps. Mark properly for !devmap case, reflecting that there's no page struct backing the entry. 
Link: https://lkml.kernel.org/r/20240826204353.2228736-4-peterx@redhat.com Reviewed-by: Jason Gunthorpe Signed-off-by: Peter Xu Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Dave Hansen Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/huge_memory.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a4a14b81e013..f0983b17621c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1370,6 +1370,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, entry = pmd_mkhuge(pfn_t_pmd(pfn, prot)); if (pfn_t_devmap(pfn)) entry = pmd_mkdevmap(entry); + else + entry = pmd_mkspecial(entry); if (write) { entry = pmd_mkyoung(pmd_mkdirty(entry)); entry = maybe_pmd_mkwrite(entry, vma); @@ -1466,6 +1468,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr, entry = pud_mkhuge(pfn_t_pud(pfn, prot)); if (pfn_t_devmap(pfn)) entry = pud_mkdevmap(entry); + else + entry = pud_mkspecial(entry); if (write) { entry = pud_mkyoung(pud_mkdirty(entry)); entry = maybe_pud_mkwrite(entry, vma); From 5dd40721f147e83733ad34848330913cb633046e Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:38 -0400 Subject: [PATCH 384/416] mm: allow THP orders for PFNMAPs This enables PFNMAPs to be mapped at either pmd/pud layers. Generalize the dax case into vma_is_special_huge() so as to cover both. Meanwhile, rename the macro to THP_ORDERS_ALL_SPECIAL. Link: https://lkml.kernel.org/r/20240826204353.2228736-5-peterx@redhat.com Signed-off-by: Peter Xu Reviewed-by: Jason Gunthorpe Acked-by: David Hildenbrand Cc: Matthew Wilcox Cc: Gavin Shan Cc: Ryan Roberts Cc: Zi Yan Cc: Alexander Gordeev Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Dave Hansen Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/huge_mm.h | 6 +++--- mm/huge_memory.c | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index ffca706bac81..0b0539f4ee1a 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -76,9 +76,9 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; /* * Mask of all large folio orders supported for file THP. Folios in a DAX * file is never split and the MAX_PAGECACHE_ORDER limit does not apply to - * it. + * it. Same to PFNMAPs where there's neither page* nor pagecache. */ -#define THP_ORDERS_ALL_FILE_DAX \ +#define THP_ORDERS_ALL_SPECIAL \ (BIT(PMD_ORDER) | BIT(PUD_ORDER)) #define THP_ORDERS_ALL_FILE_DEFAULT \ ((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0)) @@ -87,7 +87,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; * Mask of all large folio orders supported for THP. 
*/ #define THP_ORDERS_ALL \ - (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE_DAX | THP_ORDERS_ALL_FILE_DEFAULT) + (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_SPECIAL | THP_ORDERS_ALL_FILE_DEFAULT) #define TVA_SMAPS (1 << 0) /* Will be used for procfs */ #define TVA_IN_PF (1 << 1) /* Page fault handler */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f0983b17621c..7ab9a171e3d0 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -97,8 +97,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, /* Check the intersection of requested and supported orders. */ if (vma_is_anonymous(vma)) supported_orders = THP_ORDERS_ALL_ANON; - else if (vma_is_dax(vma)) - supported_orders = THP_ORDERS_ALL_FILE_DAX; + else if (vma_is_special_huge(vma)) + supported_orders = THP_ORDERS_ALL_SPECIAL; else supported_orders = THP_ORDERS_ALL_FILE_DEFAULT; From ae3c99e650da4a8f4deb3670c29059de375a88be Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:39 -0400 Subject: [PATCH 385/416] mm/gup: detect huge pfnmap entries in gup-fast Since gup-fast doesn't have the vma reference, teach it to detect such huge pfnmaps by checking the special bit for pmd/pud too, just like ptes. Link: https://lkml.kernel.org/r/20240826204353.2228736-6-peterx@redhat.com Signed-off-by: Peter Xu Acked-by: David Hildenbrand Reviewed-by: Jason Gunthorpe Cc: Alexander Gordeev Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Dave Hansen Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/gup.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/mm/gup.c b/mm/gup.c index 5defd5e6d8f8..69c483e2cc32 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -3038,6 +3038,9 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr, if (!pmd_access_permitted(orig, flags & FOLL_WRITE)) return 0; + if (pmd_special(orig)) + return 0; + if (pmd_devmap(orig)) { if (unlikely(flags & FOLL_LONGTERM)) return 0; @@ -3082,6 +3085,9 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr, if (!pud_access_permitted(orig, flags & FOLL_WRITE)) return 0; + if (pud_special(orig)) + return 0; + if (pud_devmap(orig)) { if (unlikely(flags & FOLL_LONGTERM)) return 0; From 10d83d7781a8a6ff02bafd172c1ab183b27f8d5a Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:40 -0400 Subject: [PATCH 386/416] mm/pagewalk: check pfnmap for folio_walk_start() Teach folio_walk_start() to recognize special pmd/pud mappings, and fail them properly as it means there's no folio backing them. 
[peterx@redhat.com: remove some stale comments, per David] Link: https://lkml.kernel.org/r/20240829202237.2640288-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20240826204353.2228736-7-peterx@redhat.com Signed-off-by: Peter Xu Cc: David Hildenbrand Cc: Alexander Gordeev Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Dave Hansen Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/memory.c | 9 ++++----- mm/pagewalk.c | 2 +- 2 files changed, 5 insertions(+), 6 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 93c0c25433d0..9a540e231625 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -672,11 +672,10 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, { unsigned long pfn = pmd_pfn(pmd); - /* - * There is no pmd_special() but there may be special pmds, e.g. - * in a direct-access (dax) mapping, so let's just replicate the - * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here. - */ + /* Currently it's only used for huge pfnmaps */ + if (unlikely(pmd_special(pmd))) + return NULL; + if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) { if (vma->vm_flags & VM_MIXEDMAP) { if (!pfn_valid(pfn)) diff --git a/mm/pagewalk.c b/mm/pagewalk.c index cd79fb3b89e5..461ea3bbd8d9 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -753,7 +753,7 @@ struct folio *folio_walk_start(struct folio_walk *fw, fw->pudp = pudp; fw->pud = pud; - if (!pud_present(pud) || pud_devmap(pud)) { + if (!pud_present(pud) || pud_devmap(pud) || pud_special(pud)) { spin_unlock(ptl); goto not_found; } else if (!pud_leaf(pud)) { From bc02afbd4d73c4424ea12a0c35fa96e27172e8cb Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:41 -0400 Subject: [PATCH 387/416] mm/fork: accept huge pfnmap entries Teach the fork code to properly copy pfnmaps for pmd/pud levels. Pud is much easier, the write bit needs to be persisted though for writable and shared pud mappings like PFNMAP ones, otherwise a follow up write in either parent or child process will trigger a write fault. Do the same for pmd level. 
Link: https://lkml.kernel.org/r/20240826204353.2228736-8-peterx@redhat.com Signed-off-by: Peter Xu Cc: Alexander Gordeev Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Dave Hansen Cc: David Hildenbrand Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/huge_memory.c | 29 ++++++++++++++++++++++++++--- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7ab9a171e3d0..2a73efea02d7 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1583,6 +1583,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pgtable_t pgtable = NULL; int ret = -ENOMEM; + pmd = pmdp_get_lockless(src_pmd); + if (unlikely(pmd_special(pmd))) { + dst_ptl = pmd_lock(dst_mm, dst_pmd); + src_ptl = pmd_lockptr(src_mm, src_pmd); + spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); + /* + * No need to recheck the pmd, it can't change with write + * mmap lock held here. + * + * Meanwhile, making sure it's not a CoW VMA with writable + * mapping, otherwise it means either the anon page wrongly + * applied special bit, or we made the PRIVATE mapping be + * able to wrongly write to the backend MMIO. + */ + VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)); + goto set_pmd; + } + /* Skip if can be re-fill on fault */ if (!vma_is_anonymous(dst_vma)) return 0; @@ -1664,7 +1682,9 @@ out_zero_page: pmdp_set_wrprotect(src_mm, addr, src_pmd); if (!userfaultfd_wp(dst_vma)) pmd = pmd_clear_uffd_wp(pmd); - pmd = pmd_mkold(pmd_wrprotect(pmd)); + pmd = pmd_wrprotect(pmd); +set_pmd: + pmd = pmd_mkold(pmd); set_pmd_at(dst_mm, addr, dst_pmd, pmd); ret = 0; @@ -1710,8 +1730,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm, * TODO: once we support anonymous pages, use * folio_try_dup_anon_rmap_*() and split if duplicating fails. */ - pudp_set_wrprotect(src_mm, addr, src_pud); - pud = pud_mkold(pud_wrprotect(pud)); + if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) { + pudp_set_wrprotect(src_mm, addr, src_pud); + pud = pud_wrprotect(pud); + } + pud = pud_mkold(pud); set_pud_at(dst_mm, addr, dst_pud, pud); ret = 0; From 0515e022e167cfacf1fee092eb93aa9514e23c0a Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:42 -0400 Subject: [PATCH 388/416] mm: always define pxx_pgprot() There're: - 8 archs (arc, arm64, include, mips, powerpc, s390, sh, x86) that support pte_pgprot(). - 2 archs (x86, sparc) that support pmd_pgprot(). - 1 arch (x86) that support pud_pgprot(). Always define them to be used in generic code, and then we don't need to fiddle with "#ifdef"s when doing so. 
Link: https://lkml.kernel.org/r/20240826204353.2228736-9-peterx@redhat.com Signed-off-by: Peter Xu Reviewed-by: Jason Gunthorpe Cc: Alexander Gordeev Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Dave Hansen Cc: David Hildenbrand Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/arm64/include/asm/pgtable.h | 1 + arch/powerpc/include/asm/pgtable.h | 1 + arch/s390/include/asm/pgtable.h | 1 + arch/sparc/include/asm/pgtable_64.h | 1 + include/linux/pgtable.h | 12 ++++++++++++ 5 files changed, 16 insertions(+) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 7a4f5604be3f..b78cc4a6758b 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -384,6 +384,7 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages) /* * Select all bits except the pfn */ +#define pte_pgprot pte_pgprot static inline pgprot_t pte_pgprot(pte_t pte) { unsigned long pfn = pte_pfn(pte); diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 264a6c09517a..2f72ad885332 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -65,6 +65,7 @@ static inline unsigned long pte_pfn(pte_t pte) /* * Select all bits except the pfn */ +#define pte_pgprot pte_pgprot static inline pgprot_t pte_pgprot(pte_t pte) { unsigned long pte_flags; diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 3fa280d0672a..0ffbaf741955 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -955,6 +955,7 @@ static inline int pte_unused(pte_t pte) * young/old accounting is not supported, i.e _PAGE_PROTECT and _PAGE_INVALID * must not be set. */ +#define pte_pgprot pte_pgprot static inline pgprot_t pte_pgprot(pte_t pte) { unsigned long pte_flags = pte_val(pte) & _PAGE_CHG_MASK; diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 3fe429d73a65..2b7f358762c1 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -783,6 +783,7 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd) return __pmd(pte_val(pte)); } +#define pmd_pgprot pmd_pgprot static inline pgprot_t pmd_pgprot(pmd_t entry) { unsigned long val = pmd_val(entry); diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 780f3b439d98..e8b2ac6bd2ae 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1956,6 +1956,18 @@ typedef unsigned int pgtbl_mod_mask; #define MAX_PTRS_PER_P4D PTRS_PER_P4D #endif +#ifndef pte_pgprot +#define pte_pgprot(x) ((pgprot_t) {0}) +#endif + +#ifndef pmd_pgprot +#define pmd_pgprot(x) ((pgprot_t) {0}) +#endif + +#ifndef pud_pgprot +#define pud_pgprot(x) ((pgprot_t) {0}) +#endif + /* description of effects of mapping type and prot in current implementation. * this is due to the limited x86 page protection hardware. The expected * behavior is in parens: From 6da8e9634bb7e3fdad9ae0e4db873a05036c4343 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:43 -0400 Subject: [PATCH 389/416] mm: new follow_pfnmap API Introduce a pair of APIs to follow pfn mappings to get entry information. 
It's very similar to what follow_pte() does before, but different in that it recognizes huge pfn mappings. Link: https://lkml.kernel.org/r/20240826204353.2228736-10-peterx@redhat.com Signed-off-by: Peter Xu Cc: Alexander Gordeev Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Dave Hansen Cc: David Hildenbrand Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/mm.h | 31 ++++++++++ mm/memory.c | 150 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 181 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index c2307a2d275d..62bd89414897 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2373,6 +2373,37 @@ int follow_pte(struct vm_area_struct *vma, unsigned long address, int generic_access_phys(struct vm_area_struct *vma, unsigned long addr, void *buf, int len, int write); +struct follow_pfnmap_args { + /** + * Inputs: + * @vma: Pointer to @vm_area_struct struct + * @address: the virtual address to walk + */ + struct vm_area_struct *vma; + unsigned long address; + /** + * Internals: + * + * The caller shouldn't touch any of these. + */ + spinlock_t *lock; + pte_t *ptep; + /** + * Outputs: + * + * @pfn: the PFN of the address + * @pgprot: the pgprot_t of the mapping + * @writable: whether the mapping is writable + * @special: whether the mapping is a special mapping (real PFN maps) + */ + unsigned long pfn; + pgprot_t pgprot; + bool writable; + bool special; +}; +int follow_pfnmap_start(struct follow_pfnmap_args *args); +void follow_pfnmap_end(struct follow_pfnmap_args *args); + extern void truncate_pagecache(struct inode *inode, loff_t new); extern void truncate_setsize(struct inode *inode, loff_t newsize); void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to); diff --git a/mm/memory.c b/mm/memory.c index 9a540e231625..3878bf69bc14 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -6172,6 +6172,156 @@ out: } EXPORT_SYMBOL_GPL(follow_pte); +static inline void pfnmap_args_setup(struct follow_pfnmap_args *args, + spinlock_t *lock, pte_t *ptep, + pgprot_t pgprot, unsigned long pfn_base, + unsigned long addr_mask, bool writable, + bool special) +{ + args->lock = lock; + args->ptep = ptep; + args->pfn = pfn_base + ((args->address & ~addr_mask) >> PAGE_SHIFT); + args->pgprot = pgprot; + args->writable = writable; + args->special = special; +} + +static inline void pfnmap_lockdep_assert(struct vm_area_struct *vma) +{ +#ifdef CONFIG_LOCKDEP + struct address_space *mapping = vma->vm_file->f_mapping; + + if (mapping) + lockdep_assert(lockdep_is_held(&vma->vm_file->f_mapping->i_mmap_rwsem) || + lockdep_is_held(&vma->vm_mm->mmap_lock)); + else + lockdep_assert(lockdep_is_held(&vma->vm_mm->mmap_lock)); +#endif +} + +/** + * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address + * @args: Pointer to struct @follow_pfnmap_args + * + * The caller needs to setup args->vma and args->address to point to the + * virtual address as the target of such lookup. On a successful return, + * the results will be put into other output fields. + * + * After the caller finished using the fields, the caller must invoke + * another follow_pfnmap_end() to proper releases the locks and resources + * of such look up request. 
+ * + * During the start() and end() calls, the results in @args will be valid + * as proper locks will be held. After the end() is called, all the fields + * in @follow_pfnmap_args will be invalid to be further accessed. Further + * use of such information after end() may require proper synchronizations + * by the caller with page table updates, otherwise it can create a + * security bug. + * + * If the PTE maps a refcounted page, callers are responsible to protect + * against invalidation with MMU notifiers; otherwise access to the PFN at + * a later point in time can trigger use-after-free. + * + * Only IO mappings and raw PFN mappings are allowed. The mmap semaphore + * should be taken for read, and the mmap semaphore cannot be released + * before the end() is invoked. + * + * This function must not be used to modify PTE content. + * + * Return: zero on success, negative otherwise. + */ +int follow_pfnmap_start(struct follow_pfnmap_args *args) +{ + struct vm_area_struct *vma = args->vma; + unsigned long address = args->address; + struct mm_struct *mm = vma->vm_mm; + spinlock_t *lock; + pgd_t *pgdp; + p4d_t *p4dp, p4d; + pud_t *pudp, pud; + pmd_t *pmdp, pmd; + pte_t *ptep, pte; + + pfnmap_lockdep_assert(vma); + + if (unlikely(address < vma->vm_start || address >= vma->vm_end)) + goto out; + + if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) + goto out; +retry: + pgdp = pgd_offset(mm, address); + if (pgd_none(*pgdp) || unlikely(pgd_bad(*pgdp))) + goto out; + + p4dp = p4d_offset(pgdp, address); + p4d = READ_ONCE(*p4dp); + if (p4d_none(p4d) || unlikely(p4d_bad(p4d))) + goto out; + + pudp = pud_offset(p4dp, address); + pud = READ_ONCE(*pudp); + if (pud_none(pud)) + goto out; + if (pud_leaf(pud)) { + lock = pud_lock(mm, pudp); + if (!unlikely(pud_leaf(pud))) { + spin_unlock(lock); + goto retry; + } + pfnmap_args_setup(args, lock, NULL, pud_pgprot(pud), + pud_pfn(pud), PUD_MASK, pud_write(pud), + pud_special(pud)); + return 0; + } + + pmdp = pmd_offset(pudp, address); + pmd = pmdp_get_lockless(pmdp); + if (pmd_leaf(pmd)) { + lock = pmd_lock(mm, pmdp); + if (!unlikely(pmd_leaf(pmd))) { + spin_unlock(lock); + goto retry; + } + pfnmap_args_setup(args, lock, NULL, pmd_pgprot(pmd), + pmd_pfn(pmd), PMD_MASK, pmd_write(pmd), + pmd_special(pmd)); + return 0; + } + + ptep = pte_offset_map_lock(mm, pmdp, address, &lock); + if (!ptep) + goto out; + pte = ptep_get(ptep); + if (!pte_present(pte)) + goto unlock; + pfnmap_args_setup(args, lock, ptep, pte_pgprot(pte), + pte_pfn(pte), PAGE_MASK, pte_write(pte), + pte_special(pte)); + return 0; +unlock: + pte_unmap_unlock(ptep, lock); +out: + return -EINVAL; +} +EXPORT_SYMBOL_GPL(follow_pfnmap_start); + +/** + * follow_pfnmap_end(): End a follow_pfnmap_start() process + * @args: Pointer to struct @follow_pfnmap_args + * + * Must be used in pair of follow_pfnmap_start(). See the start() function + * above for more information. + */ +void follow_pfnmap_end(struct follow_pfnmap_args *args) +{ + if (args->lock) + spin_unlock(args->lock); + if (args->ptep) + pte_unmap(args->ptep); +} +EXPORT_SYMBOL_GPL(follow_pfnmap_end); + #ifdef CONFIG_HAVE_IOREMAP_PROT /** * generic_access_phys - generic implementation for iomem mmap access From 5731aacd54a883dd2c1a5e8c85e1fe78fc728dc7 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:44 -0400 Subject: [PATCH 390/416] KVM: use follow_pfnmap API Use the new pfnmap API to allow huge MMIO mappings for VMs. The rest work is done perfectly on the other side (host_pfn_mapping_level()). 
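The conversions in this and the following patches all follow the same start()/end() pattern. A minimal caller sketch (hypothetical, not taken verbatim from any patch in the series; assumes the mmap lock is already held for read, as the API requires):

static int example_follow_pfn(struct vm_area_struct *vma, unsigned long addr,
			      bool write_fault, unsigned long *pfn)
{
	struct follow_pfnmap_args args = { .vma = vma, .address = addr };
	int ret;

	ret = follow_pfnmap_start(&args);
	if (ret)
		return ret;		/* nothing mapped, or not a pfnmap VMA */

	if (write_fault && !args.writable)
		ret = -EFAULT;
	else
		*pfn = args.pfn;	/* copy out before end() invalidates args */

	follow_pfnmap_end(&args);	/* drops the page table lock */
	return ret;
}
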
Link: https://lkml.kernel.org/r/20240826204353.2228736-11-peterx@redhat.com Signed-off-by: Peter Xu Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Alexander Gordeev Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Dave Hansen Cc: David Hildenbrand Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Ryan Roberts Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- virt/kvm/kvm_main.c | 19 +++++++------------ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index cb2b78e92910..f416d5e3f9c0 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2860,13 +2860,11 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, unsigned long addr, bool write_fault, bool *writable, kvm_pfn_t *p_pfn) { + struct follow_pfnmap_args args = { .vma = vma, .address = addr }; kvm_pfn_t pfn; - pte_t *ptep; - pte_t pte; - spinlock_t *ptl; int r; - r = follow_pte(vma, addr, &ptep, &ptl); + r = follow_pfnmap_start(&args); if (r) { /* * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does @@ -2881,21 +2879,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, if (r) return r; - r = follow_pte(vma, addr, &ptep, &ptl); + r = follow_pfnmap_start(&args); if (r) return r; } - pte = ptep_get(ptep); - - if (write_fault && !pte_write(pte)) { + if (write_fault && !args.writable) { pfn = KVM_PFN_ERR_RO_FAULT; goto out; } if (writable) - *writable = pte_write(pte); - pfn = pte_pfn(pte); + *writable = args.writable; + pfn = args.pfn; /* * Get a reference here because callers of *hva_to_pfn* and @@ -2916,9 +2912,8 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, */ if (!kvm_try_get_pfn(pfn)) r = -EFAULT; - out: - pte_unmap_unlock(ptep, ptl); + follow_pfnmap_end(&args); *p_pfn = pfn; return r; From bd8c2d18bf5cccd8842d00b17d6f222beb98b1b3 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:45 -0400 Subject: [PATCH 391/416] s390/pci_mmio: use follow_pfnmap API Use the new API that can understand huge pfn mappings. 
Link: https://lkml.kernel.org/r/20240826204353.2228736-12-peterx@redhat.com Signed-off-by: Peter Xu Cc: Niklas Schnelle Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Vasily Gorbik Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Sven Schnelle Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David Hildenbrand Cc: Gavin Shan Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Matthew Wilcox Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/s390/pci/pci_mmio.c | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c index 5398729bfe1b..de5c0b389a3e 100644 --- a/arch/s390/pci/pci_mmio.c +++ b/arch/s390/pci/pci_mmio.c @@ -118,12 +118,11 @@ static inline int __memcpy_toio_inuser(void __iomem *dst, SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, const void __user *, user_buffer, size_t, length) { + struct follow_pfnmap_args args = { }; u8 local_buf[64]; void __iomem *io_addr; void *buf; struct vm_area_struct *vma; - pte_t *ptep; - spinlock_t *ptl; long ret; if (!zpci_is_enabled()) @@ -169,11 +168,13 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, if (!(vma->vm_flags & VM_WRITE)) goto out_unlock_mmap; - ret = follow_pte(vma, mmio_addr, &ptep, &ptl); + args.address = mmio_addr; + args.vma = vma; + ret = follow_pfnmap_start(&args); if (ret) goto out_unlock_mmap; - io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) | + io_addr = (void __iomem *)((args.pfn << PAGE_SHIFT) | (mmio_addr & ~PAGE_MASK)); if ((unsigned long) io_addr < ZPCI_IOMAP_ADDR_BASE) @@ -181,7 +182,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, ret = zpci_memcpy_toio(io_addr, buf, length); out_unlock_pt: - pte_unmap_unlock(ptep, ptl); + follow_pfnmap_end(&args); out_unlock_mmap: mmap_read_unlock(current->mm); out_free: @@ -260,12 +261,11 @@ static inline int __memcpy_fromio_inuser(void __user *dst, SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr, void __user *, user_buffer, size_t, length) { + struct follow_pfnmap_args args = { }; u8 local_buf[64]; void __iomem *io_addr; void *buf; struct vm_area_struct *vma; - pte_t *ptep; - spinlock_t *ptl; long ret; if (!zpci_is_enabled()) @@ -308,11 +308,13 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr, if (!(vma->vm_flags & VM_WRITE)) goto out_unlock_mmap; - ret = follow_pte(vma, mmio_addr, &ptep, &ptl); + args.vma = vma; + args.address = mmio_addr; + ret = follow_pfnmap_start(&args); if (ret) goto out_unlock_mmap; - io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) | + io_addr = (void __iomem *)((args.pfn << PAGE_SHIFT) | (mmio_addr & ~PAGE_MASK)); if ((unsigned long) io_addr < ZPCI_IOMAP_ADDR_BASE) { @@ -322,7 +324,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr, ret = zpci_memcpy_fromio(buf, io_addr, length); out_unlock_pt: - pte_unmap_unlock(ptep, ptl); + follow_pfnmap_end(&args); out_unlock_mmap: mmap_read_unlock(current->mm); From cbea8536d933d546ceb1005bf9c04f9d01da8092 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:46 -0400 Subject: [PATCH 392/416] mm/x86/pat: use the new follow_pfnmap API Use the new API that can understand huge pfn mappings. 
Link: https://lkml.kernel.org/r/20240826204353.2228736-13-peterx@redhat.com Signed-off-by: Peter Xu Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: Alexander Gordeev Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Catalin Marinas Cc: Christian Borntraeger Cc: David Hildenbrand Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Jason Gunthorpe Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/x86/mm/pat/memtype.c | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-) diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c index 1fa0bf6ed295..f73b5ce270b3 100644 --- a/arch/x86/mm/pat/memtype.c +++ b/arch/x86/mm/pat/memtype.c @@ -951,23 +951,20 @@ static void free_pfn_range(u64 paddr, unsigned long size) static int follow_phys(struct vm_area_struct *vma, unsigned long *prot, resource_size_t *phys) { - pte_t *ptep, pte; - spinlock_t *ptl; + struct follow_pfnmap_args args = { .vma = vma, .address = vma->vm_start }; - if (follow_pte(vma, vma->vm_start, &ptep, &ptl)) + if (follow_pfnmap_start(&args)) return -EINVAL; - pte = ptep_get(ptep); - /* Never return PFNs of anon folios in COW mappings. */ - if (vm_normal_folio(vma, vma->vm_start, pte)) { - pte_unmap_unlock(ptep, ptl); + if (!args.special) { + follow_pfnmap_end(&args); return -EINVAL; } - *prot = pgprot_val(pte_pgprot(pte)); - *phys = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT; - pte_unmap_unlock(ptep, ptl); + *prot = pgprot_val(args.pgprot); + *phys = (resource_size_t)args.pfn << PAGE_SHIFT; + follow_pfnmap_end(&args); return 0; } From a77f9489f1d7873a56e1d6640cc0c4865f64176b Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:47 -0400 Subject: [PATCH 393/416] vfio: use the new follow_pfnmap API Use the new API that can understand huge pfn mappings. 
Link: https://lkml.kernel.org/r/20240826204353.2228736-14-peterx@redhat.com Signed-off-by: Peter Xu Cc: Alex Williamson Cc: Jason Gunthorpe Cc: Alexander Gordeev Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Dave Hansen Cc: David Hildenbrand Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- drivers/vfio/vfio_iommu_type1.c | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 0960699e7554..bf391b40e576 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -513,12 +513,10 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm, unsigned long vaddr, unsigned long *pfn, bool write_fault) { - pte_t *ptep; - pte_t pte; - spinlock_t *ptl; + struct follow_pfnmap_args args = { .vma = vma, .address = vaddr }; int ret; - ret = follow_pte(vma, vaddr, &ptep, &ptl); + ret = follow_pfnmap_start(&args); if (ret) { bool unlocked = false; @@ -532,19 +530,17 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm, if (ret) return ret; - ret = follow_pte(vma, vaddr, &ptep, &ptl); + ret = follow_pfnmap_start(&args); if (ret) return ret; } - pte = ptep_get(ptep); - - if (write_fault && !pte_write(pte)) + if (write_fault && !args.writable) ret = -EFAULT; else - *pfn = pte_pfn(pte); + *pfn = args.pfn; - pte_unmap_unlock(ptep, ptl); + follow_pfnmap_end(&args); return ret; } From e6bc784c24fdabdad923aaa45c1928b2cde8a0c9 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:48 -0400 Subject: [PATCH 394/416] acrn: use the new follow_pfnmap API Use the new API that can understand huge pfn mappings. 
Link: https://lkml.kernel.org/r/20240826204353.2228736-15-peterx@redhat.com Signed-off-by: Peter Xu Cc: Alexander Gordeev Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Dave Hansen Cc: David Hildenbrand Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- drivers/virt/acrn/mm.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/drivers/virt/acrn/mm.c b/drivers/virt/acrn/mm.c index db8ff1d0ac23..4c2f28715b70 100644 --- a/drivers/virt/acrn/mm.c +++ b/drivers/virt/acrn/mm.c @@ -177,9 +177,7 @@ int acrn_vm_ram_map(struct acrn_vm *vm, struct acrn_vm_memmap *memmap) vma = vma_lookup(current->mm, memmap->vma_base); if (vma && ((vma->vm_flags & VM_PFNMAP) != 0)) { unsigned long start_pfn, cur_pfn; - spinlock_t *ptl; bool writable; - pte_t *ptep; if ((memmap->vma_base + memmap->len) > vma->vm_end) { mmap_read_unlock(current->mm); @@ -187,16 +185,20 @@ int acrn_vm_ram_map(struct acrn_vm *vm, struct acrn_vm_memmap *memmap) } for (i = 0; i < nr_pages; i++) { - ret = follow_pte(vma, memmap->vma_base + i * PAGE_SIZE, - &ptep, &ptl); + struct follow_pfnmap_args args = { + .vma = vma, + .address = memmap->vma_base + i * PAGE_SIZE, + }; + + ret = follow_pfnmap_start(&args); if (ret) break; - cur_pfn = pte_pfn(ptep_get(ptep)); + cur_pfn = args.pfn; if (i == 0) start_pfn = cur_pfn; - writable = !!pte_write(ptep_get(ptep)); - pte_unmap_unlock(ptep, ptl); + writable = args.writable; + follow_pfnmap_end(&args); /* Disallow write access if the PTE is not writable. */ if (!writable && From b17269a51cc7f046a6f2cf9a6c314a0de885e5a5 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:49 -0400 Subject: [PATCH 395/416] mm/access_process_vm: use the new follow_pfnmap API Use the new API that can understand huge pfn mappings. 
Link: https://lkml.kernel.org/r/20240826204353.2228736-16-peterx@redhat.com Signed-off-by: Peter Xu Cc: Alexander Gordeev Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Dave Hansen Cc: David Hildenbrand Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- mm/memory.c | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 3878bf69bc14..cfc278691466 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -6341,34 +6341,34 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr, resource_size_t phys_addr; unsigned long prot = 0; void __iomem *maddr; - pte_t *ptep, pte; - spinlock_t *ptl; int offset = offset_in_page(addr); int ret = -EINVAL; + bool writable; + struct follow_pfnmap_args args = { .vma = vma, .address = addr }; retry: - if (follow_pte(vma, addr, &ptep, &ptl)) + if (follow_pfnmap_start(&args)) return -EINVAL; - pte = ptep_get(ptep); - pte_unmap_unlock(ptep, ptl); + prot = pgprot_val(args.pgprot); + phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT; + writable = args.writable; + follow_pfnmap_end(&args); - prot = pgprot_val(pte_pgprot(pte)); - phys_addr = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT; - - if ((write & FOLL_WRITE) && !pte_write(pte)) + if ((write & FOLL_WRITE) && !writable) return -EINVAL; maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot); if (!maddr) return -ENOMEM; - if (follow_pte(vma, addr, &ptep, &ptl)) + if (follow_pfnmap_start(&args)) goto out_unmap; - if (!pte_same(pte, ptep_get(ptep))) { - pte_unmap_unlock(ptep, ptl); + if ((prot != pgprot_val(args.pgprot)) || + (phys_addr != (args.pfn << PAGE_SHIFT)) || + (writable != args.writable)) { + follow_pfnmap_end(&args); iounmap(maddr); - goto retry; } @@ -6377,7 +6377,7 @@ retry: else memcpy_fromio(buf, maddr + offset, len); ret = len; - pte_unmap_unlock(ptep, ptl); + follow_pfnmap_end(&args); out_unmap: iounmap(maddr); From b0a1c0d0edcd75a0f8ec5fd19dbd64b8d097f534 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:50 -0400 Subject: [PATCH 396/416] mm: remove follow_pte() follow_pte() users have been converted to follow_pfnmap*(). Remove the API. 
Link: https://lkml.kernel.org/r/20240826204353.2228736-17-peterx@redhat.com Signed-off-by: Peter Xu Cc: Alexander Gordeev Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Dave Hansen Cc: David Hildenbrand Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/mm.h | 2 -- mm/memory.c | 73 ---------------------------------------------- 2 files changed, 75 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 62bd89414897..d750be768121 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2368,8 +2368,6 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); int copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma); -int follow_pte(struct vm_area_struct *vma, unsigned long address, - pte_t **ptepp, spinlock_t **ptlp); int generic_access_phys(struct vm_area_struct *vma, unsigned long addr, void *buf, int len, int write); diff --git a/mm/memory.c b/mm/memory.c index cfc278691466..42674c0748cb 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -6099,79 +6099,6 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) } #endif /* __PAGETABLE_PMD_FOLDED */ -/** - * follow_pte - look up PTE at a user virtual address - * @vma: the memory mapping - * @address: user virtual address - * @ptepp: location to store found PTE - * @ptlp: location to store the lock for the PTE - * - * On a successful return, the pointer to the PTE is stored in @ptepp; - * the corresponding lock is taken and its location is stored in @ptlp. - * - * The contents of the PTE are only stable until @ptlp is released using - * pte_unmap_unlock(). This function will fail if the PTE is non-present. - * Present PTEs may include PTEs that map refcounted pages, such as - * anonymous folios in COW mappings. - * - * Callers must be careful when relying on PTE content after - * pte_unmap_unlock(). Especially if the PTE maps a refcounted page, - * callers must protect against invalidation with MMU notifiers; otherwise - * access to the PFN at a later point in time can trigger use-after-free. - * - * Only IO mappings and raw PFN mappings are allowed. The mmap semaphore - * should be taken for read. - * - * This function must not be used to modify PTE content. - * - * Return: zero on success, -ve otherwise. 
- */ -int follow_pte(struct vm_area_struct *vma, unsigned long address, - pte_t **ptepp, spinlock_t **ptlp) -{ - struct mm_struct *mm = vma->vm_mm; - pgd_t *pgd; - p4d_t *p4d; - pud_t *pud; - pmd_t *pmd; - pte_t *ptep; - - mmap_assert_locked(mm); - if (unlikely(address < vma->vm_start || address >= vma->vm_end)) - goto out; - - if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) - goto out; - - pgd = pgd_offset(mm, address); - if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd))) - goto out; - - p4d = p4d_offset(pgd, address); - if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d))) - goto out; - - pud = pud_offset(p4d, address); - if (pud_none(*pud) || unlikely(pud_bad(*pud))) - goto out; - - pmd = pmd_offset(pud, address); - VM_BUG_ON(pmd_trans_huge(*pmd)); - - ptep = pte_offset_map_lock(mm, pmd, address, ptlp); - if (!ptep) - goto out; - if (!pte_present(ptep_get(ptep))) - goto unlock; - *ptepp = ptep; - return 0; -unlock: - pte_unmap_unlock(ptep, *ptlp); -out: - return -EINVAL; -} -EXPORT_SYMBOL_GPL(follow_pte); - static inline void pfnmap_args_setup(struct follow_pfnmap_args *args, spinlock_t *lock, pte_t *ptep, pgprot_t pgprot, unsigned long pfn_base, From 75182022a0439788415b2dd1db3086e07aa506f7 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:51 -0400 Subject: [PATCH 397/416] mm/x86: support large pfn mappings Helpers to install and detect special pmd/pud entries. In short, bit 9 on x86 is not used for pmd/pud, so we can directly define them the same as the pte level. One note is that bit 9 is also used in _PAGE_BIT_CPA_TEST, but that is only used in the debug test and shouldn't conflict in this case. Another note is that pxx_set|clear_flags() for pmd/pud will need to be moved up so that they can be referenced by the new special bit helpers. There's no change in the code that was moved.
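As a rough illustration of how the generic huge pfnmap code is expected to use these helpers (only pmd_special()/pmd_mkspecial() come from this patch; the surrounding lines are a simplified sketch of a vmf_insert_pfn_pmd()-style caller):

        pmd_t entry = pmd_mkhuge(pfn_pmd(pfn, prot));

        /* Tag the huge mapping as special: no struct page backs it. */
        entry = pmd_mkspecial(entry);

        /* Later lookups can then skip normal-page handling: */
        if (pmd_special(entry))
                /* treat as a raw PFN mapping, do not touch struct page */;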
Link: https://lkml.kernel.org/r/20240826204353.2228736-18-peterx@redhat.com Signed-off-by: Peter Xu Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: Alexander Gordeev Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Catalin Marinas Cc: Christian Borntraeger Cc: David Hildenbrand Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Jason Gunthorpe Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable.h | 80 ++++++++++++++++++++++------------ 2 files changed, 53 insertions(+), 28 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index b74b9ee484da..d4dbe9717e96 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -28,6 +28,7 @@ config X86_64 select ARCH_HAS_GIGANTIC_PAGE select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 select ARCH_SUPPORTS_PER_VMA_LOCK + select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE select HAVE_ARCH_SOFT_DIRTY select MODULES_USE_ELF_RELA select NEED_DMA_MAP_STATE diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 8d12bfad6a1d..4c2d080d26b4 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -120,6 +120,34 @@ extern pmdval_t early_pmd_flags; #define arch_end_context_switch(prev) do {} while(0) #endif /* CONFIG_PARAVIRT_XXL */ +static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set) +{ + pmdval_t v = native_pmd_val(pmd); + + return native_make_pmd(v | set); +} + +static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear) +{ + pmdval_t v = native_pmd_val(pmd); + + return native_make_pmd(v & ~clear); +} + +static inline pud_t pud_set_flags(pud_t pud, pudval_t set) +{ + pudval_t v = native_pud_val(pud); + + return native_make_pud(v | set); +} + +static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear) +{ + pudval_t v = native_pud_val(pud); + + return native_make_pud(v & ~clear); +} + /* * The following only work if pte_present() is true. * Undefined behaviour if not.. 
@@ -317,6 +345,30 @@ static inline int pud_devmap(pud_t pud) } #endif +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP +static inline bool pmd_special(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_SPECIAL; +} + +static inline pmd_t pmd_mkspecial(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_SPECIAL); +} +#endif /* CONFIG_ARCH_SUPPORTS_PMD_PFNMAP */ + +#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP +static inline bool pud_special(pud_t pud) +{ + return pud_flags(pud) & _PAGE_SPECIAL; +} + +static inline pud_t pud_mkspecial(pud_t pud) +{ + return pud_set_flags(pud, _PAGE_SPECIAL); +} +#endif /* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */ + static inline int pgd_devmap(pgd_t pgd) { return 0; @@ -487,20 +539,6 @@ static inline pte_t pte_mkdevmap(pte_t pte) return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP); } -static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set) -{ - pmdval_t v = native_pmd_val(pmd); - - return native_make_pmd(v | set); -} - -static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear) -{ - pmdval_t v = native_pmd_val(pmd); - - return native_make_pmd(v & ~clear); -} - /* See comments above mksaveddirty_shift() */ static inline pmd_t pmd_mksaveddirty(pmd_t pmd) { @@ -595,20 +633,6 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd) pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma); #define pmd_mkwrite pmd_mkwrite -static inline pud_t pud_set_flags(pud_t pud, pudval_t set) -{ - pudval_t v = native_pud_val(pud); - - return native_make_pud(v | set); -} - -static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear) -{ - pudval_t v = native_pud_val(pud); - - return native_make_pud(v & ~clear); -} - /* See comments above mksaveddirty_shift() */ static inline pud_t pud_mksaveddirty(pud_t pud) { From 3e509c9b03f9abc7804c80bed266a6cc4286a5a8 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 26 Aug 2024 16:43:52 -0400 Subject: [PATCH 398/416] mm/arm64: support large pfn mappings Support huge pfnmaps by using bit 56 (PTE_SPECIAL) for "special" on pmds/puds. Provide the pmd/pud helpers to set/get the special bit. One more thing is missing for arm64: the pxx_pgprot() helpers for pmd/pud. Add them too; they are mostly the same as the pte version, obtained by dropping the pfn field. These helpers are essential for the new follow_pfnmap*() API to report valid pgprot_t results. Note that arm64 doesn't support huge PUDs yet, but it's still straightforward to provide the pud helpers that we need altogether. Only the PMD helpers will bring an immediate benefit until arm64 supports huge PUDs in general (e.g. in THPs).
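The pxx_pgprot() helpers use the same trick as arm64's existing pte_pgprot(): rebuilding an entry from its own pfn with an empty pgprot and XOR-ing it against the original cancels the pfn bits and leaves only the permission/attribute bits, roughly:

        /* sketch of the idea behind pmd_pgprot() as added below */
        pgprot_t prot = __pgprot(pmd_val(pfn_pmd(pmd_pfn(pmd), __pgprot(0))) ^ pmd_val(pmd));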
Link: https://lkml.kernel.org/r/20240826204353.2228736-19-peterx@redhat.com Signed-off-by: Peter Xu Cc: Catalin Marinas Cc: Will Deacon Cc: Alexander Gordeev Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Christian Borntraeger Cc: Dave Hansen Cc: David Hildenbrand Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/arm64/Kconfig | 1 + arch/arm64/include/asm/pgtable.h | 29 +++++++++++++++++++++++++++++ 2 files changed, 30 insertions(+) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 6494848019a0..6607ed8fdbb4 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -99,6 +99,7 @@ config ARM64 select ARCH_SUPPORTS_NUMA_BALANCING select ARCH_SUPPORTS_PAGE_TABLE_CHECK select ARCH_SUPPORTS_PER_VMA_LOCK + select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT select ARCH_WANT_DEFAULT_BPF_JIT diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index b78cc4a6758b..2faecc033a19 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -578,6 +578,14 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd) return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP))); } +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP +#define pmd_special(pte) (!!((pmd_val(pte) & PTE_SPECIAL))) +static inline pmd_t pmd_mkspecial(pmd_t pmd) +{ + return set_pmd_bit(pmd, __pgprot(PTE_SPECIAL)); +} +#endif + #define __pmd_to_phys(pmd) __pte_to_phys(pmd_pte(pmd)) #define __phys_to_pmd_val(phys) __phys_to_pte_val(phys) #define pmd_pfn(pmd) ((__pmd_to_phys(pmd) & PMD_MASK) >> PAGE_SHIFT) @@ -595,6 +603,27 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd) #define pud_pfn(pud) ((__pud_to_phys(pud) & PUD_MASK) >> PAGE_SHIFT) #define pfn_pud(pfn,prot) __pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot)) +#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP +#define pud_special(pte) pte_special(pud_pte(pud)) +#define pud_mkspecial(pte) pte_pud(pte_mkspecial(pud_pte(pud))) +#endif + +#define pmd_pgprot pmd_pgprot +static inline pgprot_t pmd_pgprot(pmd_t pmd) +{ + unsigned long pfn = pmd_pfn(pmd); + + return __pgprot(pmd_val(pfn_pmd(pfn, __pgprot(0))) ^ pmd_val(pmd)); +} + +#define pud_pgprot pud_pgprot +static inline pgprot_t pud_pgprot(pud_t pud) +{ + unsigned long pfn = pud_pfn(pud); + + return __pgprot(pud_val(pfn_pud(pfn, __pgprot(0))) ^ pud_val(pud)); +} + static inline void __set_pte_at(struct mm_struct *mm, unsigned long __always_unused addr, pte_t *ptep, pte_t pte, unsigned int nr) From f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101 Mon Sep 17 00:00:00 2001 From: Alex Williamson Date: Mon, 26 Aug 2024 16:43:53 -0400 Subject: [PATCH 399/416] vfio/pci: implement huge_fault support With the addition of pfnmap support in vmf_insert_pfn_{pmd,pud}() we can take advantage of PMD and PUD faults to PCI BAR mmaps and create more efficient mappings. PCI BARs are always a power of two and will typically get at least PMD alignment without userspace even trying. Userspace alignment for PUD mappings is also not too difficult. Consolidate faults through a single handler with a new wrapper for standard single page faults. 
The pre-faulting behavior of commit d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault") is removed in this refactoring since huge_fault will cover the bulk of the faults and results in more efficient page table usage. We also want to avoid that pre-faulted single page mappings preempt huge page mappings. Link: https://lkml.kernel.org/r/20240826204353.2228736-20-peterx@redhat.com Signed-off-by: Alex Williamson Signed-off-by: Peter Xu Cc: Alexander Gordeev Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Dave Hansen Cc: David Hildenbrand Cc: Gavin Shan Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Matthew Wilcox Cc: Niklas Schnelle Cc: Paolo Bonzini Cc: Ryan Roberts Cc: Sean Christopherson Cc: Sven Schnelle Cc: Thomas Gleixner Cc: Vasily Gorbik Cc: Will Deacon Cc: Zi Yan Signed-off-by: Andrew Morton --- drivers/vfio/pci/vfio_pci_core.c | 60 +++++++++++++++++++++++--------- 1 file changed, 43 insertions(+), 17 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index ba0ce0075b2f..2d7478e9a62d 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -1657,14 +1658,20 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma) return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff; } -static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf) +static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf, + unsigned int order) { struct vm_area_struct *vma = vmf->vma; struct vfio_pci_core_device *vdev = vma->vm_private_data; unsigned long pfn, pgoff = vmf->pgoff - vma->vm_pgoff; - unsigned long addr = vma->vm_start; vm_fault_t ret = VM_FAULT_SIGBUS; + if (order && (vmf->address & ((PAGE_SIZE << order) - 1) || + vmf->address + (PAGE_SIZE << order) > vma->vm_end)) { + ret = VM_FAULT_FALLBACK; + goto out; + } + pfn = vma_to_pfn(vma); down_read(&vdev->memory_lock); @@ -1672,30 +1679,49 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf) if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev)) goto out_unlock; - ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff); - if (ret & VM_FAULT_ERROR) - goto out_unlock; - - /* - * Pre-fault the remainder of the vma, abort further insertions and - * supress error if fault is encountered during pre-fault. 
- */ - for (; addr < vma->vm_end; addr += PAGE_SIZE, pfn++) { - if (addr == vmf->address) - continue; - - if (vmf_insert_pfn(vma, addr, pfn) & VM_FAULT_ERROR) - break; + switch (order) { + case 0: + ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff); + break; +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP + case PMD_ORDER: + ret = vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn + pgoff, + PFN_DEV), false); + break; +#endif +#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP + case PUD_ORDER: + ret = vmf_insert_pfn_pud(vmf, __pfn_to_pfn_t(pfn + pgoff, + PFN_DEV), false); + break; +#endif + default: + ret = VM_FAULT_FALLBACK; } out_unlock: up_read(&vdev->memory_lock); +out: + dev_dbg_ratelimited(&vdev->pdev->dev, + "%s(,order = %d) BAR %ld page offset 0x%lx: 0x%x\n", + __func__, order, + vma->vm_pgoff >> + (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT), + pgoff, (unsigned int)ret); return ret; } +static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf) +{ + return vfio_pci_mmap_huge_fault(vmf, 0); +} + static const struct vm_operations_struct vfio_pci_mmap_ops = { - .fault = vfio_pci_mmap_fault, + .fault = vfio_pci_mmap_page_fault, +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP + .huge_fault = vfio_pci_mmap_huge_fault, +#endif }; int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma) From 7a2369b74abf76cd3e54c45b30f6addb497f831b Mon Sep 17 00:00:00 2001 From: Yosry Ahmed Date: Wed, 4 Sep 2024 23:33:43 +0000 Subject: [PATCH 400/416] mm: z3fold: deprecate CONFIG_Z3FOLD The z3fold compressed pages allocator is rarely used, most users use zsmalloc. The only disadvantage of zsmalloc in comparison is the dependency on MMU, and zbud is a more common option for !MMU as it was the default zswap allocator for a long time. Historically, zsmalloc had worse latency than zbud and z3fold but offered better memory savings. This is no longer the case as shown by a simple recent analysis [1]. That analysis showed that z3fold does not have any advantage over zsmalloc or zbud considering both performance and memory usage. In a kernel build test on tmpfs in a limited cgroup, z3fold took 3% more time and used 1.8% more memory. The latency of zswap_load() was 7% higher, and that of zswap_store() was 10% higher. Zsmalloc is better in all metrics. Moreover, z3fold apparently has latent bugs, which was made noticeable by a recent soft lockup bug report with z3fold [2]. Switching to zsmalloc not only fixed the problem, but also reduced the swap usage from 6~8G to 1~2G. Other users have also reported being bitten by mistakenly enabling z3fold. Other than hurting users, z3fold is repeatedly causing wasted engineering effort. Apart from investigating the above bug, it came up in multiple development discussions (e.g. [3]) as something we need to handle, when there aren't any legit users (at least not intentionally). The natural course of action is to deprecate z3fold, and remove in a few cycles if no objections are raised from active users. Next on the list should be zbud, as it offers marginal latency gains at the cost of huge memory waste when compared to zsmalloc. That one will need to wait until zsmalloc does not depend on MMU. Rename the user-visible config option from CONFIG_Z3FOLD to CONFIG_Z3FOLD_DEPRECATED so that users with CONFIG_Z3FOLD=y get a new prompt with explanation during make oldconfig. Also, remove CONFIG_Z3FOLD=y from defconfigs. 
[1]https://lore.kernel.org/lkml/CAJD7tkbRF6od-2x_L8-A1QL3=2Ww13sCj4S3i4bNndqF+3+_Vg@mail.gmail.com/ [2]https://lore.kernel.org/lkml/EF0ABD3E-A239-4111-A8AB-5C442E759CF3@gmail.com/ [3]https://lore.kernel.org/lkml/CAJD7tkbnmeVugfunffSovJf9FAgy9rhBVt_tx=nxUveLUfqVsA@mail.gmail.com/ [arnd@arndb.de: deprecate ZSWAP_ZPOOL_DEFAULT_Z3FOLD as well] Link: https://lkml.kernel.org/r/20240909202625.1054880-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20240904233343.933462-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed Signed-off-by: Arnd Bergmann Acked-by: Chris Down Acked-by: Nhat Pham Acked-by: Johannes Weiner Acked-by: Vitaly Wool Acked-by: Christoph Hellwig Cc: Aneesh Kumar K.V Cc: Christophe Leroy Cc: Huacai Chen Cc: Miaohe Lin Cc: Michael Ellerman Cc: Naveen N. Rao Cc: Nicholas Piggin Cc: Sergey Senozhatsky Cc: WANG Xuerui Cc: Signed-off-by: Andrew Morton --- arch/loongarch/configs/loongson3_defconfig | 1 - arch/powerpc/configs/ppc64_defconfig | 1 - mm/Kconfig | 25 ++++++++++++++++------ 3 files changed, 19 insertions(+), 8 deletions(-) diff --git a/arch/loongarch/configs/loongson3_defconfig b/arch/loongarch/configs/loongson3_defconfig index b4252c357c8e..75b366407a60 100644 --- a/arch/loongarch/configs/loongson3_defconfig +++ b/arch/loongarch/configs/loongson3_defconfig @@ -96,7 +96,6 @@ CONFIG_ZPOOL=y CONFIG_ZSWAP=y CONFIG_ZSWAP_COMPRESSOR_DEFAULT_ZSTD=y CONFIG_ZBUD=y -CONFIG_Z3FOLD=y CONFIG_ZSMALLOC=m # CONFIG_COMPAT_BRK is not set CONFIG_MEMORY_HOTPLUG=y diff --git a/arch/powerpc/configs/ppc64_defconfig b/arch/powerpc/configs/ppc64_defconfig index 544a65fda77b..d39284489aa2 100644 --- a/arch/powerpc/configs/ppc64_defconfig +++ b/arch/powerpc/configs/ppc64_defconfig @@ -81,7 +81,6 @@ CONFIG_MODULE_SIG_SHA512=y CONFIG_PARTITION_ADVANCED=y CONFIG_BINFMT_MISC=m CONFIG_ZSWAP=y -CONFIG_Z3FOLD=y CONFIG_ZSMALLOC=y # CONFIG_SLAB_MERGE_DEFAULT is not set CONFIG_SLAB_FREELIST_RANDOM=y diff --git a/mm/Kconfig b/mm/Kconfig index 1aa282e35dc7..09aebca1cae3 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -146,12 +146,15 @@ config ZSWAP_ZPOOL_DEFAULT_ZBUD help Use the zbud allocator as the default allocator. -config ZSWAP_ZPOOL_DEFAULT_Z3FOLD - bool "z3fold" - select Z3FOLD +config ZSWAP_ZPOOL_DEFAULT_Z3FOLD_DEPRECATED + bool "z3foldi (DEPRECATED)" + select Z3FOLD_DEPRECATED help Use the z3fold allocator as the default allocator. + Deprecated and scheduled for removal in a few cycles, + see CONFIG_Z3FOLD_DEPRECATED. + config ZSWAP_ZPOOL_DEFAULT_ZSMALLOC bool "zsmalloc" select ZSMALLOC @@ -163,7 +166,7 @@ config ZSWAP_ZPOOL_DEFAULT string depends on ZSWAP default "zbud" if ZSWAP_ZPOOL_DEFAULT_ZBUD - default "z3fold" if ZSWAP_ZPOOL_DEFAULT_Z3FOLD + default "z3fold" if ZSWAP_ZPOOL_DEFAULT_Z3FOLD_DEPRECATED default "zsmalloc" if ZSWAP_ZPOOL_DEFAULT_ZSMALLOC default "" @@ -177,15 +180,25 @@ config ZBUD deterministic reclaim properties that make it preferable to a higher density approach when reclaim will be used. -config Z3FOLD - tristate "3:1 compression allocator (z3fold)" +config Z3FOLD_DEPRECATED + tristate "3:1 compression allocator (z3fold) (DEPRECATED)" depends on ZSWAP help + Deprecated and scheduled for removal in a few cycles. If you have + a good reason for using Z3FOLD over ZSMALLOC, please contact + linux-mm@kvack.org and the zswap maintainers. + A special purpose allocator for storing compressed pages. It is designed to store up to three compressed pages per physical page. It is a ZBUD derivative so the simplicity and determinism are still there. 
+config Z3FOLD + tristate + default y if Z3FOLD_DEPRECATED=y + default m if Z3FOLD_DEPRECATED=m + depends on Z3FOLD_DEPRECATED + config ZSMALLOC tristate prompt "N:1 compression allocator (zsmalloc)" if (ZSWAP || ZRAM) From bacf9c3cbb18f7e3f67521516d881892f46bbcef Mon Sep 17 00:00:00 2001 From: Huang Ying Date: Fri, 6 Sep 2024 11:07:12 +0800 Subject: [PATCH 401/416] resource: make alloc_free_mem_region() work for iomem_resource While developing a kunit test case for region_intersects(), some fake resources need to be inserted into iomem_resource. To do that, a resource hole needs to be found first in iomem_resource. However, alloc_free_mem_region() cannot work for iomem_resource at present, because the start address to check must not be 0 for gfr_continue() to detect wrapping past address 0, while iomem_resource.start == 0. To make alloc_free_mem_region() work for iomem_resource, gfr_start() is changed to avoid returning 0 even if base->start == 0. We don't need to check 0 as a start address. Link: https://lkml.kernel.org/r/20240906030713.204292-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" Cc: Dan Williams Cc: David Hildenbrand Cc: Davidlohr Bueso Cc: Jonathan Cameron Cc: Dave Jiang Cc: Alison Schofield Cc: Vishal Verma Cc: Ira Weiny Cc: Alistair Popple Cc: Andy Shevchenko Cc: Bjorn Helgaas Cc: Baoquan He Signed-off-by: Andrew Morton --- kernel/resource.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/resource.c b/kernel/resource.c index 9f747bb7cd03..5e76b98c9771 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -1830,7 +1830,7 @@ static resource_size_t gfr_start(struct resource *base, resource_size_t size, return end - size + 1; } - return ALIGN(base->start, align); + return ALIGN(max(base->start, align), align); } static bool gfr_continue(struct resource *base, resource_size_t addr, From 99185c10d5d9214d0d0c8b7866660203e344ee3b Mon Sep 17 00:00:00 2001 From: Huang Ying Date: Fri, 6 Sep 2024 11:07:13 +0800 Subject: [PATCH 402/416] resource, kunit: add test case for region_intersects() Patch series "resource: Fix region_intersects() vs add_memory_driver_managed()", v3. The patchset fixes a bug in region_intersects() for systems with CXL memory. The details of the bug can be found in [1/3]. To avoid similar bugs in the future, a kunit test case for region_intersects() is added in [3/3]. [2/3] is a preparation patch for [3/3]. This patch (of 3): region_intersects() is important because it's used for /dev/mem permission checking. To avoid possible bugs in region_intersects() in the future, a kunit test case for region_intersects() is added.
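For context, the API under test is typically consumed the way memremap() consumes it; a simplified sketch of such a caller (not part of this patch):

        int is_ram = region_intersects(offset, size,
                                       IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);

        if (is_ram == REGION_MIXED)
                /* range overlaps System RAM and other resources - bail out */;
        else if (is_ram == REGION_INTERSECTS)
                /* range overlaps System RAM */;
        else
                /* REGION_DISJOINT: no System RAM in the range */;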
Link: https://lkml.kernel.org/r/20240906030713.204292-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20240906030713.204292-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" Cc: Dan Williams Cc: David Hildenbrand Cc: Davidlohr Bueso Cc: Jonathan Cameron Cc: Dave Jiang Cc: Alison Schofield Cc: Vishal Verma Cc: Ira Weiny Cc: Alistair Popple Cc: Andy Shevchenko Cc: Bjorn Helgaas Cc: Baoquan He Signed-off-by: Andrew Morton --- kernel/resource.c | 20 ++++-- kernel/resource_kunit.c | 143 ++++++++++++++++++++++++++++++++++++++++ lib/Kconfig.debug | 1 + 3 files changed, 158 insertions(+), 6 deletions(-) diff --git a/kernel/resource.c b/kernel/resource.c index 5e76b98c9771..4c9cb0363f9a 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -1817,7 +1817,17 @@ EXPORT_SYMBOL(resource_list_free); #ifdef CONFIG_GET_FREE_REGION #define GFR_DESCENDING (1UL << 0) #define GFR_REQUEST_REGION (1UL << 1) -#define GFR_DEFAULT_ALIGN (1UL << PA_SECTION_SHIFT) +#ifdef PA_SECTION_SHIFT +#define GFR_DEFAULT_ALIGN (1UL << PA_SECTION_SHIFT) +#else +#define GFR_DEFAULT_ALIGN PAGE_SIZE +#endif + +#ifdef MAX_PHYSMEM_BITS +#define MAX_PHYS_ADDR ((1ULL << MAX_PHYSMEM_BITS) - 1) +#else +#define MAX_PHYS_ADDR (-1ULL) +#endif static resource_size_t gfr_start(struct resource *base, resource_size_t size, resource_size_t align, unsigned long flags) @@ -1825,8 +1835,7 @@ static resource_size_t gfr_start(struct resource *base, resource_size_t size, if (flags & GFR_DESCENDING) { resource_size_t end; - end = min_t(resource_size_t, base->end, - (1ULL << MAX_PHYSMEM_BITS) - 1); + end = min_t(resource_size_t, base->end, MAX_PHYS_ADDR); return end - size + 1; } @@ -1843,8 +1852,7 @@ static bool gfr_continue(struct resource *base, resource_size_t addr, * @size did not wrap 0. */ return addr > addr - size && - addr <= min_t(resource_size_t, base->end, - (1ULL << MAX_PHYSMEM_BITS) - 1); + addr <= min_t(resource_size_t, base->end, MAX_PHYS_ADDR); } static resource_size_t gfr_next(resource_size_t addr, resource_size_t size, @@ -2005,7 +2013,7 @@ struct resource *alloc_free_mem_region(struct resource *base, return get_free_mem_region(NULL, base, size, align, name, IORES_DESC_NONE, flags); } -EXPORT_SYMBOL_NS_GPL(alloc_free_mem_region, CXL); +EXPORT_SYMBOL_GPL(alloc_free_mem_region); #endif /* CONFIG_GET_FREE_REGION */ static int __init strict_iomem(char *str) diff --git a/kernel/resource_kunit.c b/kernel/resource_kunit.c index 0e509985a44a..42d2d8d20f5d 100644 --- a/kernel/resource_kunit.c +++ b/kernel/resource_kunit.c @@ -7,6 +7,8 @@ #include #include #include +#include +#include #define R0_START 0x0000 #define R0_END 0xffff @@ -137,9 +139,150 @@ static void resource_test_intersection(struct kunit *test) } while (++i < ARRAY_SIZE(results_for_intersection)); } +/* + * The test resource tree for region_intersects() test: + * + * BASE-BASE+1M-1 : Test System RAM 0 + * # hole 0 (BASE+1M-BASE+2M) + * BASE+2M-BASE+3M-1 : Test CXL Window 0 + * BASE+3M-BASE+4M-1 : Test System RAM 1 + * BASE+4M-BASE+7M-1 : Test CXL Window 1 + * BASE+4M-BASE+5M-1 : Test System RAM 2 + * BASE+4M+128K-BASE+4M+256K-1: Test Code + * BASE+5M-BASE+6M-1 : Test System RAM 3 + */ +#define RES_TEST_RAM0_OFFSET 0 +#define RES_TEST_RAM0_SIZE SZ_1M +#define RES_TEST_HOLE0_OFFSET (RES_TEST_RAM0_OFFSET + RES_TEST_RAM0_SIZE) +#define RES_TEST_HOLE0_SIZE SZ_1M +#define RES_TEST_WIN0_OFFSET (RES_TEST_HOLE0_OFFSET + RES_TEST_HOLE0_SIZE) +#define RES_TEST_WIN0_SIZE SZ_1M +#define RES_TEST_RAM1_OFFSET (RES_TEST_WIN0_OFFSET + RES_TEST_WIN0_SIZE) +#define RES_TEST_RAM1_SIZE 
SZ_1M +#define RES_TEST_WIN1_OFFSET (RES_TEST_RAM1_OFFSET + RES_TEST_RAM1_SIZE) +#define RES_TEST_WIN1_SIZE (SZ_1M * 3) +#define RES_TEST_RAM2_OFFSET RES_TEST_WIN1_OFFSET +#define RES_TEST_RAM2_SIZE SZ_1M +#define RES_TEST_CODE_OFFSET (RES_TEST_RAM2_OFFSET + SZ_128K) +#define RES_TEST_CODE_SIZE SZ_128K +#define RES_TEST_RAM3_OFFSET (RES_TEST_RAM2_OFFSET + RES_TEST_RAM2_SIZE) +#define RES_TEST_RAM3_SIZE SZ_1M +#define RES_TEST_TOTAL_SIZE ((RES_TEST_WIN1_OFFSET + RES_TEST_WIN1_SIZE)) + +static void remove_free_resource(void *ctx) +{ + struct resource *res = (struct resource *)ctx; + + remove_resource(res); + kfree(res); +} + +static void resource_test_request_region(struct kunit *test, struct resource *parent, + resource_size_t start, resource_size_t size, + const char *name, unsigned long flags) +{ + struct resource *res; + + res = __request_region(parent, start, size, name, flags); + KUNIT_ASSERT_NOT_NULL(test, res); + kunit_add_action_or_reset(test, remove_free_resource, res); +} + +static void resource_test_insert_resource(struct kunit *test, struct resource *parent, + resource_size_t start, resource_size_t size, + const char *name, unsigned long flags) +{ + struct resource *res; + + res = kzalloc(sizeof(*res), GFP_KERNEL); + KUNIT_ASSERT_NOT_NULL(test, res); + + res->name = name; + res->start = start; + res->end = start + size - 1; + res->flags = flags; + if (insert_resource(parent, res)) { + kfree(res); + KUNIT_FAIL_AND_ABORT(test, "Fail to insert resource %pR\n", res); + } + + kunit_add_action_or_reset(test, remove_free_resource, res); +} + +static void resource_test_region_intersects(struct kunit *test) +{ + unsigned long flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; + struct resource *parent; + resource_size_t start; + + /* Find an iomem_resource hole to hold test resources */ + parent = alloc_free_mem_region(&iomem_resource, RES_TEST_TOTAL_SIZE, SZ_1M, + "test resources"); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, parent); + start = parent->start; + kunit_add_action_or_reset(test, remove_free_resource, parent); + + resource_test_request_region(test, parent, start + RES_TEST_RAM0_OFFSET, + RES_TEST_RAM0_SIZE, "Test System RAM 0", flags); + resource_test_insert_resource(test, parent, start + RES_TEST_WIN0_OFFSET, + RES_TEST_WIN0_SIZE, "Test CXL Window 0", + IORESOURCE_MEM); + resource_test_request_region(test, parent, start + RES_TEST_RAM1_OFFSET, + RES_TEST_RAM1_SIZE, "Test System RAM 1", flags); + resource_test_insert_resource(test, parent, start + RES_TEST_WIN1_OFFSET, + RES_TEST_WIN1_SIZE, "Test CXL Window 1", + IORESOURCE_MEM); + resource_test_request_region(test, parent, start + RES_TEST_RAM2_OFFSET, + RES_TEST_RAM2_SIZE, "Test System RAM 2", flags); + resource_test_insert_resource(test, parent, start + RES_TEST_CODE_OFFSET, + RES_TEST_CODE_SIZE, "Test Code", flags); + resource_test_request_region(test, parent, start + RES_TEST_RAM3_OFFSET, + RES_TEST_RAM3_SIZE, "Test System RAM 3", flags); + kunit_release_action(test, remove_free_resource, parent); + + KUNIT_EXPECT_EQ(test, REGION_INTERSECTS, + region_intersects(start + RES_TEST_RAM0_OFFSET, PAGE_SIZE, + IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE)); + KUNIT_EXPECT_EQ(test, REGION_INTERSECTS, + region_intersects(start + RES_TEST_RAM0_OFFSET + + RES_TEST_RAM0_SIZE - PAGE_SIZE, 2 * PAGE_SIZE, + IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE)); + KUNIT_EXPECT_EQ(test, REGION_DISJOINT, + region_intersects(start + RES_TEST_HOLE0_OFFSET, PAGE_SIZE, + IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE)); + KUNIT_EXPECT_EQ(test, REGION_DISJOINT, + 
region_intersects(start + RES_TEST_HOLE0_OFFSET + + RES_TEST_HOLE0_SIZE - PAGE_SIZE, 2 * PAGE_SIZE, + IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE)); + KUNIT_EXPECT_EQ(test, REGION_MIXED, + region_intersects(start + RES_TEST_WIN0_OFFSET + + RES_TEST_WIN0_SIZE - PAGE_SIZE, 2 * PAGE_SIZE, + IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE)); + KUNIT_EXPECT_EQ(test, REGION_INTERSECTS, + region_intersects(start + RES_TEST_RAM1_OFFSET + + RES_TEST_RAM1_SIZE - PAGE_SIZE, 2 * PAGE_SIZE, + IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE)); + KUNIT_EXPECT_EQ(test, REGION_INTERSECTS, + region_intersects(start + RES_TEST_RAM2_OFFSET + + RES_TEST_RAM2_SIZE - PAGE_SIZE, 2 * PAGE_SIZE, + IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE)); + KUNIT_EXPECT_EQ(test, REGION_INTERSECTS, + region_intersects(start + RES_TEST_CODE_OFFSET, PAGE_SIZE, + IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE)); + KUNIT_EXPECT_EQ(test, REGION_INTERSECTS, + region_intersects(start + RES_TEST_RAM2_OFFSET, + RES_TEST_RAM2_SIZE + PAGE_SIZE, + IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE)); + KUNIT_EXPECT_EQ(test, REGION_MIXED, + region_intersects(start + RES_TEST_RAM3_OFFSET, + RES_TEST_RAM3_SIZE + PAGE_SIZE, + IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE)); +} + static struct kunit_case resource_test_cases[] = { KUNIT_CASE(resource_test_union), KUNIT_CASE(resource_test_intersection), + KUNIT_CASE(resource_test_region_intersects), {} }; diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index a30c03a66172..383453cbf4af 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2616,6 +2616,7 @@ config RESOURCE_KUNIT_TEST tristate "KUnit test for resource API" if !KUNIT_ALL_TESTS depends on KUNIT default KUNIT_ALL_TESTS + select GET_FREE_REGION help This builds the resource API unit test. Tests the logic of API provided by resource.c and ioport.h. From aa549f923f5e037a459dcd588932db9abfa8c158 Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Fri, 6 Sep 2024 10:42:00 +0800 Subject: [PATCH 403/416] mm: support poison recovery from do_cow_fault() Patch series "mm: hwpoison: two more poison recovery". One more CoW path to support poison recovery in do_cow_fault(), and the last copy_user_highpage() user is replaced with copy_mc_user_highpage() in copy_present_page() during fork to support poison recovery too. This patch (of 2): Like commit a873dfe1032a ("mm, hwpoison: try to recover from copy-on write faults"), there is another path which could crash because it does not have recovery code where poison is consumed by the kernel in do_cow_fault(). A crash calltrace from an old kernel is shown below, but the same crash can still happen in the latest mainline code, CPU: 7 PID: 3248 Comm: mpi Kdump: loaded Tainted: G OE 5.10.0 #1 pc : copy_page+0xc/0xbc lr : copy_user_highpage+0x50/0x9c Call trace: copy_page+0xc/0xbc do_cow_fault+0x118/0x2bc do_fault+0x40/0x1a4 handle_pte_fault+0x154/0x230 __handle_mm_fault+0x1a8/0x38c handle_mm_fault+0xf0/0x250 do_page_fault+0x184/0x454 do_translation_fault+0xac/0xd4 do_mem_abort+0x44/0xbc Fix it by using copy_mc_user_highpage() to handle this case and return VM_FAULT_HWPOISON for the cow fault.
[wangkefeng.wang@huawei.com: unlock/put vmf->page, per Miaohe] Link: https://lkml.kernel.org/r/20240910021541.234300-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240906024201.1214712-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240906024201.1214712-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Reviewed-by: Jane Chu Reviewed-by: Miaohe Lin Cc: David Hildenbrand Cc: Jiaqi Yan Cc: Naoya Horiguchi Cc: Tony Luck Signed-off-by: Andrew Morton --- mm/memory.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/mm/memory.c b/mm/memory.c index 42674c0748cb..64421ea5a403 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5089,10 +5089,14 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf) if (ret & VM_FAULT_DONE_COW) return ret; - copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma); + if (copy_mc_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma)) { + ret = VM_FAULT_HWPOISON; + goto unlock; + } __folio_mark_uptodate(folio); ret |= finish_fault(vmf); +unlock: unlock_page(vmf->page); put_page(vmf->page); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) From 658be46520ce480a44fe405730a1725166298f27 Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Fri, 6 Sep 2024 10:42:01 +0800 Subject: [PATCH 404/416] mm: support poison recovery from copy_present_page() Similar to other poison recovery, use copy_mc_user_highpage() to avoid potentially kernel panic during copy page in copy_present_page() from fork, once copy failed due to hwpoison in source page, we need to break out of copy in copy_pte_range() and release prealloc folio, so copy_mc_user_highpage() is moved ahead before set *prealloc to NULL. Link: https://lkml.kernel.org/r/20240906024201.1214712-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Reviewed-by: Jane Chu Reviewed-by: Miaohe Lin Cc: David Hildenbrand Cc: Jiaqi Yan Cc: Naoya Horiguchi Cc: Tony Luck Signed-off-by: Andrew Morton --- mm/memory.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 64421ea5a403..cca3d2907439 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -926,8 +926,11 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma * We have a prealloc page, all good! Take it * over and copy the page & arm it. */ + + if (copy_mc_user_highpage(&new_folio->page, page, addr, src_vma)) + return -EHWPOISON; + *prealloc = NULL; - copy_user_highpage(&new_folio->page, page, addr, src_vma); __folio_mark_uptodate(new_folio); folio_add_new_anon_rmap(new_folio, dst_vma, addr, RMAP_EXCLUSIVE); folio_add_lru_vma(new_folio, dst_vma); @@ -1166,8 +1169,9 @@ again: /* * If we need a pre-allocated page for this pte, drop the * locks, allocate, and try again. + * If copy failed due to hwpoison in source page, break out. */ - if (unlikely(ret == -EAGAIN)) + if (unlikely(ret == -EAGAIN || ret == -EHWPOISON)) break; if (unlikely(prealloc)) { /* @@ -1197,7 +1201,7 @@ again: goto out; } entry.val = 0; - } else if (ret == -EBUSY) { + } else if (ret == -EBUSY || unlikely(ret == -EHWPOISON)) { goto out; } else if (ret == -EAGAIN) { prealloc = folio_prealloc(src_mm, src_vma, addr, false); From fd00be9afa1d64c90ae20a1307da1bdb809b3d55 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Thu, 5 Sep 2024 20:53:37 -0400 Subject: [PATCH 405/416] mm/show_mem.c: report alloc tags in human readable units We already do this when reporting slab info - more consistent and more readable. 
Link: https://lkml.kernel.org/r/20240906005337.1220091-1-kent.overstreet@linux.dev Signed-off-by: Kent Overstreet Reviewed-by: Suren Baghdasaryan Signed-off-by: Andrew Morton --- mm/show_mem.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/mm/show_mem.c b/mm/show_mem.c index bdb439551eef..ec885a398fa0 100644 --- a/mm/show_mem.c +++ b/mm/show_mem.c @@ -435,15 +435,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx) struct codetag *ct = tags[i].ct; struct alloc_tag *tag = ct_to_alloc_tag(ct); struct alloc_tag_counters counter = alloc_tag_read(tag); + char bytes[10]; + + string_get_size(counter.bytes, 1, STRING_UNITS_2, bytes, sizeof(bytes)); /* Same as alloc_tag_to_text() but w/o intermediate buffer */ if (ct->modname) - pr_notice("%12lli %8llu %s:%u [%s] func:%s\n", - counter.bytes, counter.calls, ct->filename, + pr_notice("%12s %8llu %s:%u [%s] func:%s\n", + bytes, counter.calls, ct->filename, ct->lineno, ct->modname, ct->function); else - pr_notice("%12lli %8llu %s:%u func:%s\n", - counter.bytes, counter.calls, ct->filename, + pr_notice("%12s %8llu %s:%u func:%s\n", + bytes, counter.calls, ct->filename, ct->lineno, ct->function); } } From f2c5101be43677c227974912a043da29a62743ef Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Michal=20Koutn=C3=BD?= Date: Mon, 9 Sep 2024 18:32:20 +0200 Subject: [PATCH 406/416] memcg: cleanup with !CONFIG_MEMCG_V1 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Extern declarations have no definitions with !CONFIG_MEMCG_V1 and no users, drop them altogether. Link: https://lkml.kernel.org/r/20240909163223.3693529-1-mkoutny@suse.com Link: https://lkml.kernel.org/r/20240909163223.3693529-2-mkoutny@suse.com Signed-off-by: Michal Koutný Acked-by: Shakeel Butt Acked-by: Tejun Heo Cc: Chen Ridong Cc: Johannes Weiner Cc: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Zefan Li Cc: Waiman Long Signed-off-by: Andrew Morton --- mm/memcontrol-v1.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h index 3bb8b3030e61..c0672e25bcdb 100644 --- a/mm/memcontrol-v1.h +++ b/mm/memcontrol-v1.h @@ -154,8 +154,6 @@ static inline bool memcg1_charge_skmem(struct mem_cgroup *memcg, unsigned int nr gfp_t gfp_mask) { return true; } static inline void memcg1_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) {} -extern struct cftype memsw_files[]; -extern struct cftype mem_cgroup_legacy_files[]; #endif /* CONFIG_MEMCG_V1 */ #endif /* __MM_MEMCONTROL_V1_H */ From 659c55ef981bb63355a65ffc3b3b5cad562b806a Mon Sep 17 00:00:00 2001 From: Xiao Yang Date: Mon, 9 Sep 2024 21:56:21 +0900 Subject: [PATCH 407/416] mm/vma: return the exact errno in vms_gather_munmap_vmas() __split_vma() and mas_store_gfp() returns several types of errno on failure so don't ignore them in vms_gather_munmap_vmas(). For example, __split_vma() returns -EINVAL when an unaligned huge page is unmapped. This issue is reproduced by ltp memfd_create03 test. Don't initialise the error variable and assign it when a failure actually occurs. [akpm@linux-foundation.org: fix whitespace, per Liam] Link: https://lkml.kernel.org/r/20240909125621.1994-1-ice_yangxiao@163.com Fixes: 6898c9039bc8 ("mm/vma: extract the gathering of vmas from do_vmi_align_munmap()") Signed-off-by: Xiao Yang Suggested-by: Lorenzo Stoakes Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202409081536.d283a0fb-oliver.sang@intel.com Cc: "Liam R. 
Howlett" Signed-off-by: Andrew Morton --- mm/vma.c | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index 8d1686fc8d5a..4737afcb064c 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -1171,13 +1171,13 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, * @vms: The vma munmap struct * @mas_detach: The maple state tracking the detached tree * - * Return: 0 on success, -EPERM on mseal vmas, -ENOMEM otherwise + * Return: 0 on success, error otherwise */ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, struct ma_state *mas_detach) { struct vm_area_struct *next = NULL; - int error = -ENOMEM; + int error; /* * If we need to split any vma, do it now to save pain later. @@ -1191,8 +1191,10 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, * its limit temporarily, to help free resources as expected. */ if (vms->end < vms->vma->vm_end && - vms->vma->vm_mm->map_count >= sysctl_max_map_count) + vms->vma->vm_mm->map_count >= sysctl_max_map_count) { + error = -ENOMEM; goto map_count_exceeded; + } /* Don't bother splitting the VMA if we can't unmap it anyway */ if (!can_modify_vma(vms->vma)) { @@ -1200,7 +1202,8 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, goto start_split_failed; } - if (__split_vma(vms->vmi, vms->vma, vms->start, 1)) + error = __split_vma(vms->vmi, vms->vma, vms->start, 1); + if (error) goto start_split_failed; } vms->prev = vma_prev(vms->vmi); @@ -1220,12 +1223,14 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, } /* Does it split the end? */ if (next->vm_end > vms->end) { - if (__split_vma(vms->vmi, next, vms->end, 0)) + error = __split_vma(vms->vmi, next, vms->end, 0); + if (error) goto end_split_failed; } vma_start_write(next); mas_set(mas_detach, vms->vma_count++); - if (mas_store_gfp(mas_detach, next, GFP_KERNEL)) + error = mas_store_gfp(mas_detach, next, GFP_KERNEL); + if (error) goto munmap_gather_failed; vma_mark_detached(next, true); @@ -1255,8 +1260,9 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, * split, despite we could. This is unlikely enough * failure that it's not worth optimizing it for. */ - if (userfaultfd_unmap_prep(next, vms->start, vms->end, - vms->uf)) + error = userfaultfd_unmap_prep(next, vms->start, + vms->end, vms->uf); + if (error) goto userfaultfd_error; } #ifdef CONFIG_DEBUG_VM_MAPLE_TREE From 82ce8e2f31a1eb05b1527c3d807bea40031df913 Mon Sep 17 00:00:00 2001 From: Christophe Leroy Date: Sat, 7 Sep 2024 17:40:42 +0200 Subject: [PATCH 408/416] set_memory: add __must_check to generic stubs Following query shows that architectures that don't provide asm/set_memory.h don't use set_memory_...() functions. $ git grep set_memory_ alpha arc csky hexagon loongarch m68k microblaze mips nios2 openrisc parisc sh sparc um xtensa Following query shows that all core users of set_memory_...() functions always take returned value into account: $ git grep -w -e set_memory_ro -e set_memory_rw -e set_memory_x -e set_memory_nx -e set_memory_rox `find . -maxdepth 1 -type d | grep -v arch | grep /` set_memory_...() functions can fail, leaving the memory attributes unchanged. Make sure all callers check the returned code. 
Link: https://github.com/KSPP/linux/issues/7 Link: https://lkml.kernel.org/r/6a89ffc69666de84721216947c6b6c7dcca39d7d.1725723347.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy Cc: Arnd Bergmann Cc: Kees Cook Signed-off-by: Andrew Morton --- include/linux/set_memory.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h index 95ac8398ee72..e7aec20fb44f 100644 --- a/include/linux/set_memory.h +++ b/include/linux/set_memory.h @@ -8,10 +8,10 @@ #ifdef CONFIG_ARCH_HAS_SET_MEMORY #include #else -static inline int set_memory_ro(unsigned long addr, int numpages) { return 0; } -static inline int set_memory_rw(unsigned long addr, int numpages) { return 0; } -static inline int set_memory_x(unsigned long addr, int numpages) { return 0; } -static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; } +static inline int __must_check set_memory_ro(unsigned long addr, int numpages) { return 0; } +static inline int __must_check set_memory_rw(unsigned long addr, int numpages) { return 0; } +static inline int __must_check set_memory_x(unsigned long addr, int numpages) { return 0; } +static inline int __must_check set_memory_nx(unsigned long addr, int numpages) { return 0; } #endif #ifndef set_memory_rox From a0c9fd22e312caa5566d4f3924e37d8158b997cc Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Tue, 10 Sep 2024 17:27:46 +0530 Subject: [PATCH 409/416] mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries This replaces all the existing READ_ONCE() based page table accesses with the respective pxdp_get() helpers. Although these helpers may also fall back to READ_ONCE() by default, they do provide an opportunity for various platforms to override them when required. This change is a step in the direction of replacing all page table entry accesses with the respective pxdp_get() helpers.
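For reference, the generic fallbacks in include/linux/pgtable.h are plain READ_ONCE() wrappers, so this conversion is behaviour-neutral on architectures that do not override them; roughly:

        #ifndef pmdp_get
        static inline pmd_t pmdp_get(pmd_t *pmdp)
        {
                return READ_ONCE(*pmdp);
        }
        #endif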
Link: https://lkml.kernel.org/r/20240910115746.514454-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual Acked-by: David Hildenbrand Cc: Ryan Roberts Signed-off-by: Andrew Morton --- mm/debug_vm_pgtable.c | 50 +++++++++++++++++++++---------------------- 1 file changed, 25 insertions(+), 25 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index e4969fb54da3..bc748f700a9e 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -231,10 +231,10 @@ static void __init pmd_advanced_tests(struct pgtable_debug_args *args) set_pmd_at(args->mm, vaddr, args->pmdp, pmd); flush_dcache_page(page); pmdp_set_wrprotect(args->mm, vaddr, args->pmdp); - pmd = READ_ONCE(*args->pmdp); + pmd = pmdp_get(args->pmdp); WARN_ON(pmd_write(pmd)); pmdp_huge_get_and_clear(args->mm, vaddr, args->pmdp); - pmd = READ_ONCE(*args->pmdp); + pmd = pmdp_get(args->pmdp); WARN_ON(!pmd_none(pmd)); pmd = pfn_pmd(args->pmd_pfn, args->page_prot); @@ -245,10 +245,10 @@ static void __init pmd_advanced_tests(struct pgtable_debug_args *args) pmd = pmd_mkwrite(pmd, args->vma); pmd = pmd_mkdirty(pmd); pmdp_set_access_flags(args->vma, vaddr, args->pmdp, pmd, 1); - pmd = READ_ONCE(*args->pmdp); + pmd = pmdp_get(args->pmdp); WARN_ON(!(pmd_write(pmd) && pmd_dirty(pmd))); pmdp_huge_get_and_clear_full(args->vma, vaddr, args->pmdp, 1); - pmd = READ_ONCE(*args->pmdp); + pmd = pmdp_get(args->pmdp); WARN_ON(!pmd_none(pmd)); pmd = pmd_mkhuge(pfn_pmd(args->pmd_pfn, args->page_prot)); @@ -256,7 +256,7 @@ static void __init pmd_advanced_tests(struct pgtable_debug_args *args) set_pmd_at(args->mm, vaddr, args->pmdp, pmd); flush_dcache_page(page); pmdp_test_and_clear_young(args->vma, vaddr, args->pmdp); - pmd = READ_ONCE(*args->pmdp); + pmd = pmdp_get(args->pmdp); WARN_ON(pmd_young(pmd)); /* Clear the pte entries */ @@ -357,12 +357,12 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args) set_pud_at(args->mm, vaddr, args->pudp, pud); flush_dcache_page(page); pudp_set_wrprotect(args->mm, vaddr, args->pudp); - pud = READ_ONCE(*args->pudp); + pud = pudp_get(args->pudp); WARN_ON(pud_write(pud)); #ifndef __PAGETABLE_PMD_FOLDED pudp_huge_get_and_clear(args->mm, vaddr, args->pudp); - pud = READ_ONCE(*args->pudp); + pud = pudp_get(args->pudp); WARN_ON(!pud_none(pud)); #endif /* __PAGETABLE_PMD_FOLDED */ pud = pfn_pud(args->pud_pfn, args->page_prot); @@ -374,12 +374,12 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args) pud = pud_mkwrite(pud); pud = pud_mkdirty(pud); pudp_set_access_flags(args->vma, vaddr, args->pudp, pud, 1); - pud = READ_ONCE(*args->pudp); + pud = pudp_get(args->pudp); WARN_ON(!(pud_write(pud) && pud_dirty(pud))); #ifndef __PAGETABLE_PMD_FOLDED pudp_huge_get_and_clear_full(args->vma, vaddr, args->pudp, 1); - pud = READ_ONCE(*args->pudp); + pud = pudp_get(args->pudp); WARN_ON(!pud_none(pud)); #endif /* __PAGETABLE_PMD_FOLDED */ @@ -389,7 +389,7 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args) set_pud_at(args->mm, vaddr, args->pudp, pud); flush_dcache_page(page); pudp_test_and_clear_young(args->vma, vaddr, args->pudp); - pud = READ_ONCE(*args->pudp); + pud = pudp_get(args->pudp); WARN_ON(pud_young(pud)); pudp_huge_get_and_clear(args->mm, vaddr, args->pudp); @@ -441,7 +441,7 @@ static void __init pmd_huge_tests(struct pgtable_debug_args *args) WRITE_ONCE(*args->pmdp, __pmd(0)); WARN_ON(!pmd_set_huge(args->pmdp, __pfn_to_phys(args->fixed_pmd_pfn), args->page_prot)); WARN_ON(!pmd_clear_huge(args->pmdp)); - pmd = READ_ONCE(*args->pmdp); + pmd = 
pmdp_get(args->pmdp); WARN_ON(!pmd_none(pmd)); } @@ -461,7 +461,7 @@ static void __init pud_huge_tests(struct pgtable_debug_args *args) WRITE_ONCE(*args->pudp, __pud(0)); WARN_ON(!pud_set_huge(args->pudp, __pfn_to_phys(args->fixed_pud_pfn), args->page_prot)); WARN_ON(!pud_clear_huge(args->pudp)); - pud = READ_ONCE(*args->pudp); + pud = pudp_get(args->pudp); WARN_ON(!pud_none(pud)); } #else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ @@ -490,7 +490,7 @@ static void __init pgd_basic_tests(struct pgtable_debug_args *args) #ifndef __PAGETABLE_PUD_FOLDED static void __init pud_clear_tests(struct pgtable_debug_args *args) { - pud_t pud = READ_ONCE(*args->pudp); + pud_t pud = pudp_get(args->pudp); if (mm_pmd_folded(args->mm)) return; @@ -498,7 +498,7 @@ static void __init pud_clear_tests(struct pgtable_debug_args *args) pr_debug("Validating PUD clear\n"); WARN_ON(pud_none(pud)); pud_clear(args->pudp); - pud = READ_ONCE(*args->pudp); + pud = pudp_get(args->pudp); WARN_ON(!pud_none(pud)); } @@ -515,7 +515,7 @@ static void __init pud_populate_tests(struct pgtable_debug_args *args) * Hence this must not qualify as pud_bad(). */ pud_populate(args->mm, args->pudp, args->start_pmdp); - pud = READ_ONCE(*args->pudp); + pud = pudp_get(args->pudp); WARN_ON(pud_bad(pud)); } #else /* !__PAGETABLE_PUD_FOLDED */ @@ -526,7 +526,7 @@ static void __init pud_populate_tests(struct pgtable_debug_args *args) { } #ifndef __PAGETABLE_P4D_FOLDED static void __init p4d_clear_tests(struct pgtable_debug_args *args) { - p4d_t p4d = READ_ONCE(*args->p4dp); + p4d_t p4d = p4dp_get(args->p4dp); if (mm_pud_folded(args->mm)) return; @@ -534,7 +534,7 @@ static void __init p4d_clear_tests(struct pgtable_debug_args *args) pr_debug("Validating P4D clear\n"); WARN_ON(p4d_none(p4d)); p4d_clear(args->p4dp); - p4d = READ_ONCE(*args->p4dp); + p4d = p4dp_get(args->p4dp); WARN_ON(!p4d_none(p4d)); } @@ -553,13 +553,13 @@ static void __init p4d_populate_tests(struct pgtable_debug_args *args) pud_clear(args->pudp); p4d_clear(args->p4dp); p4d_populate(args->mm, args->p4dp, args->start_pudp); - p4d = READ_ONCE(*args->p4dp); + p4d = p4dp_get(args->p4dp); WARN_ON(p4d_bad(p4d)); } static void __init pgd_clear_tests(struct pgtable_debug_args *args) { - pgd_t pgd = READ_ONCE(*(args->pgdp)); + pgd_t pgd = pgdp_get(args->pgdp); if (mm_p4d_folded(args->mm)) return; @@ -567,7 +567,7 @@ static void __init pgd_clear_tests(struct pgtable_debug_args *args) pr_debug("Validating PGD clear\n"); WARN_ON(pgd_none(pgd)); pgd_clear(args->pgdp); - pgd = READ_ONCE(*args->pgdp); + pgd = pgdp_get(args->pgdp); WARN_ON(!pgd_none(pgd)); } @@ -586,7 +586,7 @@ static void __init pgd_populate_tests(struct pgtable_debug_args *args) p4d_clear(args->p4dp); pgd_clear(args->pgdp); pgd_populate(args->mm, args->pgdp, args->start_p4dp); - pgd = READ_ONCE(*args->pgdp); + pgd = pgdp_get(args->pgdp); WARN_ON(pgd_bad(pgd)); } #else /* !__PAGETABLE_P4D_FOLDED */ @@ -627,12 +627,12 @@ static void __init pte_clear_tests(struct pgtable_debug_args *args) static void __init pmd_clear_tests(struct pgtable_debug_args *args) { - pmd_t pmd = READ_ONCE(*args->pmdp); + pmd_t pmd = pmdp_get(args->pmdp); pr_debug("Validating PMD clear\n"); WARN_ON(pmd_none(pmd)); pmd_clear(args->pmdp); - pmd = READ_ONCE(*args->pmdp); + pmd = pmdp_get(args->pmdp); WARN_ON(!pmd_none(pmd)); } @@ -646,7 +646,7 @@ static void __init pmd_populate_tests(struct pgtable_debug_args *args) * Hence this must not qualify as pmd_bad(). 
*/ pmd_populate(args->mm, args->pmdp, args->start_ptep); - pmd = READ_ONCE(*args->pmdp); + pmd = pmdp_get(args->pmdp); WARN_ON(pmd_bad(pmd)); } @@ -1251,7 +1251,7 @@ static int __init init_args(struct pgtable_debug_args *args) ret = -ENOMEM; goto error; } - args->start_ptep = pmd_pgtable(READ_ONCE(*args->pmdp)); + args->start_ptep = pmd_pgtable(pmdp_get(args->pmdp)); WARN_ON(!args->start_ptep); init_fixed_pfns(args); From 9d57090e73d5e00e946d7fdd6398c2c0bc3b5525 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Mon, 9 Sep 2024 11:21:17 +1200 Subject: [PATCH 410/416] mm: fix swap_read_folio_zeromap() for large folios with partial zeromap Patch series "mm: enable large folios swap-in support", v9. Currently, we support mTHP swapout but not swapin. This means that once mTHP is swapped out, it will come back as small folios when swapped in. This is particularly detrimental for devices like Android, where more than half of the memory is in swap. The lack of mTHP swapin functionality makes mTHP a showstopper in scenarios that heavily rely on swap. This patchset introduces mTHP swap-in support. It starts with synchronous devices similar to zRAM, aiming to benefit as many users as possible with minimal changes. This patch (of 3): There could be a corner case where the first entry is non-zeromap, but a subsequent entry is zeromap. In this case, we should not let swap_read_folio_zeromap() return false since we will still read corrupted data. Additionally, the iteration of test_bit() is unnecessary and can be replaced with bitmap operations, which are more efficient. We can adopt the style of swap_pte_batch() and folio_pte_batch() to introduce swap_zeromap_batch() which seems to provide the greatest flexibility for the caller. This approach allows the caller to either check if the zeromap status of all entries is consistent or determine the number of contiguous entries with the same status. Since swap_read_folio() can't handle reading a large folio that's partially zeromap and partially non-zeromap, we've moved the code to mm/swap.h so that others, like those working on swap-in, can access it. Link: https://lkml.kernel.org/r/20240908232119.2157-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240908232119.2157-2-21cnbao@gmail.com Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap") Signed-off-by: Barry Song Reviewed-by: Yosry Ahmed Reviewed-by: Usama Arif Cc: Baolin Wang Cc: Chris Li Cc: Christoph Hellwig Cc: Chuanhua Han Cc: David Hildenbrand Cc: Gao Xiang Cc: Huang Ying Cc: Hugh Dickins Cc: Johannes Weiner Cc: Kairui Song Cc: Kairui Song Cc: Kalesh Singh Cc: Matthew Wilcox Cc: Michal Hocko Cc: Minchan Kim Cc: Nhat Pham Cc: Ryan Roberts Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Yang Shi Cc: Kanchana P Sridhar Signed-off-by: Andrew Morton --- mm/page_io.c | 32 +++++++------------------------- mm/swap.h | 33 +++++++++++++++++++++++++++++++++ 2 files changed, 40 insertions(+), 25 deletions(-) diff --git a/mm/page_io.c b/mm/page_io.c index b6f1519d63b0..78bc88acee79 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -226,26 +226,6 @@ static void swap_zeromap_folio_clear(struct folio *folio) } } -/* - * Return the index of the first subpage which is not zero-filled - * according to swap_info_struct->zeromap. - * If all pages are zero-filled according to zeromap, it will return - * folio_nr_pages(folio). 
- */ -static unsigned int swap_zeromap_folio_test(struct folio *folio) -{ - struct swap_info_struct *sis = swp_swap_info(folio->swap); - swp_entry_t entry; - unsigned int i; - - for (i = 0; i < folio_nr_pages(folio); i++) { - entry = page_swap_entry(folio_page(folio, i)); - if (!test_bit(swp_offset(entry), sis->zeromap)) - return i; - } - return i; -} - /* * We may have stale swap cache pages in memory: notice * them here and get rid of the unnecessary final write. @@ -522,19 +502,21 @@ static void sio_read_complete(struct kiocb *iocb, long ret) static bool swap_read_folio_zeromap(struct folio *folio) { - unsigned int idx = swap_zeromap_folio_test(folio); - - if (idx == 0) - return false; + int nr_pages = folio_nr_pages(folio); + bool is_zeromap; /* * Swapping in a large folio that is partially in the zeromap is not * currently handled. Return true without marking the folio uptodate so * that an IO error is emitted (e.g. do_swap_page() will sigbus). */ - if (WARN_ON_ONCE(idx < folio_nr_pages(folio))) + if (WARN_ON_ONCE(swap_zeromap_batch(folio->swap, nr_pages, + &is_zeromap) != nr_pages)) return true; + if (!is_zeromap) + return false; + folio_zero_range(folio, 0, folio_size(folio)); folio_mark_uptodate(folio); return true; diff --git a/mm/swap.h b/mm/swap.h index f8711ff82f84..ad2f121de970 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -80,6 +80,32 @@ static inline unsigned int folio_swap_flags(struct folio *folio) { return swp_swap_info(folio->swap)->flags; } + +/* + * Return the count of contiguous swap entries that share the same + * zeromap status as the starting entry. If is_zeromap is not NULL, + * it will return the zeromap status of the starting entry. + */ +static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr, + bool *is_zeromap) +{ + struct swap_info_struct *sis = swp_swap_info(entry); + unsigned long start = swp_offset(entry); + unsigned long end = start + max_nr; + bool first_bit; + + first_bit = test_bit(start, sis->zeromap); + if (is_zeromap) + *is_zeromap = first_bit; + + if (max_nr <= 1) + return max_nr; + if (first_bit) + return find_next_zero_bit(sis->zeromap, end, start) - start; + else + return find_next_bit(sis->zeromap, end, start) - start; +} + #else /* CONFIG_SWAP */ struct swap_iocb; static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug) @@ -171,6 +197,13 @@ static inline unsigned int folio_swap_flags(struct folio *folio) { return 0; } + +static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr, + bool *has_zeromap) +{ + return 0; +} + #endif /* CONFIG_SWAP */ #endif /* _MM_SWAP_H */ From 325efb16da2c840e165d9b620fec8049d4d664cc Mon Sep 17 00:00:00 2001 From: Barry Song Date: Mon, 9 Sep 2024 11:21:18 +1200 Subject: [PATCH 411/416] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios With large folios swap-in, we might need to uncharge multiple entries all together, add nr argument in mem_cgroup_swapin_uncharge_swap(). For the existing two users, just pass nr=1. 
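As a rough usage sketch (illustrative only, not part of this patch): order-0 swap-in keeps passing 1, while a large folio swap-in path would pass the folio's page count:

	/* order-0 swap-in, as in the existing callers */
	mem_cgroup_swapin_uncharge_swap(entry, 1);

	/* large folio swap-in, as introduced by the next patch in this series */
	mem_cgroup_swapin_uncharge_swap(entry, folio_nr_pages(folio));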
Link: https://lkml.kernel.org/r/20240908232119.2157-3-21cnbao@gmail.com Signed-off-by: Barry Song Acked-by: Chris Li Reviewed-by: Yosry Ahmed Cc: Shakeel Butt Cc: Baolin Wang Cc: Christoph Hellwig Cc: David Hildenbrand Cc: Gao Xiang Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Johannes Weiner Cc: Kairui Song Cc: Kairui Song Cc: Kalesh Singh Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Minchan Kim Cc: Nhat Pham Cc: Ryan Roberts Cc: Sergey Senozhatsky Cc: Suren Baghdasaryan Cc: Yang Shi Cc: Chuanhua Han Cc: Kanchana P Sridhar Cc: Usama Arif Signed-off-by: Andrew Morton --- include/linux/memcontrol.h | 5 +++-- mm/memcontrol.c | 7 ++++--- mm/memory.c | 2 +- mm/swap_state.c | 2 +- 4 files changed, 9 insertions(+), 7 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 2ef94c74847d..34d2da05f2f1 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -699,7 +699,8 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp, int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry); -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); + +void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages); void __mem_cgroup_uncharge(struct folio *folio); @@ -1206,7 +1207,7 @@ static inline int mem_cgroup_swapin_charge_folio(struct folio *folio, return 0; } -static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) +static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr) { } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d632650c6a2b..ae9a9d7968cb 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4559,14 +4559,15 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, /* * mem_cgroup_swapin_uncharge_swap - uncharge swap slot - * @entry: swap entry for which the page is charged + * @entry: the first swap entry for which the pages are charged + * @nr_pages: number of pages which will be uncharged * * Call this function after successfully adding the charged page to swapcache. * * Note: This function assumes the page for which swap slot is being uncharged * is order 0 page. */ -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) +void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) { /* * Cgroup1's unified memory+swap counter has been charged with the @@ -4586,7 +4587,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) * let's not wait for it. The page already received a * memory+swap charge, drop the swap entry duplicate. 
*/ - mem_cgroup_uncharge_swap(entry, 1); + mem_cgroup_uncharge_swap(entry, nr_pages); } } diff --git a/mm/memory.c b/mm/memory.c index cca3d2907439..9bcc440dd56e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4104,7 +4104,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) ret = VM_FAULT_OOM; goto out_page; } - mem_cgroup_swapin_uncharge_swap(entry); + mem_cgroup_swapin_uncharge_swap(entry, 1); shadow = get_shadow_from_swap_cache(entry); if (shadow) diff --git a/mm/swap_state.c b/mm/swap_state.c index a042720554a7..4669f29cf555 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -522,7 +522,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) goto fail_unlock; - mem_cgroup_swapin_uncharge_swap(entry); + mem_cgroup_swapin_uncharge_swap(entry, 1); if (shadow) workingset_refault(new_folio, shadow); From 242d12c98174584a18965cfab95778893872d650 Mon Sep 17 00:00:00 2001 From: Chuanhua Han Date: Mon, 9 Sep 2024 11:21:19 +1200 Subject: [PATCH 412/416] mm: support large folios swap-in for sync io devices Currently, we have mTHP features, but unfortunately, without support for large folio swap-ins, once these large folios are swapped out, they are lost because mTHP swap is a one-way process. The lack of mTHP swap-in functionality prevents mTHP from being used on devices like Android that heavily rely on swap. This patch introduces mTHP swap-in support. It starts from sync devices such as zRAM. This is probably the simplest and most common use case, benefiting billions of Android phones and similar devices with minimal implementation cost. In this straightforward scenario, large folios are always exclusive, eliminating the need to handle complex rmap and swapcache issues. It offers several benefits: 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after swap-out and swap-in. Large folios in the buddy system are also preserved as much as possible, rather than being fragmented due to swap-in. 2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT. 
w/o this patch (Refer to the data from Chris's and Kairui's latest swap allocator optimization while running ./thp_swap_allocator_test w/o "-a" option [1]): ./thp_swap_allocator_test Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53% Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58% Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34% Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51% Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84% Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91% Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05% Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25% Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74% Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01% Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45% Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98% Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64% Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36% Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02% Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07% w/ this patch (always 0%): Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00% Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00% ... 3. With both mTHP swap-out and swap-in supported, we offer the option to enable zsmalloc compression/decompression with larger granularity[2]. The upcoming optimization in zsmalloc will significantly increase swap speed and improve compression efficiency. Tested by running 100 iterations of swapping 100MiB of anon memory, the swap speed improved dramatically: time consumption of swapin(ms) time consumption of swapout(ms) lz4 4k 45274 90540 lz4 64k 22942 55667 zstdn 4k 85035 186585 zstdn 64k 46558 118533 The compression ratio also improved, as evaluated with 1 GiB of data: granularity orig_data_size compr_data_size 4KiB-zstd 1048576000 246876055 64KiB-zstd 1048576000 199763892 Without mTHP swap-in, the potential optimizations in zsmalloc cannot be realized. 4. Even mTHP swap-in itself can reduce swap-in page faults by a factor of nr_pages. Swapping in content filled with the same data 0x11, w/o and w/ the patch for five rounds (Since the content is the same, decompression will be very fast. 
This primarily assesses the impact of reduced page faults): swp in bandwidth(bytes/ms) w/o w/ round1 624152 1127501 round2 631672 1127501 round3 620459 1139756 round4 606113 1139756 round5 624152 1152281 avg 621310 1137359 +83% 5. With both mTHP swap-out and swap-in supported, we offer the option to enable hardware accelerators(Intel IAA) to do parallel decompression with which Kanchana reported 7X improvement on zRAM read latency[3]. [1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/ [2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/ [3] https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com/ Link: https://lkml.kernel.org/r/20240908232119.2157-4-21cnbao@gmail.com Signed-off-by: Chuanhua Han Co-developed-by: Barry Song Signed-off-by: Barry Song Cc: Baolin Wang Cc: Chris Li Cc: Christoph Hellwig Cc: David Hildenbrand Cc: Gao Xiang Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Johannes Weiner Cc: Kairui Song Cc: Kalesh Singh Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Minchan Kim Cc: Nhat Pham Cc: Ryan Roberts Cc: Sergey Senozhatsky Cc: Shakeel Butt Cc: Suren Baghdasaryan Cc: Yang Shi Cc: Yosry Ahmed Cc: Usama Arif Cc: Kanchana P Sridhar Cc: Kairui Song Signed-off-by: Andrew Morton --- mm/memory.c | 261 ++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 234 insertions(+), 27 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 9bcc440dd56e..6469ac99f2f7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3989,6 +3989,194 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) return VM_FAULT_SIGBUS; } +static struct folio *__alloc_swap_folio(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + struct folio *folio; + swp_entry_t entry; + + folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, + vmf->address, false); + if (!folio) + return NULL; + + entry = pte_to_swp_entry(vmf->orig_pte); + if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, + GFP_KERNEL, entry)) { + folio_put(folio); + return NULL; + } + + return folio; +} + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) +{ + struct swap_info_struct *si = swp_swap_info(entry); + pgoff_t offset = swp_offset(entry); + int i; + + /* + * While allocating a large folio and doing swap_read_folio, which is + * the case the being faulted pte doesn't have swapcache. We need to + * ensure all PTEs have no cache as well, otherwise, we might go to + * swap devices while the content is in swapcache. + */ + for (i = 0; i < max_nr; i++) { + if ((si->swap_map[offset + i] & SWAP_HAS_CACHE)) + return i; + } + + return i; +} + +/* + * Check if the PTEs within a range are contiguous swap entries + * and have consistent swapcache, zeromap. + */ +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) +{ + unsigned long addr; + swp_entry_t entry; + int idx; + pte_t pte; + + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); + idx = (vmf->address - addr) / PAGE_SIZE; + pte = ptep_get(ptep); + + if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) + return false; + entry = pte_to_swp_entry(pte); + if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages) + return false; + + /* + * swap_read_folio() can't handle the case a large folio is hybridly + * from different backends. And they are likely corner cases. Similar + * things might be added once zswap support large folios. 
+ */ + if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages)) + return false; + if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages)) + return false; + + return true; +} + +static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset, + unsigned long addr, + unsigned long orders) +{ + int order, nr; + + order = highest_order(orders); + + /* + * To swap in a THP with nr pages, we require that its first swap_offset + * is aligned with that number, as it was when the THP was swapped out. + * This helps filter out most invalid entries. + */ + while (orders) { + nr = 1 << order; + if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr) + break; + order = next_order(&orders, order); + } + + return orders; +} + +static struct folio *alloc_swap_folio(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + unsigned long orders; + struct folio *folio; + unsigned long addr; + swp_entry_t entry; + spinlock_t *ptl; + pte_t *pte; + gfp_t gfp; + int order; + + /* + * If uffd is active for the vma we need per-page fault fidelity to + * maintain the uffd semantics. + */ + if (unlikely(userfaultfd_armed(vma))) + goto fallback; + + /* + * A large swapped out folio could be partially or fully in zswap. We + * lack handling for such cases, so fallback to swapping in order-0 + * folio. + */ + if (!zswap_never_enabled()) + goto fallback; + + entry = pte_to_swp_entry(vmf->orig_pte); + /* + * Get a list of all the (large) orders below PMD_ORDER that are enabled + * and suitable for swapping THP. + */ + orders = thp_vma_allowable_orders(vma, vma->vm_flags, + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); + orders = thp_vma_suitable_orders(vma, vmf->address, orders); + orders = thp_swap_suitable_orders(swp_offset(entry), + vmf->address, orders); + + if (!orders) + goto fallback; + + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, + vmf->address & PMD_MASK, &ptl); + if (unlikely(!pte)) + goto fallback; + + /* + * For do_swap_page, find the highest order where the aligned range is + * completely swap entries with contiguous swap offsets. + */ + order = highest_order(orders); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order)) + break; + order = next_order(&orders, order); + } + + pte_unmap_unlock(pte, ptl); + + /* Try allocating the highest of the remaining orders. */ + gfp = vma_thp_gfp_mask(vma); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + folio = vma_alloc_folio(gfp, order, vma, addr, true); + if (folio) { + if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, + gfp, entry)) + return folio; + folio_put(folio); + } + order = next_order(&orders, order); + } + +fallback: + return __alloc_swap_folio(vmf); +} +#else /* !CONFIG_TRANSPARENT_HUGEPAGE */ +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) +{ + return false; +} + +static struct folio *alloc_swap_folio(struct vm_fault *vmf) +{ + return __alloc_swap_folio(vmf); +} +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. @@ -4077,34 +4265,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) { - /* - * Prevent parallel swapin from proceeding with - * the cache flag. 
Otherwise, another thread may - * finish swapin first, free the entry, and swapout - * reusing the same entry. It's undetectable as - * pte_same() returns true due to entry reuse. - */ - if (swapcache_prepare(entry, 1)) { - /* Relax a bit to prevent rapid repeated page faults */ - schedule_timeout_uninterruptible(1); - goto out; - } - need_clear_cache = true; - /* skip swapcache */ - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, - vma, vmf->address, false); + folio = alloc_swap_folio(vmf); if (folio) { __folio_set_locked(folio); __folio_set_swapbacked(folio); - if (mem_cgroup_swapin_charge_folio(folio, - vma->vm_mm, GFP_KERNEL, - entry)) { - ret = VM_FAULT_OOM; + nr_pages = folio_nr_pages(folio); + if (folio_test_large(folio)) + entry.val = ALIGN_DOWN(entry.val, nr_pages); + /* + * Prevent parallel swapin from proceeding with + * the cache flag. Otherwise, another thread + * may finish swapin first, free the entry, and + * swapout reusing the same entry. It's + * undetectable as pte_same() returns true due + * to entry reuse. + */ + if (swapcache_prepare(entry, nr_pages)) { + /* + * Relax a bit to prevent rapid + * repeated page faults. + */ + schedule_timeout_uninterruptible(1); goto out_page; } - mem_cgroup_swapin_uncharge_swap(entry, 1); + need_clear_cache = true; + + mem_cgroup_swapin_uncharge_swap(entry, nr_pages); shadow = get_shadow_from_swap_cache(entry); if (shadow) @@ -4210,6 +4398,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_nomap; } + /* allocated large folios for SWP_SYNCHRONOUS_IO */ + if (folio_test_large(folio) && !folio_test_swapcache(folio)) { + unsigned long nr = folio_nr_pages(folio); + unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); + unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; + pte_t *folio_ptep = vmf->pte - idx; + pte_t folio_pte = ptep_get(folio_ptep); + + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) || + swap_pte_batch(folio_ptep, nr, folio_pte) != nr) + goto out_nomap; + + page_idx = idx; + address = folio_start; + ptep = folio_ptep; + goto check_folio; + } + nr_pages = 1; page_idx = 0; address = vmf->address; @@ -4341,11 +4547,12 @@ check_folio: folio_add_lru_vma(folio, vma); } else if (!folio_test_anon(folio)) { /* - * We currently only expect small !anon folios, which are either - * fully exclusive or fully shared. If we ever get large folios - * here, we have to be careful. + * We currently only expect small !anon folios which are either + * fully exclusive or fully shared, or new allocated large + * folios which are fully exclusive. If we ever get large + * folios within swapcache here, we have to be careful. 
*/ - VM_WARN_ON_ONCE(folio_test_large(folio)); + VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); folio_add_new_anon_rmap(folio, vma, address, rmap_flags); } else { @@ -4388,7 +4595,7 @@ unlock: out: /* Clear the swap cache pin for direct swapin after PTL unlock */ if (need_clear_cache) - swapcache_clear(si, entry, 1); + swapcache_clear(si, entry, nr_pages); if (si) put_swap_device(si); return ret; @@ -4404,7 +4611,7 @@ out_release: folio_put(swapcache); } if (need_clear_cache) - swapcache_clear(si, entry, 1); + swapcache_clear(si, entry, nr_pages); if (si) put_swap_device(si); return ret; From ed8d5b0ce1d738e13c60d6b1a901a56d832e5070 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Wed, 11 Sep 2024 15:13:20 +0200 Subject: [PATCH 413/416] Revert "uprobes: use vm_special_mapping close() functionality" This reverts commit 08e28de1160a712724268fd33d77b32f1bc84d1c. A malicious application can munmap() its "[uprobes]" vma and in this case xol_mapping.close == uprobe_clear_state() will free the memory which can be used by another thread, or the same thread when it hits the uprobe bp afterwards. Link: https://lkml.kernel.org/r/20240911131320.GA3448@redhat.com Signed-off-by: Oleg Nesterov Cc: Adrian Hunter Cc: Alexander Shishkin Cc: Andrii Nakryiko Cc: Arnaldo Carvalho de Melo Cc: Ian Rogers Cc: Ingo Molnar Cc: Jiri Olsa Cc: Kan Liang Cc: Linus Torvalds Cc: Mark Rutland Cc: Masami Hiramatsu Cc: Michael Ellerman Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Sven Schnelle Signed-off-by: Andrew Morton --- include/linux/uprobes.h | 1 + kernel/events/uprobes.c | 36 +++++++++++++++++++----------------- kernel/fork.c | 1 + 3 files changed, 21 insertions(+), 17 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 493dc95d912c..b503fafb7fb3 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -126,6 +126,7 @@ extern int uprobe_pre_sstep_notifier(struct pt_regs *regs); extern void uprobe_notify_resume(struct pt_regs *regs); extern bool uprobe_deny_signal(void); extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs); +extern void uprobe_clear_state(struct mm_struct *mm); extern int arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr); extern int arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs); extern int arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs); diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index c2d6b2d64de2..73cc47708679 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1482,22 +1482,6 @@ void * __weak arch_uprobe_trampoline(unsigned long *psize) return &insn; } -/* - * uprobe_clear_state - Free the area allocated for slots. 
- */ -static void uprobe_clear_state(const struct vm_special_mapping *sm, struct vm_area_struct *vma) -{ - struct xol_area *area = container_of(vma->vm_private_data, struct xol_area, xol_mapping); - - mutex_lock(&delayed_uprobe_lock); - delayed_uprobe_remove(NULL, vma->vm_mm); - mutex_unlock(&delayed_uprobe_lock); - - put_page(area->pages[0]); - kfree(area->bitmap); - kfree(area); -} - static struct xol_area *__create_xol_area(unsigned long vaddr) { struct mm_struct *mm = current->mm; @@ -1516,7 +1500,6 @@ static struct xol_area *__create_xol_area(unsigned long vaddr) area->xol_mapping.name = "[uprobes]"; area->xol_mapping.fault = NULL; - area->xol_mapping.close = uprobe_clear_state; area->xol_mapping.pages = area->pages; area->pages[0] = alloc_page(GFP_HIGHUSER); if (!area->pages[0]) @@ -1562,6 +1545,25 @@ static struct xol_area *get_xol_area(void) return area; } +/* + * uprobe_clear_state - Free the area allocated for slots. + */ +void uprobe_clear_state(struct mm_struct *mm) +{ + struct xol_area *area = mm->uprobes_state.xol_area; + + mutex_lock(&delayed_uprobe_lock); + delayed_uprobe_remove(NULL, mm); + mutex_unlock(&delayed_uprobe_lock); + + if (!area) + return; + + put_page(area->pages[0]); + kfree(area->bitmap); + kfree(area); +} + void uprobe_start_dup_mmap(void) { percpu_down_read(&dup_mmap_sem); diff --git a/kernel/fork.c b/kernel/fork.c index 61070248a7d3..3d590e51ce84 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1338,6 +1338,7 @@ static inline void __mmput(struct mm_struct *mm) { VM_BUG_ON(atomic_read(&mm->mm_users)); + uprobe_clear_state(mm); exit_aio(mm); ksm_exit(mm); khugepaged_exit(mm); /* must run before exit_mmap */ From 6d27a31ef195951c9b03098edfdf986549a213b7 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Wed, 11 Sep 2024 15:14:07 +0200 Subject: [PATCH 414/416] uprobes: introduce the global struct vm_special_mapping xol_mapping Currently each xol_area has its own instance of vm_special_mapping, this is suboptimal and ugly. Kill xol_area->xol_mapping and add a single global instance of vm_special_mapping, the ->fault() method can use area->pages rather than xol_mapping->pages. As a side effect this fixes the problem introduced by the recent commit 223febc6e557 ("mm: add optional close() to struct vm_special_mapping"), if special_mapping_close() is called from the __mmput() paths, it will use vma->vm_private_data = &area->xol_mapping freed by uprobe_clear_state(). 
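The end result (shown in full in the diff below) is a single read-only descriptor whose ->fault() resolves the page through the mm rather than through per-area state, roughly:

static const struct vm_special_mapping xol_mapping = {
	.name  = "[uprobes]",
	.fault = xol_fault,	/* looks up mm->uprobes_state.xol_area */
};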
Link: https://lkml.kernel.org/r/20240911131407.GB3448@redhat.com Fixes: 223febc6e557 ("mm: add optional close() to struct vm_special_mapping") Signed-off-by: Oleg Nesterov Reported-by: Sven Schnelle Closes: https://lore.kernel.org/all/yt9dy149vprr.fsf@linux.ibm.com/ Cc: Adrian Hunter Cc: Alexander Shishkin Cc: Andrii Nakryiko Cc: Arnaldo Carvalho de Melo Cc: Ian Rogers Cc: Ingo Molnar Cc: Jiri Olsa Cc: Kan Liang Cc: Linus Torvalds Cc: Mark Rutland Cc: Masami Hiramatsu Cc: Michael Ellerman Cc: Namhyung Kim Cc: Peter Zijlstra Signed-off-by: Andrew Morton --- kernel/events/uprobes.c | 21 ++++++++++++++++----- 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 73cc47708679..a478e028043f 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -99,7 +99,6 @@ struct xol_area { atomic_t slot_count; /* number of in-use slots */ unsigned long *bitmap; /* 0 = free slot */ - struct vm_special_mapping xol_mapping; struct page *pages[2]; /* * We keep the vma's vm_start rather than a pointer to the vma @@ -1433,6 +1432,21 @@ void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned lon set_bit(MMF_RECALC_UPROBES, &vma->vm_mm->flags); } +static vm_fault_t xol_fault(const struct vm_special_mapping *sm, + struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct xol_area *area = vma->vm_mm->uprobes_state.xol_area; + + vmf->page = area->pages[0]; + get_page(vmf->page); + return 0; +} + +static const struct vm_special_mapping xol_mapping = { + .name = "[uprobes]", + .fault = xol_fault, +}; + /* Slot allocation for XOL */ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area) { @@ -1459,7 +1473,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area) vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE, VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO, - &area->xol_mapping); + &xol_mapping); if (IS_ERR(vma)) { ret = PTR_ERR(vma); goto fail; @@ -1498,9 +1512,6 @@ static struct xol_area *__create_xol_area(unsigned long vaddr) if (!area->bitmap) goto free_area; - area->xol_mapping.name = "[uprobes]"; - area->xol_mapping.fault = NULL; - area->xol_mapping.pages = area->pages; area->pages[0] = alloc_page(GFP_HIGHUSER); if (!area->pages[0]) goto free_bitmap; From 2abbcc099ec60844ca7c15214ab12955d3c11e68 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Wed, 11 Sep 2024 15:14:37 +0200 Subject: [PATCH 415/416] uprobes: turn xol_area->pages[2] into xol_area->page Now that xol_mapping has its own ->fault() method we no longer need xol_area->pages[1] == NULL, we need a single page. Link: https://lkml.kernel.org/r/20240911131437.GC3448@redhat.com Signed-off-by: Oleg Nesterov Cc: Adrian Hunter Cc: Alexander Shishkin Cc: Andrii Nakryiko Cc: Arnaldo Carvalho de Melo Cc: Ian Rogers Cc: Ingo Molnar Cc: Jiri Olsa Cc: Kan Liang Cc: Linus Torvalds Cc: Mark Rutland Cc: Masami Hiramatsu Cc: Michael Ellerman Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Sven Schnelle Signed-off-by: Andrew Morton --- kernel/events/uprobes.c | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index a478e028043f..94a17435cfe6 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -99,7 +99,7 @@ struct xol_area { atomic_t slot_count; /* number of in-use slots */ unsigned long *bitmap; /* 0 = free slot */ - struct page *pages[2]; + struct page *page; /* * We keep the vma's vm_start rather than a pointer to the vma * itself. 
The probed process or a naughty kernel module could make @@ -1437,7 +1437,7 @@ static vm_fault_t xol_fault(const struct vm_special_mapping *sm, { struct xol_area *area = vma->vm_mm->uprobes_state.xol_area; - vmf->page = area->pages[0]; + vmf->page = area->page; get_page(vmf->page); return 0; } @@ -1512,10 +1512,9 @@ static struct xol_area *__create_xol_area(unsigned long vaddr) if (!area->bitmap) goto free_area; - area->pages[0] = alloc_page(GFP_HIGHUSER); - if (!area->pages[0]) + area->page = alloc_page(GFP_HIGHUSER); + if (!area->page) goto free_bitmap; - area->pages[1] = NULL; area->vaddr = vaddr; init_waitqueue_head(&area->wq); @@ -1523,12 +1522,12 @@ set_bit(0, area->bitmap); atomic_set(&area->slot_count, 1); insns = arch_uprobe_trampoline(&insns_size); - arch_uprobe_copy_ixol(area->pages[0], 0, insns, insns_size); + arch_uprobe_copy_ixol(area->page, 0, insns, insns_size); if (!xol_add_vma(mm, area)) return area; - __free_page(area->pages[0]); + __free_page(area->page); free_bitmap: kfree(area->bitmap); free_area: @@ -1570,7 +1569,7 @@ void uprobe_clear_state(struct mm_struct *mm) if (!area) return; - put_page(area->pages[0]); + put_page(area->page); kfree(area->bitmap); kfree(area); } @@ -1637,7 +1636,7 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) if (unlikely(!xol_vaddr)) return 0; - arch_uprobe_copy_ixol(area->pages[0], xol_vaddr, + arch_uprobe_copy_ixol(area->page, xol_vaddr, &uprobe->arch.ixol, sizeof(uprobe->arch.ixol)); return xol_vaddr; From 684826f8271ad97580b138b9ffd462005e470b99 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Wed, 11 Sep 2024 11:54:56 +0900 Subject: [PATCH 416/416] zram: free secondary algorithms names We need to kfree() the secondary algorithms' names when resetting a zram device that had multiple streams; otherwise we leak memory. [senozhatsky@chromium.org: kfree(NULL) is legal] Link: https://lkml.kernel.org/r/20240917013021.868769-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20240911025600.3681789-1-senozhatsky@chromium.org Fixes: 001d92735701 ("zram: add recompression algorithm sysfs knob") Signed-off-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Signed-off-by: Andrew Morton --- drivers/block/zram/zram_drv.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 1f1bf175a6c3..0207a7fc0a97 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -2112,6 +2112,11 @@ static void zram_destroy_comps(struct zram *zram) zram->num_active_comps--; } + for (prio = ZRAM_SECONDARY_COMP; prio < ZRAM_MAX_COMPS; prio++) { + kfree(zram->comp_algs[prio]); + zram->comp_algs[prio] = NULL; + } + zram_comp_params_reset(zram); }