7998df0b64
431 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Matthew Wilcox (Oracle)
|
e0abfbb671 |
mm: rename vma_pgoff_address back to vma_address
With all callers converted, we can use the nice shorter name. Take this opportunity to reorder the arguments to the logical order (larger object first). Link: https://lkml.kernel.org/r/20240328225831.1765286-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
412ad5fbe9 |
mm: remove vma_address()
Convert the three remaining callers to call vma_pgoff_address() directly. This removes an ambiguity where we'd check just one page if passed a tail page and all N pages if passed a head page. Also add better kernel-doc for vma_pgoff_address(). Link: https://lkml.kernel.org/r/20240328225831.1765286-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
York Jasper Niebuhr
|
ba42b524a0 |
mm: init_mlocked_on_free_v3
Implements the "init_mlocked_on_free" boot option. When this boot option
is enabled, any mlock'ed pages are zeroed on free. If
the pages are munlock'ed beforehand, no initialization takes place.
This boot option is meant to combat the performance hit of
"init_on_free" as reported in commit
|
||
Peter Xu
|
c0bff412e6 |
mm: allow anon exclusive check over hugetlb tail pages
PageAnonExclusive() used to forbid tail pages for hugetlbfs, as that used
to be called mostly in hugetlb specific paths and the head page was
guaranteed.
As we move forward towards merging hugetlb paths into generic mm, we may
start to pass in tail hugetlb pages (when with cont-pte/cont-pmd huge
pages) for such check. Allow it to properly fetch the head, in which case
the anon-exclusiveness of the head will always represents the tail page.
There's already a sign of it when we look at the GUP-fast which already
contain the hugetlb processing altogether: we used to have a specific
commit
|
||
Peter Xu
|
4418c522f6 |
mm/gup: handle huge pmd for follow_pmd_mask()
Replace pmd_trans_huge() with pmd_leaf() to also cover pmd_huge() as long as enabled. FOLL_TOUCH and FOLL_SPLIT_PMD only apply to THP, not yet huge. Since now follow_trans_huge_pmd() can process hugetlb pages, renaming it into follow_huge_pmd() to match what it does. Move it into gup.c so not depend on CONFIG_THP. When at it, move the ctx->page_mask setup into follow_huge_pmd(), only set it when the page is valid. It was not a bug to set it before even if GUP failed (page==NULL), because follow_page_mask() callers always ignores page_mask if so. But doing so makes the code cleaner. [peterx@redhat.com: allow follow_pmd_mask() to take hugetlb tail pages] Link: https://lkml.kernel.org/r/20240403013249.1418299-3-peterx@redhat.com Link: https://lkml.kernel.org/r/20240327152332.950956-12-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
1b16761802 |
mm/gup: handle huge pud for follow_pud_mask()
Teach follow_pud_mask() to be able to handle normal PUD pages like hugetlb. Rename follow_devmap_pud() to follow_huge_pud() so that it can process either huge devmap or hugetlb. Move it out of TRANSPARENT_HUGEPAGE_PUD and and huge_memory.c (which relies on CONFIG_THP). Switch to pud_leaf() to detect both cases in the slow gup. In the new follow_huge_pud(), taking care of possible CoR for hugetlb if necessary. touch_pud() needs to be moved out of huge_memory.c to be accessable from gup.c even if !THP. Since at it, optimize the non-present check by adding a pud_present() early check before taking the pgtable lock, failing the follow_page() early if PUD is not present: that is required by both devmap or hugetlb. Use pud_huge() to also cover the pud_devmap() case. One more trivial thing to mention is, introduce "pud_t pud" in the code paths along the way, so the code doesn't dereference *pudp multiple time. Not only because that looks less straightforward, but also because if the dereference really happened, it's not clear whether there can be race to see different *pudp values when it's being modified at the same time. Setting ctx->page_mask properly for a PUD entry. As a side effect, this patch should also be able to optimize devmap GUP on PUD to be able to jump over the whole PUD range, but not yet verified. Hugetlb already can do so prior to this patch. Link: https://lkml.kernel.org/r/20240327152332.950956-11-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
b84fd2835c |
mm: make page_mapped() take a const argument
None of the functions called by page_mapped() modify the page or folio, so mark them all as const. Link: https://lkml.kernel.org/r/20240326171045.410737-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Yajun Deng
|
d4e6b397be |
mm/mmap: convert all mas except mas_detach to vma iterator
There are two types of iterators mas and vmi in the current code. If the maple tree comes from the mm structure, we can use the vma iterator. Avoid using mas directly as possible. Keep using mas for the mt_detach tree, since it doesn't come from the mm structure. Remove as many uses of mas as possible, but we will still have a few that must be passed through in unmap_vmas() and free_pgtables(). Also introduce vma_iter_reset, vma_iter_{prev, next}_range_limit and vma_iter_area_{lowest, highest} helper functions for using the vma interator. Link: https://lkml.kernel.org/r/20240325063258.1437618-1-yajun.deng@linux.dev Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Tested-by: Helge Deller <deller@gmx.de> [parisc] Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Barry Song
|
f238b8c33c |
arm64: mm: swap: support THP_SWAP on hardware with MTE
Commit
|
||
Baolin Wang
|
e42dfe4e0a |
mm: record the migration reason for struct migration_target_control
Patch series "make the hugetlb migration strategy consistent", v2. As discussed in previous thread [1], there is an inconsistency when handling hugetlb migration. When handling the migration of freed hugetlb, it prevents fallback to other NUMA nodes in alloc_and_dissolve_hugetlb_folio(). However, when dealing with in-use hugetlb, it allows fallback to other NUMA nodes in alloc_hugetlb_folio_nodemask(), which can break the per-node hugetlb pool and might result in unexpected failures when node bound workloads doesn't get what is asssumed available. This patchset tries to make the hugetlb migration strategy more clear and consistent. Please find details in each patch. [1] https://lore.kernel.org/all/6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com/ This patch (of 2): To support different hugetlb allocation strategies during hugetlb migration based on various migration reasons, record the migration reason in the migration_target_control structure as a preparation. Link: https://lkml.kernel.org/r/cover.1709719720.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/7b95d4981e07211f57139fc5b1f7ce91b920cee4.1709719720.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Johannes Weiner
|
e0932b6c1f |
mm: page_alloc: consolidate free page accounting
Free page accounting currently happens a bit too high up the call stack, where it has to deal with guard pages, compaction capturing, block stealing and even page isolation. This is subtle and fragile, and makes it difficult to hack on the code. Now that type violations on the freelists have been fixed, push the accounting down to where pages enter and leave the freelist. [hannes@cmpxchg.org: undo unrelated drive-by line wrap] Link: https://lkml.kernel.org/r/20240327185736.GA7597@cmpxchg.org [hannes@cmpxchg.org: remove unused page parameter from account_freepages()] Link: https://lkml.kernel.org/r/20240327185831.GB7597@cmpxchg.org [baolin.wang@linux.alibaba.com: fix free page accounting] Link: https://lkml.kernel.org/r/a2a48baca69f103aa431fd201f8a06e3b95e203d.1712648441.git.baolin.wang@linux.alibaba.com [andriy.shevchenko@linux.intel.com: avoid defining unused function] Link: https://lkml.kernel.org/r/20240423161506.2637177-1-andriy.shevchenko@linux.intel.com Link: https://lkml.kernel.org/r/20240320180429.678181-11-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Johannes Weiner
|
fd919a85cd |
mm: page_isolation: prepare for hygienic freelists
Page isolation currently sets MIGRATE_ISOLATE on a block, then drops zone->lock and scans the block for straddling buddies to split up. Because this happens non-atomically wrt the page allocator, it's possible for allocations to get a buddy whose first block is a regular pcp migratetype but whose tail is isolated. This means that in certain cases memory can still be allocated after isolation. It will also trigger the freelist type hygiene warnings in subsequent patches. start_isolate_page_range() isolate_single_pageblock() set_migratetype_isolate(tail) lock zone->lock move_freepages_block(tail) // nop set_pageblock_migratetype(tail) unlock zone->lock __rmqueue_smallest() del_page_from_freelist(head) expand(head, head_mt) WARN(head_mt != tail_mt) start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES) for (pfn = start_pfn, pfn < end_pfn) if (PageBuddy()) split_free_page(head) Introduce a variant of move_freepages_block() provided by the allocator specifically for page isolation; it moves free pages, converts the block, and handles the splitting of straddling buddies while holding zone->lock. The allocator knows that pageblocks and buddies are always naturally aligned, which means that buddies can only straddle blocks if they're actually >pageblock_order. This means the search-and-split part can be simplified compared to what page isolation used to do. Also tighten up the page isolation code around the expectations of which pages can be large, and how they are freed. Based on extensive discussions with and invaluable input from Zi Yan. [hannes@cmpxchg.org: work around older gcc warning] Link: https://lkml.kernel.org/r/20240321142426.GB777580@cmpxchg.org Link: https://lkml.kernel.org/r/20240320180429.678181-10-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
85edc15a4c |
mm: remove folio_prep_large_rmappable()
Now that prep_compound_page() initialises folio->_deferred_list, folio_prep_large_rmappable()'s only purpose is to set the large_rmappable flag, so inline it into the two callers. Take the opportunity to convert the large_rmappable definition from PAGEFLAG to FOLIO_FLAG and remove the existance of PageTestLargeRmappable and friends. Link: https://lkml.kernel.org/r/20240321142448.1645400-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
b7b098cf00 |
mm: always initialise folio->_deferred_list
Patch series "Various significant MM patches". These patches all interact in annoying ways which make it tricky to send them out in any way other than a big batch, even though there's not really an overarching theme to connect them. The big effects of this patch series are: - folio_test_hugetlb() becomes reliable, even when called without a page reference - We free up PG_slab, and we could always use more page flags - We no longer need to check PageSlab before calling page_mapcount() This patch (of 9): For compound pages which are at least order-2 (and hence have a deferred_list), initialise it and then we can check at free that the page is not part of a deferred list. We recently found this useful to rule out a source of corruption. [peterx@redhat.com: always initialise folio->_deferred_list] Link: https://lkml.kernel.org/r/20240417211836.2742593-2-peterx@redhat.com Link: https://lkml.kernel.org/r/20240321142448.1645400-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240321142448.1645400-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Donet Tom
|
f8fd525ba3 |
mm/mempolicy: use numa_node_id() instead of cpu_to_node()
Patch series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy:, v4. This patchset is to optimize the cross-socket memory access with MPOL_PREFERRED_MANY policy. To test this patch we ran the following test on a 3 node system. Node 0 - 2GB - Tier 1 Node 1 - 11GB - Tier 1 Node 6 - 10GB - Tier 2 Below changes are made to memcached to set the memory policy, It select Node0 and Node1 as preferred nodes. #include <numaif.h> #include <numa.h> unsigned long nodemask; int ret; nodemask = 0x03; ret = set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING, &nodemask, 10); /* If MPOL_F_NUMA_BALANCING isn't supported, * fall back to MPOL_PREFERRED_MANY */ if (ret < 0 && errno == EINVAL){ printf("set mem policy normal\n"); ret = set_mempolicy(MPOL_PREFERRED_MANY, &nodemask, 10); } if (ret < 0) { perror("Failed to call set_mempolicy"); exit(-1); } Test Procedure: =============== 1. Make sure memory tiring and demotion are enabled. 2. Start memcached. # ./memcached -b 100000 -m 204800 -u root -c 1000000 -t 7 -d -s "/tmp/memcached.sock" 3. Run memtier_benchmark to store 3200000 keys. #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary --threads=1 --pipeline=1 --ratio=1:0 --key-pattern=S:S --key-minimum=1 --key-maximum=3200000 -n allkeys -c 1 -R -x 1 -d 1024 4. Start a memory eater on node 0 and 1. This will demote all memcached pages to node 6. 5. Make sure all the memcached pages got demoted to lower tier by reading /proc/<memcaced PID>/numa_maps. # cat /proc/2771/numa_maps --- default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64 default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64 --- 6. Kill memory eater. 7. Read the pgpromote_success counter. 8. Start reading the keys by running memtier_benchmark. #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary --pipeline=1 --distinct-client-seed --ratio=0:3 --key-pattern=R:R --key-minimum=1 --key-maximum=3200000 -n allkeys --threads=64 -c 1 -R -x 6 9. Read the pgpromote_success counter. Test Results: ============= Without Patch ------------------ 1. pgpromote_success before test Node 0: pgpromote_success 11 Node 1: pgpromote_success 140974 pgpromote_success after test Node 0: pgpromote_success 11 Node 1: pgpromote_success 140974 2. Memtier-benchmark result. AGGREGATED AVERAGE RESULTS (6 runs) ================================================================== Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency ------------------------------------------------------------------ Sets 0.00 --- --- --- --- Gets 305792.03 305791.93 0.10 0.18949 0.16700 Waits 0.00 --- --- --- --- Totals 305792.03 305791.93 0.10 0.18949 0.16700 ====================================== p99 Latency p99.9 Latency KB/sec ------------------------------------- --- --- 0.00 0.44700 1.71100 11542.69 --- --- --- 0.44700 1.71100 11542.69 With Patch --------------- 1. pgpromote_success before test Node 0: pgpromote_success 5 Node 1: pgpromote_success 89386 pgpromote_success after test Node 0: pgpromote_success 57895 Node 1: pgpromote_success 141463 2. Memtier-benchmark result. AGGREGATED AVERAGE RESULTS (6 runs) ==================================================================== Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency -------------------------------------------------------------------- Sets 0.00 --- --- --- --- Gets 521942.24 521942.07 0.17 0.11459 0.10300 Waits 0.00 --- --- --- --- Totals 521942.24 521942.07 0.17 0.11459 0.10300 ======================================= p99 Latency p99.9 Latency KB/sec --------------------------------------- --- --- 0.00 0.23100 0.31900 19701.68 --- --- --- 0.23100 0.31900 19701.68 Test Result Analysis: ===================== 1. With patch we could observe pages are getting promoted. 2. Memtier-benchmark results shows that, with the patch, performance has increased more than 50%. Ops/sec without fix - 305792.03 Ops/sec with fix - 521942.24 This patch (of 2): Instead of using 'cpu_to_node()', we use 'numa_node_id()', which is quicker. smp_processor_id is guaranteed to be stable in the 'mpol_misplaced()' function because it is called with ptl held. lockdep_assert_held was added to ensure that. No functional change in this patch. [donettom@linux.ibm.com: add "* @vmf: structure describing the fault" comment] Link: https://lkml.kernel.org/r/d8b993ea9dccfac0bc3ed61d3a81f4ac5f376e46.1711002865.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/cover.1711373653.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/6059f034f436734b472d066db69676fb3a459864.1711373653.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/cover.1709909210.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/744646531af02cc687cde8ae788fb1779e99d02c.1709909210.git.donettom@linux.ibm.com Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
David Hildenbrand
|
631426ba1d |
mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly
Darrick reports that in some cases where pread() would fail with -EIO and
mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.
While the madvise() call can be interrupted by a signal, this is not the
desired behavior. MADV_POPULATE_READ / MADV_POPULATE_WRITE should behave
like page faults in that case: fail and not retry forever.
A reproducer can be found at [1].
The reason is that __get_user_pages(), as called by
faultin_vma_page_range(), will not handle VM_FAULT_RETRY in a proper way:
it will simply return 0 when VM_FAULT_RETRY happened, making
madvise_populate()->faultin_vma_page_range() retry again and again, never
setting FOLL_TRIED->FAULT_FLAG_TRIED for __get_user_pages().
__get_user_pages_locked() does what we want, but duplicating that logic in
faultin_vma_page_range() feels wrong.
So let's use __get_user_pages_locked() instead, that will detect
VM_FAULT_RETRY and set FOLL_TRIED when retrying, making the fault handler
return VM_FAULT_SIGBUS (VM_FAULT_ERROR) at some point, propagating -EFAULT
from faultin_page() to __get_user_pages(), all the way to
madvise_populate().
But, there is an issue: __get_user_pages_locked() will end up re-taking
the MM lock and then __get_user_pages() will do another VMA lookup. In
the meantime, the VMA layout could have changed and we'd fail with
different error codes than we'd want to.
As __get_user_pages() will currently do a new VMA lookup either way, let
it do the VMA handling in a different way, controlled by a new
FOLL_MADV_POPULATE flag, effectively moving these checks from
madvise_populate() + faultin_page_range() in there.
With this change, Darricks reproducer properly fails with -EFAULT, as
documented for MADV_POPULATE_READ / MADV_POPULATE_WRITE.
[1] https://lore.kernel.org/all/20240313171936.GN1927156@frogsfrogsfrogs/
Link: https://lkml.kernel.org/r/20240314161300.382526-1-david@redhat.com
Link: https://lkml.kernel.org/r/20240314161300.382526-2-david@redhat.com
Fixes:
|
||
Linus Torvalds
|
902861e34c |
- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx TMNhHfyiHYDTx/GAV9NXW84tasJSDgA= =TG55 -----END PGP SIGNATURE----- Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. * tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits) mm/zswap: remove the memcpy if acomp is not sleepable crypto: introduce: acomp_is_async to expose if comp drivers might sleep memtest: use {READ,WRITE}_ONCE in memory scanning mm: prohibit the last subpage from reusing the entire large folio mm: recover pud_leaf() definitions in nopmd case selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements selftests/mm: skip uffd hugetlb tests with insufficient hugepages selftests/mm: dont fail testsuite due to a lack of hugepages mm/huge_memory: skip invalid debugfs new_order input for folio split mm/huge_memory: check new folio order when split a folio mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure mm: add an explicit smp_wmb() to UFFDIO_CONTINUE mm: fix list corruption in put_pages_list mm: remove folio from deferred split list before uncharging it filemap: avoid unnecessary major faults in filemap_fault() mm,page_owner: drop unnecessary check mm,page_owner: check for null stack_record before bumping its refcount mm: swap: fix race between free_swap_and_cache() and swapoff() mm/treewide: align up pXd_leaf() retval across archs mm/treewide: drop pXd_large() ... |
||
Barry Song
|
ac96cc4d1c |
mm: make folio_pte_batch available outside of mm/memory.c
madvise, mprotect and some others might need folio_pte_batch to check if a range of PTEs are completely mapped to a large folio with contiguous physical addresses. Let's make it available in mm/internal.h. While at it, add proper kernel doc and sanity-check more input parameters using two additional VM_WARN_ON_FOLIO(). [21cnbao@gmail.com: build fix] Link: https://lkml.kernel.org/r/CAGsJ_4wWzG-37D82vqP_zt+Fcbz+URVe5oXLBc4M5wbN8A_gpQ@mail.gmail.com [david@redhat.com: improve the doc for the exported func] Link: https://lkml.kernel.org/r/20240227104201.337988-1-21cnbao@gmail.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Richard Chang
|
c8b3600312 |
mm: add alloc_contig_migrate_range allocation statistics
alloc_contig_migrate_range has every information to be able to understand big contiguous allocation latency. For example, how many pages are migrated, how many times they were needed to unmap from page tables. This patch adds the trace event to collect the allocation statistics. In the field, it was quite useful to understand CMA allocation latency. [akpm@linux-foundation.org: a/trace_mm_alloc_config_migrate_range_info_enabled/trace_mm_alloc_contig_migrate_range_info_enabled] Link: https://lkml.kernel.org/r/20240228051127.2859472-1-richardycc@google.com Signed-off-by: Richard Chang <richardycc@google.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org. Cc: Martin Liu <liumartin@google.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
8b7b0a5eee |
mm: remove free_unref_page_list()
All callers now use free_unref_folios() so we can delete this function. Link: https://lkml.kernel.org/r/20240227174254.710559-15-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
90491d87dd |
mm: add free_unref_folios()
Iterate over a folio_batch rather than a linked list. This is easier for the CPU to prefetch and has a batch count naturally built in so we don't need to track it. Again, this lowers the maximum lock hold time from 32 folios to 15, but I do not expect this to have a significant effect. Link: https://lkml.kernel.org/r/20240227174254.710559-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
8897277acf |
mm: support order-1 folios in the page cache
Folios of order 1 have no space to store the deferred list. This is not a problem for the page cache as file-backed folios are never placed on the deferred list. All we need to do is prevent the core MM from touching the deferred list for order 1 folios and remove the code which prevented us from allocating order 1 folios. Link: https://lore.kernel.org/linux-mm/90344ea7-4eec-47ee-5996-0c22f42d6a6a@google.com/ Link: https://lkml.kernel.org/r/20240226205534.1603748-3-zi.yan@sent.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Barry Song
|
2864f3d0f5 |
mm: madvise: pageout: ignore references rather than clearing young
While doing MADV_PAGEOUT, the current code will clear PTE young so that vmscan won't read young flags to allow the reclamation of madvised folios to go ahead. It seems we can do it by directly ignoring references, thus we can remove tlb flush in madvise and rmap overhead in vmscan. Regarding the side effect, in the original code, if a parallel thread runs side by side to access the madvised memory with the thread doing madvise, folios will get a chance to be re-activated by vmscan (though the time gap is actually quite small since checking PTEs is done immediately after clearing PTEs young). But with this patch, they will still be reclaimed. But this behaviour doing PAGEOUT and doing access at the same time is quite silly like DoS. So probably, we don't need to care. Or ignoring the new access during the quite small time gap is even better. For DAMON's DAMOS_PAGEOUT based on physical address region, we still keep its behaviour as is since a physical address might be mapped by multiple processes. MADV_PAGEOUT based on virtual address is actually much more aggressive on reclamation. To untouch paddr's DAMOS_PAGEOUT, we simply pass ignore_references as false in reclaim_pages(). A microbench as below has shown 6% decrement on the latency of MADV_PAGEOUT, #define PGSIZE 4096 main() { int i; #define SIZE 512*1024*1024 volatile long *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); for (i = 0; i < SIZE/sizeof(long); i += PGSIZE / sizeof(long)) p[i] = 0x11; madvise(p, SIZE, MADV_PAGEOUT); } w/o patch w/ patch root@10:~# time ./a.out root@10:~# time ./a.out real 0m49.634s real 0m46.334s user 0m0.637s user 0m0.648s sys 0m47.434s sys 0m44.265s Link: https://lkml.kernel.org/r/20240226005739.24350-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: SeongJae Park <sj@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Vishal Moola (Oracle)
|
997f0ecb11 |
mm/memory: change vmf_anon_prepare() to be non-static
Patch series "Handle hugetlb faults under the VMA lock", v2. It is generally safe to handle hugetlb faults under the VMA lock. The only time this is unsafe is when no anon_vma has been allocated to this vma yet, so we can use vmf_anon_prepare() instead of anon_vma_prepare() to bailout if necessary. This should only happen for the first hugetlb page in the vma. Additionally, this patchset begins to use struct vm_fault within hugetlb_fault(). This works towards cleaning up hugetlb code, and should significantly reduce the number of arguments passed to functions. The last patch in this series may cause ltp hugemmap10 to "fail". This is because vmf_anon_prepare() may bailout with no anon_vma under the VMA lock after allocating a folio for the hugepage. In free_huge_folio(), this folio is completely freed on bailout iff there is a surplus of hugetlb pages. This will remove a folio off the freelist and decrement the number of hugepages while ltp expects these counters to remain unchanged on failure. The rest of the ltp testcases pass. This patch (of 2): In order to handle hugetlb faults under the VMA lock, hugetlb can use vmf_anon_prepare() to ensure we can safely prepare an anon_vma. Change it to be a non-static function so it can be used within hugetlb as well. Link: https://lkml.kernel.org/r/20240221234732.187629-6-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20240221234732.187629-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zi Yan
|
733aea0b3a |
mm/compaction: add support for >0 order folio memory compaction.
Before last commit, memory compaction only migrates order-0 folios and skips >0 order folios. Last commit splits all >0 order folios during compaction. This commit migrates >0 order folios during compaction by keeping isolated free pages at their original size without splitting them into order-0 pages and using them directly during migration process. What is different from the prior implementation: 1. All isolated free pages are kept in a NR_PAGE_ORDERS array of page lists, where each page list stores free pages in the same order. 2. All free pages are not post_alloc_hook() processed nor buddy pages, although their orders are stored in first page's private like buddy pages. 3. During migration, in new page allocation time (i.e., in compaction_alloc()), free pages are then processed by post_alloc_hook(). When migration fails and a new page is returned (i.e., in compaction_free()), free pages are restored by reversing the post_alloc_hook() operations using newly added free_pages_prepare_fpi_none(). Step 3 is done for a latter optimization that splitting and/or merging free pages during compaction becomes easier. Note: without splitting free pages, compaction can end prematurely due to migration will return -ENOMEM even if there is free pages. This happens when no order-0 free page exist and compaction_alloc() return NULL. Link: https://lkml.kernel.org/r/20240220183220.1451315-4-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Adam Manzanares <a.manzanares@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Yajun Deng
|
412c6ef986 |
mm/mmap: introduce vma_set_range()
There is a lot of code needs to set the range of vma in mmap.c, introduce vma_set_range() to simplify the code. Link: https://lkml.kernel.org/r/20240124035719.3685193-1-yajun.deng@linux.dev Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Christoph Hellwig
|
b64e74e95a |
mm: move mapping_set_update out of <linux/swap.h>
mapping_set_update is only used inside mm/. Move mapping_set_update to mm/internal.h and turn it into an inline function instead of a macro. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> |
||
Kirill A. Shutemov
|
5e0a760b44 |
mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
commit
|
||
David Hildenbrand
|
4a8ffab02d |
mm: remove one last reference to page_add_*_rmap()
Let's fixup one remaining comment. Note that the only trace remaining of the old rmap interface is in an example in Documentation/trace/ftrace.rst, that we'll just leave alone. Link: https://lkml.kernel.org/r/20231220224504.646757-41-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
David Hildenbrand
|
e78a13fd16 |
mm/rmap: rename COMPOUND_MAPPED to ENTIRELY_MAPPED
We removed all "bool compound" and RMAP_COMPOUND parameters. Let's remove the remaining "compound" terminology by making COMPOUND_MAPPED match the "folio->_entire_mapcount" terminology, renaming it to ENTIRELY_MAPPED. ENTIRELY_MAPPED is only used when the whole folio is mapped using a single page table entry (e.g., a single PMD mapping a PMD-sized THP). For now, we don't support mapping any THP bigger than that, so ENTIRELY_MAPPED only applies to PMD-mapped PMD-sized THP only. Link: https://lkml.kernel.org/r/20231220224504.646757-40-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
David Hildenbrand
|
e3b4b1374f |
mm: convert page_try_share_anon_rmap() to folio_try_share_anon_rmap_[pte|pmd]()
Let's convert it like we converted all the other rmap functions. Don't introduce folio_try_share_anon_rmap_ptes() for now, as we don't have a user that wants rmap batching in sight. Pretty easy to add later. All users are easy to convert -- only ksm.c doesn't use folios yet but that is left for future work -- so let's just do it in a single shot. While at it, turn the BUG_ON into a WARN_ON_ONCE. Note that page_try_share_anon_rmap() so far didn't care about pte/pmd mappings (no compound parameter). We're changing that so we can perform better sanity checks and make the code actually more readable/consistent. For example, __folio_rmap_sanity_checks() will make sure that a PMD range actually falls completely into the folio. Link: https://lkml.kernel.org/r/20231220224504.646757-39-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
David Hildenbrand
|
4d8f7418e8 |
mm/rmap: remove page_remove_rmap()
All callers are gone, let's remove it and some leftover traces. Link: https://lkml.kernel.org/r/20231220224504.646757-33-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Chen Haonan
|
dd05f5ec1e |
mm: use vma_pages() for vma objects
vma_pages() is more readable and also better at avoiding error codes, so use vma_pages() instead of direct operations on vma Link: https://lkml.kernel.org/r/tencent_151850CF327EB055BBC83298A929BD06CD0A@qq.com Signed-off-by: Chen Haonan <chen.haonan2@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liam R. Howlett
|
067311d33e |
maple_tree: separate ma_state node from status
The maple tree node is overloaded to keep status as well as the active node. This, unfortunately, results in a re-walk on underflow or overflow. Since the maple state has room, the status can be placed in its own enum in the structure. Once an underflow/overflow is detected, certain modes can restore the status to active and others may need to re-walk just that one node to see the entry. The status being an enum has the benefit of detecting unhandled status in switch statements. [Liam.Howlett@oracle.com: fix comments about MAS_*] Link: https://lkml.kernel.org/r/20231106154124.614247-1-Liam.Howlett@oracle.com [Liam.Howlett@oracle.com: update forking to separate maple state and node] Link: https://lkml.kernel.org/r/20231106154551.615042-1-Liam.Howlett@oracle.com [Liam.Howlett@oracle.com: fix mas_prev() state separation code] Link: https://lkml.kernel.org/r/20231207193319.4025462-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20231101171629.3612299-9-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liam R. Howlett
|
bf857ddd21 |
maple_tree: move debug check to __mas_set_range()
__mas_set_range() was created to shortcut resetting the maple state and a debug check was added to the caller (the vma iterator) to ensure the internal maple state remains safe to use. Move the debug check from the vma iterator into the maple tree itself so other users do not incorrectly use the advanced maple state modification. Fallout from this change include a large amount of debug setup needed to be moved to earlier in the header, and the maple_tree.h radix-tree test code needed to move the inclusion of the header to after the atomic define. None of those changes have functional changes. Link: https://lkml.kernel.org/r/20231101171629.3612299-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
2033c98cce |
mm: remove invalidate_inode_page()
All callers are now converted to call mapping_evict_folio(). Link: https://lkml.kernel.org/r/20231108182809.602073-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
1e12cbb9f6 |
mm: make mapping_evict_folio() the preferred way to evict clean folios
Patch series "Fix fault handler's handling of poisoned tail pages". Since introducing the ability to have large folios in the page cache, it's been possible to have a hwpoisoned tail page returned from the fault handler. We handle this situation poorly; failing to remove the affected page from use. This isn't a minimal patch to fix it, it's a full conversion of all the code surrounding it. This patch (of 6): invalidate_inode_page() does very little beyond calling mapping_evict_folio(). Move the check for mapping being NULL into mapping_evict_folio() and make it available to the rest of the MM for use in the next few patches. Link: https://lkml.kernel.org/r/20231108182809.602073-1-willy@infradead.org Link: https://lkml.kernel.org/r/20231108182809.602073-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peng Zhang
|
d240629148 |
fork: use __mt_dup() to duplicate maple tree in dup_mmap()
In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then directly replacing the entries of VMAs in the new maple tree can result in better performance. __mt_dup() uses DFS pre-order to duplicate the maple tree, so it is efficient. The average time complexity of __mt_dup() is O(n), where n is the number of VMAs. The proof of the time complexity is provided in the commit log that introduces __mt_dup(). After duplicating the maple tree, each element is traversed and replaced (ignoring the cases of deletion, which are rare). Since it is only a replacement operation for each element, this process is also O(n). Analyzing the exact time complexity of the previous algorithm is challenging because each insertion can involve appending to a node, pushing data to adjacent nodes, or even splitting nodes. The frequency of each action is difficult to calculate. The worst-case scenario for a single insertion is when the tree undergoes splitting at every level. If we consider each insertion as the worst-case scenario, we can determine that the upper bound of the time complexity is O(n*log(n)), although this is a loose upper bound. However, based on the test data, it appears that the actual time complexity is likely to be O(n). As the entire maple tree is duplicated using __mt_dup(), if dup_mmap() fails, there will be a portion of VMAs that have not been duplicated in the maple tree. To handle this, we mark the failure point with XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, stop releasing VMAs that have not been duplicated after this point. There is a "spawn" in byte-unixbench[1], which can be used to test the performance of fork(). I modified it slightly to make it work with different number of VMAs. Below are the test results. The first row shows the number of VMAs. The second and third rows show the number of fork() calls per ten seconds, corresponding to next-20231006 and the this patchset, respectively. The test results were obtained with CPU binding to avoid scheduler load balancing that could cause unstable results. There are still some fluctuations in the test results, but at least they are better than the original performance. 21 121 221 421 821 1621 3221 6421 12821 25621 51221 112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393 114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599 2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42% [1] https://github.com/kdlucas/byte-unixbench/tree/master Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Christie <michael.christie@oracle.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Hugh Dickins
|
23e4883248 |
mm: add page_rmappable_folio() wrapper
folio_prep_large_rmappable() is being used repeatedly along with a conversion from page to folio, a check non-NULL, a check order > 1: wrap it all up into struct folio *page_rmappable_folio(struct page *). Link: https://lkml.kernel.org/r/8d92c6cf-eebe-748-e29c-c8ab224c741@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Muhammad Muzammil
|
be16dd764a |
mm: fix multiple typos in multiple files
Link: https://lkml.kernel.org/r/20231023124405.36981-1-m.muzzammilashraf@gmail.com Signed-off-by: Muhammad Muzammil <m.muzzammilashraf@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Muzammil <m.muzzammilashraf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Lorenzo Stoakes
|
93bf5d4aa2 |
mm: abstract VMA merge and extend into vma_merge_extend() helper
mremap uses vma_merge() in the case where a VMA needs to be extended. This can be significantly simplified and abstracted. This makes it far easier to understand what the actual function is doing, avoids future mistakes in use of the confusing vma_merge() function and importantly allows us to make future changes to how vma_merge() is implemented by knowing explicitly which merge cases each invocation uses. Note that in the mremap() extend case, we perform this merge only when old_len == vma->vm_end - addr. The extension_start, i.e. the start of the extended portion of the VMA is equal to addr + old_len, i.e. vma->vm_end. With this refactoring, vma_merge() is no longer required anywhere except mm/mmap.c, so mark it static. Link: https://lkml.kernel.org/r/f16cbdc2e72d37a1a097c39dc7d1fee8919a1c93.1697043508.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Lorenzo Stoakes
|
adb20b0c78 |
mm: make vma_merge() and split_vma() internal
Now the common pattern of - attempting a merge via vma_merge() and should this fail splitting VMAs via split_vma() - has been abstracted, the former can be placed into mm/internal.h and the latter made static. In addition, the split_vma() nommu variant also need not be exported. Link: https://lkml.kernel.org/r/405f2be10e20c4e9fbcc9fe6b2dfea105f6642e0.1697043508.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Lucy Mielke
|
f04eba134e |
mm: add printf attribute to shrinker_debugfs_name_alloc
This fixes a compiler warning when compiling an allyesconfig with W=1:
mm/internal.h:1235:9: error: function might be a candidate for `gnu_printf'
format attribute [-Werror=suggest-attribute=format]
[akpm@linux-foundation.org: fix shrinker_alloc() as welll per Qi Zheng]
Link: https://lkml.kernel.org/r/822387b7-4895-4e64-5806-0f56b5d6c447@bytedance.com
Link: https://lkml.kernel.org/r/ZSBue-3kM6gI6jCr@mainframe
Fixes:
|
||
Matthew Wilcox (Oracle)
|
2580d55458 |
mm: use folio_xor_flags_has_waiters() in folio_end_writeback()
Match how folio_unlock() works by combining the test for PG_waiters with the clearing of PG_writeback. This should have a small performance win, and removes the last user of folio_wake(). Link: https://lkml.kernel.org/r/20231004165317.1061855-18-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
7d0795d098 |
mm: make __end_folio_writeback() return void
Rather than check the result of test-and-clear, just check that we have the writeback bit set at the start. This wouldn't catch every case, but it's good enough (and enables the next patch). Link: https://lkml.kernel.org/r/20231004165317.1061855-17-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Lorenzo Stoakes
|
0f20bba168 |
mm/gup: explicitly define and check internal GUP flags, disallow FOLL_TOUCH
Rather than open-coding a list of internal GUP flags in is_valid_gup_args(), define which ones are internal. In addition, explicitly check to see if the user passed in FOLL_TOUCH somehow, as this appears to have been accidentally excluded. Link: https://lkml.kernel.org/r/971e013dfe20915612ea8b704e801d7aef9a66b6.1696288092.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kefeng Wang
|
cda6d93672 |
mm: memory: make numa_migrate_prep() to take a folio
In preparation for large folio numa balancing, make numa_migrate_prep() to take a folio, no functional change intended. Link: https://lkml.kernel.org/r/20230921074417.24004-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Yin Fengwei
|
1acbc3f936 |
mm: handle large folio when large folio in VM_LOCKED VMA range
If large folio is in the range of VM_LOCKED VMA, it should be mlocked to avoid being picked by page reclaim. Which may split the large folio and then mlock each pages again. Mlock this kind of large folio to prevent them being picked by page reclaim. For the large folio which cross the boundary of VM_LOCKED VMA or not fully mapped to VM_LOCKED VMA, we'd better not to mlock it. So if the system is under memory pressure, this kind of large folio will be split and the pages ouf of VM_LOCKED VMA can be reclaimed. Ideally, for large folio, we should mlock it when the large folio is fully mapped to VMA and munlock it if any page are unmampped from VMA. But it's not easy to detect whether the large folio is fully mapped to VMA in some cases (like add/remove rmap). So we update mlock_vma_folio() and munlock_vma_folio() to mlock/munlock the folio according to vma->vm_flags. Let caller to decide whether they should call these two functions. For add rmap, only mlock normal 4K folio and postpone large folio handling to page reclaim phase. It is possible to reuse page table iterator to detect whether folio is fully mapped or not during page reclaim phase. For remove rmap, invoke munlock_vma_folio() to munlock folio unconditionly because rmap makes folio not fully mapped to VMA. Link: https://lkml.kernel.org/r/20230918073318.1181104-3-fengwei.yin@intel.com Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Yin Fengwei
|
28e566572a |
mm: add functions folio_in_range() and folio_within_vma()
Patch series "support large folio for mlock", v3. Yu mentioned at [1] about the mlock() can't be applied to large folio. I leant the related code and here is my understanding: - For RLIMIT_MEMLOCK related, there is no problem. Because the RLIMIT_MEMLOCK statistics is not related underneath page. That means underneath page mlock or munlock doesn't impact the RLIMIT_MEMLOCK statistics collection which is always correct. - For keeping the page in RAM, there is no problem either. At least, during try_to_unmap_one(), once detect the VMA has VM_LOCKED bit set in vm_flags, the folio will be kept whatever the folio is mlocked or not. So the function of mlock for large folio works. But it's not optimized because the page reclaim needs scan these large folio and may split them. This series identified the large folio for mlock to four types: - The large folio is in VM_LOCKED range and fully mapped to the range - The large folio is in the VM_LOCKED range but not fully mapped to the range - The large folio cross VM_LOCKED VMA boundary - The large folio cross last level page table boundary For the first type, we mlock large folio so page reclaim will skip it. For the second/third type, we don't mlock large folio. As the pages not mapped to VM_LOACKED range are mapped to none VM_LOCKED range, if system is in memory pressure situation, the large folio can be picked by page reclaim and split. Then the pages not mapped to VM_LOCKED range can be reclaimed. For the fourth type, we don't mlock large folio because locking one page table lock can't prevent the part in another last level page table being unmapped. Thanks to Ryan for pointing this out. To check whether the folio is fully mapped to the range, PTEs needs be checked to see whether the page of folio is associated. Which needs take page table lock and is heavy operation. So far, the only place needs this check is madvise and page reclaim. These functions already have their own PTE iterator. patch1 introduce API to check whether large folio is in VMA range. patch2 make page reclaim/mlock_vma_folio/munlock_vma_folio support large folio mlock/munlock. patch3 make mlock/munlock syscall support large folio. Yu also mentioned a race which can make folio unevictable after munlock during RFC v2 discussion [3]: We decided that race issue didn't block this series based on: - That race issue was not introduced by this series - We had a looks-ok fix for that race issue. Need to wait for mlock_count fixing patch as Yosry Ahmed suggested [4] [1] https://lore.kernel.org/linux-mm/CAOUHufbtNPkdktjt_5qM45GegVO-rCFOMkSh0HQminQ12zsV8Q@mail.gmail.com/ [2] https://lore.kernel.org/linux-mm/20230809061105.3369958-1-fengwei.yin@intel.com/ [3] https://lore.kernel.org/linux-mm/CAOUHufZ6=9P_=CAOQyw0xw-3q707q-1FVV09dBNDC-hpcpj2Pg@mail.gmail.com/ This patch (of 3): folio_in_range() will be used to check whether the folio is mapped to specific VMA and whether the mapping address of folio is in the range. Also a helper function folio_within_vma() to check whether folio is in the range of vma based on folio_in_range(). Link: https://lkml.kernel.org/r/20230918073318.1181104-1-fengwei.yin@intel.com Link: https://lkml.kernel.org/r/20230918073318.1181104-2-fengwei.yin@intel.com Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Usama Arif
|
fde1c4ecf9 |
mm: hugetlb: skip initialization of gigantic tail struct pages if freed by HVO
The new boot flow when it comes to initialization of gigantic pages is as follows: - At boot time, for a gigantic page during __alloc_bootmem_hugepage, the region after the first struct page is marked as noinit. - This results in only the first struct page to be initialized in reserve_bootmem_region. As the tail struct pages are not initialized at this point, there can be a significant saving in boot time if HVO succeeds later on. - Later on in the boot, the head page is prepped and the first HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages are initialized. - HVO is attempted. If it is not successful, then the rest of the tail struct pages are initialized. If it is successful, no more tail struct pages need to be initialized saving significant boot time. The WARN_ON for increased ref count in gather_bootmem_prealloc was changed to a VM_BUG_ON. This is OK as there should be no speculative references this early in boot process. The VM_BUG_ON's are there just in case such code is introduced. [akpm@linux-foundation.org: make it nicer for 80 cols] Link: https://lkml.kernel.org/r/20230913105401.519709-5-usama.arif@bytedance.com Signed-off-by: Usama Arif <usama.arif@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |