path: root/mm
Age | Commit message | Author | Files | Lines
2024-03-05 | page_frag: unify gfp bits for order 3 page allocation | Yunsheng Lin | 1 | -2/+2
Currently there seem to be three page frag implementations, all of which try to allocate an order 3 page and, if that fails, fall back to allocating an order 0 page. Each of them allows the order 3 allocation to fail under certain conditions by using specific gfp bits. The gfp bits for the order 3 allocation differ between implementations: __GFP_NOMEMALLOC is or'd in to forbid access to emergency memory reserves in __page_frag_cache_refill(), but it is not or'd in by the other implementations; __GFP_DIRECT_RECLAIM is masked off to avoid direct reclaim in vhost_net_page_frag_refill(), but it is not masked off in __page_frag_cache_refill(). This patch unifies the gfp bits used by the different implementations by or'ing in __GFP_NOMEMALLOC and masking off __GFP_DIRECT_RECLAIM for the order 3 allocation, to avoid possible pressure on mm. Leave the gfp unification for the page frag implementation in sock.c for now, as suggested by Paolo Abeni. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> CC: Alexander Duyck <alexander.duyck@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
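The mask manipulation described in this commit can be sketched in userspace C as follows. This is only an illustration: the DEMO_GFP_* names and their bit values are hypothetical stand-ins chosen for the example, not the kernel's definitions, and the real refill functions set further flags that are not modelled here.

```c
#include <stdint.h>

/* Hypothetical flag type and bit values, for illustration only. */
typedef uint32_t demo_gfp_t;

#define DEMO_GFP_DIRECT_RECLAIM 0x1u /* stand-in for __GFP_DIRECT_RECLAIM */
#define DEMO_GFP_NOMEMALLOC     0x2u /* stand-in for __GFP_NOMEMALLOC */
#define DEMO_GFP_KSWAPD_RECLAIM 0x4u /* an unrelated bit that must survive */

/*
 * Derive the mask for the opportunistic order 3 attempt from the caller's
 * mask: never enter direct reclaim, never dip into emergency reserves.
 * The order 0 fallback would keep using the caller's original mask.
 */
static demo_gfp_t demo_order3_gfp(demo_gfp_t gfp)
{
    return (gfp & ~DEMO_GFP_DIRECT_RECLAIM) | DEMO_GFP_NOMEMALLOC;
}
```

Because both adjustments are applied in one helper, every caller that uses it ends up with the same bits for the high-order attempt, which is the unification the message argues for.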
2024-03-05 | mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument | Yunsheng Lin | 1 | -4/+4
napi_alloc_frag_align() and netdev_alloc_frag_align() accept align as an argument, and they are thin wrappers around the __napi_alloc_frag_align() and __netdev_alloc_frag_align() APIs, doing the alignment checking and align mask conversion in order to call page_frag_alloc_align() directly. The intention is to keep the alignment checking and the align mask conversion in an inline wrapper so that these operations are not performed at run time, since they can usually be handled at compile time. We are going to use page_frag_alloc_align() in vhost_net.c, which needs the same kind of alignment checking and align mask conversion, so split page_frag_alloc_align() into an inline wrapper doing the above operations, and add __page_frag_alloc_align(), which is passed the align mask the original function expected, as suggested by Alexander. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Alexander Duyck <alexander.duyck@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
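The "alignment checking and align mask conversion" mentioned above can be illustrated with a small userspace sketch. The demo_* helpers are hypothetical names for this example; the assumption is that the mask is the two's-complement negation of a power-of-two alignment, which clears the low bits when AND-ed with an offset.

```c
#include <stdbool.h>

/* An alignment is only valid if it is a non-zero power of two. */
static bool demo_is_power_of_2(unsigned int n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Convert an alignment into a mask. For align == 64 the result is
 * 0xffffffc0: every bit below the alignment bit is cleared.
 * Unsigned subtraction wraps, so this is well defined in C.
 */
static unsigned int demo_align_to_mask(unsigned int align)
{
    return 0u - align;
}

/* Round an offset down to the requested alignment using the mask. */
static unsigned int demo_align_down(unsigned int offset, unsigned int align)
{
    return offset & demo_align_to_mask(align);
}
```

When align is a compile-time constant and the wrapper is inlined, both the check and the negation can be folded by the compiler, which is the stated reason for keeping them in an inline wrapper.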
2024-03-05 | slab: remove PARTIAL_NODE slab_state | Chengming Zhou | 1 | -1/+0
The PARTIAL_NODE slab_state has gone with SLAB removed, so just remove it. Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-03-04 | mm/zsmalloc: don't need to reserve LSB in handle | Chengming Zhou | 1 | -4/+1
We will save allocated tag in the object header to indicate that it's allocated. handle |= OBJ_ALLOCATED_TAG; So the object header needs to reserve LSB for this tag bit. But the handle itself doesn't need to reserve LSB to save tag, since it's only used to find the position of object, by (pfn + obj_idx). So remove LSB reserve from handle, one more bit can be used as obj_idx. Link: https://lkml.kernel.org/r/20240228023854.3511239-1-chengming.zhou@linux.dev Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
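As a rough illustration of why the location value does not need a spare low bit, the sketch below packs a (pfn, obj_idx) pair into one word with no tag bit at all. The field width DEMO_OBJ_INDEX_BITS and the demo_* helper names are assumptions made for this example and do not reflect zsmalloc's actual layout.

```c
#include <stdint.h>

/* Hypothetical width of the obj_idx field; every bit goes to the index. */
#define DEMO_OBJ_INDEX_BITS 12u
#define DEMO_OBJ_INDEX_MASK ((UINT64_C(1) << DEMO_OBJ_INDEX_BITS) - 1)

/* Pack a page frame number and an object index into one location word. */
static uint64_t demo_location_encode(uint64_t pfn, uint32_t obj_idx)
{
    return (pfn << DEMO_OBJ_INDEX_BITS) | (obj_idx & DEMO_OBJ_INDEX_MASK);
}

/* Recover both fields; no shift is needed to skip over a tag bit. */
static void demo_location_decode(uint64_t loc, uint64_t *pfn, uint32_t *obj_idx)
{
    *pfn = loc >> DEMO_OBJ_INDEX_BITS;
    *obj_idx = (uint32_t)(loc & DEMO_OBJ_INDEX_MASK);
}
```

If the lowest bit had to stay free for a tag, the same word could only hold an index of DEMO_OBJ_INDEX_BITS - 1 bits; dropping the reservation is what yields the extra bit the message refers to.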
2024-03-04 | mm/memory.c: do_numa_page(): remove a redundant page table read | John Hubbard | 1 | -6/+6
do_numa_page() is reading from the same page table entry, twice, while holding the page table lock: once while checking that the pte hasn't changed, and again in order to modify the pte. Instead, just read the pte once, and save it in the same old_pte variable that already exists. This has no effect on behavior, other than to provide a tiny potential improvement to performance, by avoiding the redundant memory read (which the compiler cannot elide, due to READ_ONCE()). Also improve the associated comments nearby. Link: https://lkml.kernel.org/r/20240228034151.459370-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: add alloc_contig_migrate_range allocation statistics | Richard Chang | 3 | -7/+30
alloc_contig_migrate_range has all the information needed to understand the latency of large contiguous allocations: for example, how many pages were migrated and how many times they had to be unmapped from page tables. This patch adds a trace event to collect these allocation statistics. In the field, it has been quite useful for understanding CMA allocation latency. [akpm@linux-foundation.org: a/trace_mm_alloc_config_migrate_range_info_enabled/trace_mm_alloc_contig_migrate_range_info_enabled] Link: https://lkml.kernel.org/r/20240228051127.2859472-1-richardycc@google.com Signed-off-by: Richard Chang <richardycc@google.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Martin Liu <liumartin@google.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: use folio more widely in __split_huge_page | Matthew Wilcox (Oracle) | 1 | -10/+11
We already have a folio; use it instead of the head page where reasonable. Saves a couple of calls to compound_head() and eliminates a few references to page->mapping. Link: https://lkml.kernel.org/r/20240228164326.1355045-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: convert free_swap_cache() to take a folio | Matthew Wilcox (Oracle) | 3 | -8/+8
All but one caller already has a folio, so convert free_page_and_swap_cache() to have a folio and remove the call to page_folio(). Link: https://lkml.kernel.org/r/20240227174254.710559-19-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: use a folio in __collapse_huge_page_copy_succeeded() | Matthew Wilcox (Oracle) | 1 | -16/+14
These pages are all chained together through the lru list, so we know they're folios. Use the folio APIs to save three hidden calls to compound_head(). Link: https://lkml.kernel.org/r/20240227174254.710559-18-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: convert free_pages_and_swap_cache() to use folios_put() | Matthew Wilcox (Oracle) | 1 | -8/+13
Process the pages in batch-sized quantities instead of all-at-once. Link: https://lkml.kernel.org/r/20240227174254.710559-17-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: remove free_unref_page_list() | Matthew Wilcox (Oracle) | 2 | -19/+0
All callers now use free_unref_folios() so we can delete this function. Link: https://lkml.kernel.org/r/20240227174254.710559-15-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | memcg: remove mem_cgroup_uncharge_list() | Matthew Wilcox (Oracle) | 1 | -19/+0
All users have been converted to mem_cgroup_uncharge_folios() so we can remove this API. Link: https://lkml.kernel.org/r/20240227174254.710559-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: free folios directly in move_folios_to_lru() | Matthew Wilcox (Oracle) | 1 | -20/+12
The few folios which can't be moved to the LRU list (because their refcount dropped to zero) used to be returned to the caller to dispose of. Make this simpler to call by freeing the folios directly through free_unref_folios(). Link: https://lkml.kernel.org/r/20240227174254.710559-13-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: free folios in a batch in shrink_folio_list() | Matthew Wilcox (Oracle) | 1 | -11/+9
Use free_unref_page_batch() to free the folios. This may increase the number of IPIs from calling try_to_unmap_flush() more often, but that's going to be very workload-dependent. It may even reduce the number of IPIs as we now batch-free large folios instead of freeing them one at a time. Link: https://lkml.kernel.org/r/20240227174254.710559-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: allow non-hugetlb large folios to be batch processed | Matthew Wilcox (Oracle) | 1 | -2/+3
Hugetlb folios still get special treatment, but normal large folios can now be freed by free_unref_folios(). This should have a reasonable performance impact, TBD. Link: https://lkml.kernel.org/r/20240227174254.710559-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: handle large folios in free_unref_folios() | Matthew Wilcox (Oracle) | 1 | -8/+17
Call folio_undo_large_rmappable() if needed. free_unref_page_prepare() destroys the ability to call folio_order(), so stash the order in folio->private for the benefit of the second loop. Link: https://lkml.kernel.org/r/20240227174254.710559-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: use __page_cache_release() in folios_put() | Matthew Wilcox (Oracle) | 1 | -33/+29
Pass a pointer to the lruvec so we can take advantage of the folio_lruvec_relock_irqsave(). Adjust the calling convention of folio_lruvec_relock_irqsave() to suit and add a page_cache_release() wrapper. Link: https://lkml.kernel.org/r/20240227174254.710559-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: use free_unref_folios() in put_pages_list() | Matthew Wilcox (Oracle) | 1 | -7/+10
Break up the list of folios into batches here so that the folios are more likely to be cache hot when doing the rest of the processing. Link: https://lkml.kernel.org/r/20240227174254.710559-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: remove use of folio list from folios_put() | Matthew Wilcox (Oracle) | 1 | -7/+12
Instead of putting the interesting folios on a list, delete the uninteresting one from the folio_batch. Link: https://lkml.kernel.org/r/20240227174254.710559-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
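The idea of deleting uninteresting entries from the batch, rather than collecting interesting ones on a separate list, can be sketched as an in-place array compaction. The structures and the demo_* names below are hypothetical userspace stand-ins, assuming "interesting" means the reference count dropped to zero.

```c
#include <stddef.h>

#define DEMO_BATCH_SIZE 15

/* Minimal stand-ins for a folio and a fixed-size batch of folio pointers. */
struct demo_item {
    int refcount;
};

struct demo_batch {
    unsigned int nr;
    struct demo_item *items[DEMO_BATCH_SIZE];
};

/*
 * Drop one reference on every entry. Entries that are still referenced are
 * removed from the batch; entries whose count reached zero are kept and
 * compacted towards the front. Returns the number of entries left to free.
 */
static unsigned int demo_batch_put(struct demo_batch *b)
{
    unsigned int i, j = 0;

    for (i = 0; i < b->nr; i++) {
        struct demo_item *it = b->items[i];

        if (--it->refcount > 0)
            continue;       /* still in use: drop it from the batch */
        b->items[j++] = it; /* last reference gone: keep for freeing */
    }
    b->nr = j;
    return j;
}
```

After the call, the first b->nr slots hold exactly the entries that need freeing, so a later pass can walk the same array without any list manipulation.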
2024-03-04 | memcg: add mem_cgroup_uncharge_folios() | Matthew Wilcox (Oracle) | 1 | -0/+13
Almost identical to mem_cgroup_uncharge_list(), except it takes a folio_batch instead of a list_head. Link: https://lkml.kernel.org/r/20240227174254.710559-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: use folios_put() in __folio_batch_release() | Matthew Wilcox (Oracle) | 1 | -2/+1
There's no need to indirect through release_pages() and iterate over this batch of folios an extra time; we can just use the batch that we have. Link: https://lkml.kernel.org/r/20240227174254.710559-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: add free_unref_folios() | Matthew Wilcox (Oracle) | 2 | -25/+39
Iterate over a folio_batch rather than a linked list. This is easier for the CPU to prefetch and has a batch count naturally built in so we don't need to track it. Again, this lowers the maximum lock hold time from 32 folios to 15, but I do not expect this to have a significant effect. Link: https://lkml.kernel.org/r/20240227174254.710559-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
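To illustrate the point that an array-based batch has its count "naturally built in", here is a hedged userspace sketch. The demo_fbatch type and helpers are invented for this example; the only facts taken from the log are the batch capacity of 15 and the idea of flushing whenever the batch fills.

```c
#define DEMO_FBATCH_SIZE 15

/* A fixed-capacity batch of pointers; nr doubles as the batch count. */
struct demo_fbatch {
    unsigned int nr;
    void *slots[DEMO_FBATCH_SIZE];
};

/* Append an entry and return how many free slots remain (0 means full). */
static unsigned int demo_fbatch_add(struct demo_fbatch *fb, void *p)
{
    fb->slots[fb->nr++] = p;
    return DEMO_FBATCH_SIZE - fb->nr;
}

/*
 * Feed n items through the batch, flushing whenever it fills up, and
 * return the number of flushes. A real flush would free the entries
 * under one lock acquisition; here it only resets the batch.
 */
static unsigned int demo_free_in_batches(void **items, unsigned int n)
{
    struct demo_fbatch fb = { 0 };
    unsigned int i, flushes = 0;

    for (i = 0; i < n; i++) {
        if (demo_fbatch_add(&fb, items[i]) > 0)
            continue;
        flushes++;
        fb.nr = 0;
    }
    if (fb.nr)
        flushes++; /* final partial batch */
    return flushes;
}
```

With 32 items this takes three flushes (15, 15 and 2 entries), so no flush ever covers more than 15 entries, compared with up to 32 under the previous list-based limit mentioned in the message.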
2024-03-04 | mm: convert free_unref_page_list() to use folios | Matthew Wilcox (Oracle) | 1 | -18/+20
Most of its callees are not yet ready to accept a folio, but we know all of the pages passed in are actually folios because they're linked through ->lru. Link: https://lkml.kernel.org/r/20240227174254.710559-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: make folios_put() the basis of release_pages() | Matthew Wilcox (Oracle) | 2 | -43/+60
Patch series "Rearrange batched folio freeing", v3. Other than the obvious "remove calls to compound_head" changes, the fundamental belief here is that iterating a linked list is much slower than iterating an array (5-15x slower in my testing). There's also an associated belief that since we iterate the batch of folios three times, we do better when the array is small (ie 15 entries) than we do with a batch that is hundreds of entries long, which only gives us the opportunity for the first pages to fall out of cache by the time we get to the end. It is possible we should increase the size of folio_batch. Hopefully the bots let us know if this introduces any performance regressions. This patch (of 3): By making release_pages() call folios_put(), we can get rid of the calls to compound_head() for the callers that already know they have folios. We can also get rid of the lock_batch tracking as we know the size of the batch is limited by folio_batch. This does reduce the maximum number of pages for which the lruvec lock is held, from SWAP_CLUSTER_MAX (32) to PAGEVEC_SIZE (15). I do not expect this to make a significant difference, but if it does, we can increase PAGEVEC_SIZE to 31. Link: https://lkml.kernel.org/r/20240227174254.710559-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240227174254.710559-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm/khugepaged: keep mm in mm_slot without MMF_DISABLE_THP check | Lance Yang | 1 | -3/+3
Previously, we removed the mm from mm_slot and dropped mm_count if the MMF_DISABLE_THP flag was set. However, we didn't re-add the mm after the flag was cleared. Additionally, add a check for the MMF_DISABLE_THP flag in hugepage_vma_revalidate(). Link: https://lkml.kernel.org/r/20240227035135.54593-1-ioworker0@gmail.com Fixes: 879c6000e191 ("mm/khugepaged: bypassing unnecessary scans with MMF_DISABLE_THP check") Signed-off-by: Lance Yang <ioworker0@gmail.com> Suggested-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm/memfd: refactor memfd_tag_pins() and memfd_wait_for_pins() | David Hildenbrand | 1 | -29/+18
Patch series "mm: remove total_mapcount()", v2. Let's remove the remaining user from mm/memfd.c so we can get rid of total_mapcount(). This patch (of 2): Both functions are the remaining users of total_mapcount(). Let's get rid of the calls by converting the code to folios. As it turns out, the code is unnecessarily complicated, especially: 1) We can query the number of pagecache references for a folio simply via folio_nr_pages(). This will handle other folio sizes in the future correctly. 2) The xas_set(xas, page->index + cache_count) call to increment the iterator for large folios is not required. Remove it. Further, simplify the XA_CHECK_SCHED check, counting each entry exactly once. Memfd pages can be swapped out when using shmem; leave xa_is_value() checks in place. Link: https://lkml.kernel.org/r/20240226141324.278526-1-david@redhat.com Link: https://lkml.kernel.org/r/20240226141324.278526-2-david@redhat.com Co-developed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
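Point 1) above can be illustrated with a simplified pin check. The sketch assumes, as a simplification not spelled out in the message, that every page table mapping holds one reference and that the page cache holds one reference per page of the folio; anything beyond that is treated as an extra pin. The demo_folio structure and helper names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the fields such a check would consult. */
struct demo_folio {
    long refcount;      /* total references held on the folio */
    long mapcount;      /* references held by page table mappings */
    unsigned int order; /* the folio spans 1 << order pages */
};

/* Number of pages in the folio, and thus of expected page cache refs. */
static long demo_folio_nr_pages(const struct demo_folio *f)
{
    return 1L << f->order;
}

/* True if some reference is explained neither by mappings nor by the cache. */
static bool demo_folio_has_extra_refs(const struct demo_folio *f)
{
    return f->refcount - f->mapcount != demo_folio_nr_pages(f);
}
```

Because the expected count is derived from the folio's size instead of being hard-coded, the same check keeps working for other folio sizes, which is the property the message highlights.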
2024-03-04 | mm: huge_memory: enable debugfs to split huge pages to any order | Zi Yan | 1 | -12/+22
It is used to test split_huge_page_to_list_to_order for pagecache THPs. Also add test cases for split_huge_page_to_list_to_order via both debugfs. [ziy@nvidia.com: fix issue discovered with NFS] Link: https://lkml.kernel.org/r/262E4DAA-4A78-4328-B745-1355AE356A07@nvidia.com Link: https://lkml.kernel.org/r/20240226205534.1603748-9-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Tested-by: Aishwarya TCV <aishwarya.tcv@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Aishwarya TCV <aishwarya.tcv@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: thp: split huge page to any lower order pages | Zi Yan | 1 | -24/+83
To split a THP to any lower order pages, we need to reform THPs on subpages at given order and add page refcount based on the new page order. Also we need to reinitialize page_deferred_list after removing the page from the split_queue, otherwise a subsequent split will see list corruption when checking the page_deferred_list again. Note: Anonymous order-1 folio is not supported because _deferred_list, which is used by partially mapped folios, is stored in subpage 2 and an order-1 folio only has subpage 0 and 1. File-backed order-1 folios are fine, since they do not use _deferred_list. [ziy@nvidia.com: fixup per discussion with Ryan] Link: https://lkml.kernel.org/r/494F48CD-1F0F-4CAD-884E-6D48F40AF990@nvidia.com Link: https://lkml.kernel.org/r/20240226205534.1603748-8-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
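The arithmetic behind splitting to a lower order, together with the order-1 restriction in the note above, can be sketched as follows. These demo_* helpers are hypothetical and only model the constraints stated in the message; they do not reproduce the checks in the kernel's split path.

```c
#include <stdbool.h>

/* Number of folios produced when an old_order folio is split to new_order. */
static unsigned int demo_split_count(unsigned int old_order,
                                     unsigned int new_order)
{
    return 1u << (old_order - new_order);
}

/*
 * A split must go to a strictly lower order, and anonymous folios cannot
 * be split to order 1 because they would have no room for the deferred
 * list. File-backed folios are allowed to use order 1.
 */
static bool demo_split_order_ok(bool is_anon, unsigned int old_order,
                                unsigned int new_order)
{
    if (new_order >= old_order)
        return false;
    if (is_anon && new_order == 1)
        return false;
    return true;
}
```

For example, splitting an order-9 folio to order 2 yields 128 order-2 folios, each covering 4 pages, for the same 512 pages in total.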
2024-03-04 | mm: page_owner: add support for splitting to any order in split page_owner | Zi Yan | 3 | -7/+6
It adds a new_order parameter to set new page order in page owner. It prepares for upcoming changes to support split huge page to any lower order. Link: https://lkml.kernel.org/r/20240226205534.1603748-7-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: memcg: make memcg huge page split support any order split | Zi Yan | 3 | -8/+9
It sets memcg information for the pages after the split. A new parameter new_order is added to tell the order of subpages in the new page, always 0 for now. It prepares for upcoming changes to support split huge page to any lower order. Link: https://lkml.kernel.org/r/20240226205534.1603748-6-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm/page_owner: use order instead of nr in split_page_owner() | Zi Yan | 3 | -4/+5
We do not have non power of two pages, using nr is error prone if nr is not power-of-two. Use page order instead. Link: https://lkml.kernel.org/r/20240226205534.1603748-5-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
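A short sketch of why an order parameter is less error prone than a page count: every order maps to a valid power-of-two count, whereas an arbitrary count such as 12 cannot be converted back without silently losing pages. The demo_* helper names are assumptions for this example.

```c
/* Every order corresponds to exactly one valid page count. */
static unsigned int demo_order_to_nr(unsigned int order)
{
    return 1u << order;
}

/*
 * Floor of log2(nr) for nr >= 1. For a count that is not a power of two
 * this rounds down, so converting back does not give the original count.
 */
static unsigned int demo_nr_to_order(unsigned int nr)
{
    unsigned int order = 0;

    while (nr >>= 1)
        order++;
    return order;
}
```

Round-tripping 8 gives 8 again, but round-tripping 12 gives 8: an interface that takes nr has to either reject such values or mis-handle them, while an interface that takes order cannot be handed one in the first place.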
2024-03-04 | mm/memcg: use order instead of nr in split_page_memcg() | Zi Yan | 3 | -5/+7
We do not have non power of two pages, using nr is error prone if nr is not power-of-two. Use page order instead. Link: https://lkml.kernel.org/r/20240226205534.1603748-4-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm: support order-1 folios in the page cache | Matthew Wilcox (Oracle) | 4 | -11/+16
Folios of order 1 have no space to store the deferred list. This is not a problem for the page cache as file-backed folios are never placed on the deferred list. All we need to do is prevent the core MM from touching the deferred list for order 1 folios and remove the code which prevented us from allocating order 1 folios. Link: https://lore.kernel.org/linux-mm/90344ea7-4eec-47ee-5996-0c22f42d6a6a@google.com/ Link: https://lkml.kernel.org/r/20240226205534.1603748-3-zi.yan@sent.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | mm/huge_memory: only split PMD mapping when necessary in unmap_folio() | Zi Yan | 1 | -2/+5
Patch series "Split a folio to any lower order folios", v5. File folio supports any order and multi-size THP is upstreamed[1], so both file and anonymous folios can be >0 order. Currently, split_huge_page() only splits a huge page to order-0 pages, but splitting to orders higher than 0 might better utilize large folios, if done properly. In addition, Large Block Sizes in XFS support would benefit from it during truncate[2]. This patchset adds support for splitting a large folio to any lower order folios. In addition to this implementation of split_huge_page_to_list_to_order(), a possible optimization could be splitting a large folio to arbitrary smaller folios instead of a single order. As both Hugh and Ryan pointed out [3,5] that split to a single order might not be optimal, an order-9 folio might be better split into 1 order-8, 1 order-7, ..., 1 order-1, and 2 order-0 folios, depending on subsequent folio operations. Leave this as future work. [1] https://lore.kernel.org/all/20231207161211.2374093-1-ryan.roberts@arm.com/ [2] https://lore.kernel.org/linux-mm/20240226094936.2677493-1-kernel@pankajraghav.com/ [3] https://lore.kernel.org/linux-mm/9dd96da-efa2-5123-20d4-4992136ef3ad@google.com/ [4] https://lore.kernel.org/linux-mm/cbb1d6a0-66dd-47d0-8733-f836fe050374@arm.com/ [5] https://lore.kernel.org/linux-mm/20240213215520.1048625-1-zi.yan@sent.com/ This patch (of 8): As multi-size THP support is added, not all THPs are PMD-mapped, thus during a huge page split, there is no need to always split PMD mapping in unmap_folio(). Make it conditional. Link: https://lkml.kernel.org/r/20240226205534.1603748-1-zi.yan@sent.com Link: https://lkml.kernel.org/r/20240226205534.1603748-2-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
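The non-uniform split suggested as future work in the cover letter (one folio each of order 8 down to order 1, plus two order-0 folios) can be checked arithmetically. The helper below is a hypothetical sketch that only does the bookkeeping; it is not how such a split would be implemented in the kernel.

```c
/*
 * For a folio of the given order (>= 1), fill counts[o] with the number of
 * order-o folios in the non-uniform split: one folio for each order from
 * order - 1 down to 1, and two folios of order 0. counts must have at least
 * `order` entries. Returns the total number of pages covered.
 */
static unsigned long demo_nonuniform_split(unsigned int order,
                                           unsigned int *counts)
{
    unsigned long pages = 0;
    unsigned int o;

    for (o = 0; o < order; o++) {
        counts[o] = (o == 0) ? 2 : 1;
        pages += (unsigned long)counts[o] << o;
    }
    return pages;
}
```

For order 9 the pieces cover 256 + 128 + ... + 2 + 2 * 1 = 512 pages, exactly the size of the original folio, while producing only 10 folios rather than the 512 that a uniform split to order 0 would create.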
2024-03-04 | mm: madvise: pageout: ignore references rather than clearing young | Barry Song | 4 | -11/+13
While doing MADV_PAGEOUT, the current code will clear PTE young so that vmscan won't read young flags to allow the reclamation of madvised folios to go ahead. It seems we can do it by directly ignoring references, thus we can remove tlb flush in madvise and rmap overhead in vmscan. Regarding the side effect, in the original code, if a parallel thread runs side by side to access the madvised memory with the thread doing madvise, folios will get a chance to be re-activated by vmscan (though the time gap is actually quite small since checking PTEs is done immediately after clearing PTEs young). But with this patch, they will still be reclaimed. But this behaviour doing PAGEOUT and doing access at the same time is quite silly like DoS. So probably, we don't need to care. Or ignoring the new access during the quite small time gap is even better. For DAMON's DAMOS_PAGEOUT based on physical address region, we still keep its behaviour as is since a physical address might be mapped by multiple processes. MADV_PAGEOUT based on virtual address is actually much more aggressive on reclamation. To untouch paddr's DAMOS_PAGEOUT, we simply pass ignore_references as false in reclaim_pages(). 
The microbenchmark below shows a 6% reduction in the latency of MADV_PAGEOUT:

    #include <sys/mman.h>

    #define PGSIZE 4096
    #define SIZE 512*1024*1024

    int main(void)
    {
        int i;
        /* Map 512 MiB of anonymous memory. */
        volatile long *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Touch one long per page so that every page is populated. */
        for (i = 0; i < SIZE/sizeof(long); i += PGSIZE / sizeof(long))
            p[i] = 0x11;

        /* Ask the kernel to reclaim the whole range. */
        madvise((void *)p, SIZE, MADV_PAGEOUT);
        return 0;
    }

    w/o patch                    w/ patch
    root@10:~# time ./a.out      root@10:~# time ./a.out
    real    0m49.634s            real    0m46.334s
    user    0m0.637s             user    0m0.648s
    sys     0m47.434s            sys     0m44.265s

Link: https://lkml.kernel.org/r/20240226005739.24350-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: SeongJae Park <sj@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04 | kasan: fix a2 allocation and remove explicit cast in atomic tests | Paul Heidekrüger | 1 | -3/+3
Address the additional feedback since 4e76c8cc3378 ("kasan: add atomic tests") by removing an explicit cast and fixing the size as well as the check of the allocation of `a2`. Link: https://lkml.kernel.org/r/20240224105414.211995-1-paul.heidekrueger@tum.de Link: https://lore.kernel.org/all/20240131210041.686657-1-paul.heidekrueger@tum.de/T/#u Fixes: 4e76c8cc3378 ("kasan: add atomic tests") Signed-off-by: Paul Heidekrüger <paul.heidekrueger@tum.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=214055 Reviewed-by: Marco Elver <elver@google.com> Tested-by: Marco Elver <elver@google.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm: update mark_victim tracepoints fieldsGravatar Carlos Galo 1-1/+5
The current implementation of the mark_victim tracepoint provides only the process ID (pid) of the victim process. This limitation poses challenges for userspace tools requiring real-time OOM analysis and intervention. Although this information is available from the kernel logs, it’s not the appropriate format to provide OOM notifications. In Android, BPF programs are used with the mark_victim trace events to notify userspace of an OOM kill. For consistency, update the trace event to include the same information about the OOMed victim as the kernel logs.
- UID
  In Android each installed application has a unique UID. Including the `uid` assists in correlating OOM events with specific apps.
- Process Name (comm)
  Enables identification of the affected process.
- OOM Score
  Will allow userspace to get additional insight of the relative kill priority of the OOM victim. In Android, the oom_score_adj is used to categorize app state (foreground, background, etc.), which aids in analyzing user-perceptible impacts of OOM events [1].
- Total VM, RSS Stats, and pgtables
  Amount of memory used by the victim that will, potentially, be freed up by killing it.
[1] https://cs.android.com/android/platform/superproject/main/+/246dc8fc95b6d93afcba5c6d6c133307abb3ac2e:frameworks/base/services/core/java/com/android/server/am/ProcessList.java;l=188-283
Signed-off-by: Carlos Galo <carlosgalo@google.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04hugetlb: allow faults to be handled under the VMA lockGravatar Vishal Moola (Oracle) 1-6/+0
Hugetlb can now safely handle faults under the VMA lock, so allow it to do so. This patch may cause ltp hugemmap10 to "fail". Hugemmap10 tests hugetlb counters, and expects the counters to remain unchanged on failure to handle a fault. In hugetlb_no_page(), vmf_anon_prepare() may bail out with no anon_vma under the VMA lock after allocating a folio for the hugepage. In free_huge_folio(), this folio is completely freed on bailout iff there is a surplus of hugetlb pages. This will remove a folio off the freelist and decrement the number of hugepages while ltp expects these counters to remain unchanged on failure. Originally this could only happen due to OOM failures, but now it may also occur after we allocate a hugetlb folio without a suitable anon_vma under the VMA lock. This should only happen for the first freshly allocated hugepage in this vma. Link: https://lkml.kernel.org/r/20240221234732.187629-6-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04hugetlb: use vmf_anon_prepare() instead of anon_vma_prepare()Gravatar Vishal Moola (Oracle) 1-9/+9
hugetlb_no_page() and hugetlb_wp() call anon_vma_prepare(). In preparation for hugetlb to safely handle faults under the VMA lock, use vmf_anon_prepare() here instead. Additionally, passing hugetlb_wp() the vm_fault struct from hugetlb_fault() works toward cleaning up the hugetlb code and function stack. Link: https://lkml.kernel.org/r/20240221234732.187629-5-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04hugetlb: pass struct vm_fault through to hugetlb_handle_userfault()Gravatar Vishal Moola (Oracle) 1-29/+9
Now that hugetlb_fault() has a struct vm_fault, have hugetlb_handle_userfault() use it instead of creating one of its own. This lets us reduce the number of arguments passed to hugetlb_handle_userfault() from 7 to 3, cleaning up the code and stack. Link: https://lkml.kernel.org/r/20240221234732.187629-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04hugetlb: move vm_fault declaration to the top of hugetlb_fault()Gravatar Vishal Moola (Oracle) 1-13/+19
hugetlb_fault() currently defines a vm_fault to pass to the generic handle_userfault() function. We can move this definition to the top of hugetlb_fault() so that it can be used throughout the rest of the hugetlb fault path. This will help clean up a number of excess variables and function arguments throughout the stack. Also, since vm_fault already has space to store the page offset, use that instead and get rid of idx. Link: https://lkml.kernel.org/r/20240221234732.187629-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm/memory: change vmf_anon_prepare() to be non-staticGravatar Vishal Moola (Oracle) 2-1/+2
Patch series "Handle hugetlb faults under the VMA lock", v2. It is generally safe to handle hugetlb faults under the VMA lock. The only time this is unsafe is when no anon_vma has been allocated to this vma yet, so we can use vmf_anon_prepare() instead of anon_vma_prepare() to bail out if necessary. This should only happen for the first hugetlb page in the vma. Additionally, this patchset begins to use struct vm_fault within hugetlb_fault(). This works towards cleaning up hugetlb code, and should significantly reduce the number of arguments passed to functions. The last patch in this series may cause ltp hugemmap10 to "fail". This is because vmf_anon_prepare() may bail out with no anon_vma under the VMA lock after allocating a folio for the hugepage. In free_huge_folio(), this folio is completely freed on bailout iff there is a surplus of hugetlb pages. This will remove a folio off the freelist and decrement the number of hugepages while ltp expects these counters to remain unchanged on failure. The rest of the ltp testcases pass. This patch (of 5): In order to handle hugetlb faults under the VMA lock, hugetlb can use vmf_anon_prepare() to ensure we can safely prepare an anon_vma. Change it to be a non-static function so it can be used within hugetlb as well. Link: https://lkml.kernel.org/r/20240221234732.187629-6-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20240221234732.187629-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm/page_alloc: make check_new_page() return boolGravatar Hao Ge 1-3/+3
Make check_new_page() return bool, like check_new_pages(). Link: https://lkml.kernel.org/r/20240222091932.54799-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm/util.c: add byte count to __vm_enough_memory failure warningGravatar Matthew Cassell 1-2/+4
Commit 44b414c8715c5dcf53288 ("mm/util.c: add warning if __vm_enough_memory fails") adds debug information which gives the process id and executable name should __vm_enough_memory() fail. Adding the number of bytes requested to the failure message would benefit application developers and system administrators in debugging overambitious memory requests by providing a point of reference to the amount of memory causing __vm_enough_memory() to fail.
1. Set appropriate kernel tunable to reach code path for failure message:
    # echo 2 > /proc/sys/vm/overcommit_memory
2. Test program to generate failure - requests 1 gibibyte per iteration:
    #include <stdlib.h>
    #include <stdio.h>
    int main(int argc, char **argv)
    {
            for (;;) {
                    if (malloc(1<<30) == NULL)
                            break;
                    printf("allocated 1 GiB\n");
            }
            return 0;
    }
3. Output:
    Before: __vm_enough_memory: pid: 1218, comm: a.out, not enough memory for the allocation
    After:  __vm_enough_memory: pid: 1137, comm: a.out, bytes: 1073741824, not enough memory for the allocation
Link: https://lkml.kernel.org/r/20240222194617.1255-1-mcassell411@gmail.com Signed-off-by: Matthew Cassell <mcassell411@gmail.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
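As a rough illustration of where the byte count in the "After" line comes from, the sketch below converts a page count into bytes and formats a line in the same layout. It is a userspace approximation: `SKETCH_PAGE_SHIFT`, `sketch_bytes_from_pages()` and `sketch_format_enomem()` are hypothetical names, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stdio.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Convert a request tracked in pages into the byte count that is reported. */
static unsigned long long sketch_bytes_from_pages(unsigned long pages)
{
	return (unsigned long long)pages << SKETCH_PAGE_SHIFT;
}

/*
 * Format a failure line in the "After" layout quoted above.
 * Returns what snprintf() returns: the length the full line needs.
 */
static int sketch_format_enomem(char *buf, size_t len, int pid,
				const char *comm, unsigned long pages)
{
	assert(buf && comm);
	return snprintf(buf, len,
			"__vm_enough_memory: pid: %d, comm: %s, bytes: %llu, "
			"not enough memory for the allocation",
			pid, comm, sketch_bytes_from_pages(pages));
}
```

Under the 4 KiB assumption, a 1 GiB request is 262144 pages, which the helper reports as 1073741824 bytes, the figure in the sample output.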
2024-03-04mm/zswap: change zswap_pool kref to percpu_refGravatar Chengming Zhou 1-15/+33
Every zswap entry takes a reference on its zswap_pool in zswap_store() and drops it when the entry is freed. Switch the pool's refcount from kref to percpu_ref, which scales better. percpu_ref uses a bit more memory, but that should be fine for our use case, since there is almost always only one zswap_pool in use. The performance gain is on the zswap_store/load hot path. Testing kernel build (32 threads) in tmpfs with memory.max=2GB. (zswap shrinker and writeback enabled with one 50GB swapfile, on a 128 CPUs x86-64 machine, below is the average of 5 runs)
            mm-unstable   zswap-global-lru
    real       63.20           63.12
    user     1061.75         1062.95
    sys       268.74          264.44
[chengming.zhou@linux.dev: fix zswap_pools_lock usages after changing to percpu_ref] Link: https://lkml.kernel.org/r/20240228154954.3028626-1-chengming.zhou@linux.dev Link: https://lkml.kernel.org/r/20240210-zswap-global-lru-v3-2-200495333595@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
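The scalability argument can be illustrated with a toy model of a per-CPU reference count. This is a single-threaded userspace sketch, not the kernel's percpu_ref API: the type and function names are hypothetical, the CPU count is fixed at an assumed 4, and the real mechanism (including how it switches to a shared atomic counter when the ref is killed) is left out.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 4	/* assumption: small fixed CPU count */

/* Hypothetical userspace model; this is not the kernel's percpu_ref. */
struct sketch_percpu_ref {
	long count[SKETCH_NR_CPUS];	/* one counter per CPU, never shared */
};

/* Take a reference on the CPU the caller runs on: touches one slot only. */
static void sketch_ref_get(struct sketch_percpu_ref *ref, int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	ref->count[cpu]++;
}

/* Drop a reference; this may happen on a different CPU than the get. */
static void sketch_ref_put(struct sketch_percpu_ref *ref, int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	ref->count[cpu]--;
}

/* Only the sum over all CPUs is meaningful; single slots may go negative. */
static long sketch_ref_total(const struct sketch_percpu_ref *ref)
{
	long sum = 0;

	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		sum += ref->count[cpu];
	return sum;
}
```

Because a get or put touches only the local slot, CPUs storing entries concurrently do not contend on one shared counter, which is what a kref-style single counter would force. The cost is the extra per-CPU storage the message mentions.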
2024-03-04mm/zswap: global lru and shrinker shared by all zswap_poolsGravatar Chengming Zhou 1-105/+66
Patch series "mm/zswap: optimize for dynamic zswap_pools", v3. Dynamic pool creation has been supported for a long time, though it may not be used much in practice. But with the per-memcg lru merged, the current structure of zswap_pool's lru and shrinker has become less optimal. In the current structure, each zswap_pool has its own lru, shrinker and shrink_work, but only the latest zswap_pool is the one currently in use.
1. Under memory pressure, the shrinkers of all zswap_pools try to shrink their own lru lists, with no ordering between them.
2. When the zswap limit is hit, only the last zswap_pool's shrink_work tries to shrink its own lru, which is inefficient.
A more natural way is to have a global zswap lru shared between all zswap_pools, and likewise a single shrinker. The code becomes much simpler too. Another optimization is changing the zswap_pool kref to a percpu_ref, on which every zswap entry takes a reference, so scalability is better. Testing kernel build (32 threads) in tmpfs with memory.max=2GB. (zswap shrinker and writeback enabled with one 50GB swapfile, on a 128 CPUs x86-64 machine, below is the average of 5 runs)
            mm-unstable   zswap-global-lru
    real       63.20           63.12
    user     1061.75         1062.95
    sys       268.74          264.44
This patch (of 3): Dynamic zswap_pool creation may create or reuse pools, leaving multiple zswap_pools in a list, of which only the first is currently in use. Each zswap_pool has its own lru and shrinker, which is unnecessary and has problems:
1. Under memory pressure, the shrinkers of all zswap_pools try to shrink their own lru, with no ordering between them.
2. When the zswap limit is hit, only the last zswap_pool's shrink_work tries to shrink its lru list. The rationale here was to try and empty the old pool first so that we can completely drop it. However, since we only support exclusive loads now, the LRU ordering should be entirely decided by the order of stores, so the oldest entries on the LRU will naturally be from the oldest pool.
Anyway, having a global lru and shrinker shared by all zswap_pools is better and more efficient. Link: https://lkml.kernel.org/r/20240210-zswap-global-lru-v3-0-200495333595@bytedance.com Link: https://lkml.kernel.org/r/20240210-zswap-global-lru-v3-1-200495333595@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
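The ordering argument above (stores decide LRU order, so the oldest entries come from the oldest pool) can be sketched with one list shared by all pools. The types and function names below are hypothetical and much simpler than zswap's real structures; there is no locking and no per-memcg handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types; not the kernel's zswap structures. */
struct sketch_pool {
	const char *name;
};

struct sketch_entry {
	struct sketch_pool *pool;	/* pool the entry was stored into */
	struct sketch_entry *next;	/* towards newer entries */
};

/* One LRU shared by every pool: head is the oldest, tail the newest. */
struct sketch_lru {
	struct sketch_entry *head;
	struct sketch_entry *tail;
};

/* A store appends to the tail, whichever pool is current. */
static void sketch_lru_add(struct sketch_lru *lru, struct sketch_entry *e,
			   struct sketch_pool *pool)
{
	assert(lru && e && pool);
	e->pool = pool;
	e->next = NULL;
	if (lru->tail)
		lru->tail->next = e;
	else
		lru->head = e;
	lru->tail = e;
}

/* The single shrinker pops the oldest entry, regardless of its pool. */
static struct sketch_entry *sketch_lru_shrink_one(struct sketch_lru *lru)
{
	struct sketch_entry *e = lru->head;

	if (!e)
		return NULL;
	lru->head = e->next;
	if (!lru->head)
		lru->tail = NULL;
	e->next = NULL;
	return e;
}
```

If entries are stored into an old pool first and a new pool later, shrinking this list drains the old pool's entries before touching the new pool's, without any per-pool shrinker having to coordinate that.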
2024-03-04mm, mmap: fix vma_merge() case 7 with vma_ops->closeGravatar Vlastimil Babka 1-1/+9
When debugging issues with a workload using SysV shmem, Michal Hocko has come up with a reproducer that shows how a series of mprotect() operations can result in an elevated shm_nattch and thus leak of the resource. The problem is caused by wrong assumptions in vma_merge() commit 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test"). The shmem vmas have a vma_ops->close callback that decrements shm_nattch, and we remove the vma without calling it. vma_merge() has thus historically avoided merging vma's with vma_ops->close and commit 714965ca8252 was supposed to keep it that way. It relaxed the checks for vma_ops->close in can_vma_merge_after() assuming that it is never called on a vma that would be a candidate for removal. However, the vma_merge() code does also use the result of this check in the decision to remove a different vma in the merge case 7. A robust solution would be to refactor vma_merge() code in a way that the vma_ops->close check is only done for vma's that are actually going to be removed, and not as part of the preliminary checks. That would both solve the existing bug, and also allow additional merges that the checks currently prevent unnecessarily in some cases. However to fix the existing bug first with a minimized risk, and for easier stable backports, this patch only adds a vma_ops->close check to the buggy case 7 specifically. All other cases of vma removal are covered by the can_vma_merge_before() check that includes the test for vma_ops->close. 
The reproducer code, adapted from Michal Hocko's code:
    #include <errno.h>
    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/mman.h>
    #include <sys/shm.h>
    #include <sys/stat.h>
    #ifndef PAGE_SIZE
    #define PAGE_SIZE 4096  /* assumes 4 KiB pages */
    #endif
    int main(int argc, char *argv[])
    {
            int segment_id;
            size_t segment_size = 20 * PAGE_SIZE;
            char * sh_mem;
            struct shmid_ds shmid_ds;
            key_t key = 0x1234;
            segment_id = shmget(key, segment_size,
                                IPC_CREAT | IPC_EXCL | S_IRUSR | S_IWUSR);
            sh_mem = (char *)shmat(segment_id, NULL, 0);
            mprotect(sh_mem + 2*PAGE_SIZE, PAGE_SIZE, PROT_NONE);
            mprotect(sh_mem + PAGE_SIZE, PAGE_SIZE, PROT_WRITE);
            mprotect(sh_mem + 2*PAGE_SIZE, PAGE_SIZE, PROT_WRITE);
            shmdt(sh_mem);
            shmctl(segment_id, IPC_STAT, &shmid_ds);
            printf("nattch after shmdt(): %lu (expected: 0)\n", shmid_ds.shm_nattch);
            if (shmctl(segment_id, IPC_RMID, 0))
                    printf("IPCRM failed %d\n", errno);
            return (shmid_ds.shm_nattch) ? 1 : 0;
    }
Link: https://lkml.kernel.org/r/20240222215930.14637-2-vbabka@suse.cz Fixes: 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm: userfaultfd: fix unexpected change to src_folio when UFFDIO_MOVE failsGravatar Qi Zheng 1-3/+3
After ptep_clear_flush(), if we find that src_folio is pinned we will fail UFFDIO_MOVE and put src_folio back to src_pte entry, but the change to src_folio->{mapping,index} is not restored in this process. This is not what we expected, so fix it. This can cause the rmap for that page to be invalid, possibly resulting in memory corruption. At least swapout+migration would no longer work, because we might fail to locate the mappings of that folio. Link: https://lkml.kernel.org/r/20240222080815.46291-1-zhengqi.arch@bytedance.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL allocationsGravatar Vlastimil Babka 3-11/+11
Sven reports an infinite loop in __alloc_pages_slowpath() for costly order __GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such a combination can happen in a suspend/resume context where a GFP_KERNEL allocation can have __GFP_IO masked out via gfp_allowed_mask. Quoting Sven:
1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER) with __GFP_RETRY_MAYFAIL set.
2. page alloc's __alloc_pages_slowpath tries to get a page from the freelist. This fails because there is nothing free of that costly order.
3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim, which bails out because a zone is ready to be compacted; it pretends to have made a single page of progress.
4. page alloc tries to compact, but this always bails out early because __GFP_IO is not set (it's not passed by the snd allocator, and even if it were, we are suspending so the __GFP_IO flag would be cleared anyway).
5. page alloc believes reclaim progress was made (because of the pretense in item 3) and so it checks whether it should retry compaction. The compaction retry logic thinks it should try again, because:
   a) reclaim is needed because of the early bail-out in item 4
   b) a zonelist is suitable for compaction
6. goto 2. indefinite stall.
(end quote)
The immediate root cause is that the COMPACT_SKIPPED returned from __alloc_pages_direct_compact() (step 4) due to the lack of __GFP_IO is mistaken for an indication of a lack of order-0 pages, and in step 5 should_compact_retry() evaluates that as a reason to retry, before incrementing and limiting the number of retries. There are however other places that wrongly assume that compaction can happen while we lack __GFP_IO. To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO evaluation and switch the open-coded test in try_to_compact_pages() to use it. 
Also use the new helper in:
- compaction_ready(), which will make reclaim not bail out in step 3, so there's at least one attempt to actually reclaim, even if chances are small for a costly order
- in_reclaim_compaction() which will make should_continue_reclaim() return false and we don't over-reclaim unnecessarily
- in __alloc_pages_slowpath() to set a local variable can_compact, which is then used to avoid retrying reclaim/compaction for costly allocations (step 5) if we can't compact and also to skip the early compaction attempt that we do in some cases
Link: https://lkml.kernel.org/r/20240221114357.13655-2-vbabka@suse.cz Fixes: 3250845d0526 ("Revert "mm, oom: prevent premature OOM killer invocation for high order request"") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Sven van Ashbrook <svenva@chromium.org> Closes: https://lore.kernel.org/all/CAG-rBihs_xMKb3wrMO1%2B-%2Bp4fowP9oy1pa_OTkfxBzPUVOZF%2Bg@mail.gmail.com/ Tested-by: Karthikeyan Ramasubramanian <kramasub@chromium.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Curtis Malainey <cujomalainey@chromium.org> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Takashi Iwai <tiwai@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
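The idea of funnelling every "can we compact?" decision through one helper can be sketched in userspace as follows. Everything here is an illustration under assumptions: the `SKETCH_` flag values are made up (only the composition of GFP_NOIO and GFP_KERNEL mirrors the kernel), and `sketch_should_retry()` is a heavily simplified stand-in for the slow-path retry logic, not the patched code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for illustration; the kernel's bits differ. */
#define SKETCH_GFP_RECLAIM		0x1u
#define SKETCH_GFP_IO			0x2u
#define SKETCH_GFP_FS			0x4u
#define SKETCH_GFP_RETRY_MAYFAIL	0x8u

#define SKETCH_GFP_NOIO		(SKETCH_GFP_RECLAIM)
#define SKETCH_GFP_KERNEL	(SKETCH_GFP_RECLAIM | SKETCH_GFP_IO | SKETCH_GFP_FS)

#define SKETCH_COSTLY_ORDER	3	/* mirrors PAGE_ALLOC_COSTLY_ORDER */
#define SKETCH_MAX_ORDER	10	/* assumption: upper bound for the sketch */

/* One place that answers "may this allocation compact at all?" */
static bool sketch_compaction_allowed(unsigned int gfp_mask)
{
	return (gfp_mask & SKETCH_GFP_IO) != 0;
}

/*
 * Simplified retry decision: a costly allocation that cannot compact
 * gives up instead of looping between reclaim and compaction.
 */
static bool sketch_should_retry(unsigned int gfp_mask, unsigned int order)
{
	bool can_compact = sketch_compaction_allowed(gfp_mask);
	bool costly = order > SKETCH_COSTLY_ORDER;

	assert(order <= SKETCH_MAX_ORDER);
	if (costly && !can_compact)
		return false;
	return true;
}
```

With the flag test in a single helper, the reclaim and retry paths cannot disagree about whether compaction is possible, and that disagreement is what allowed steps 3 to 5 of the quoted sequence to loop.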
2024-03-04mm, slab: remove memcg_from_slab_obj()Gravatar Vlastimil Babka 1-5/+0
This empty wrapper exists only for !CONFIG_MEMCG_KMEM and it seems it was never used. Probably a leftover from development of a series. Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>