aboutsummaryrefslogtreecommitdiff
path: root/mm/vmscan.c
AgeCommit message (Collapse)AuthorFilesLines
36 hoursMerge tag 'random-6.11-rc1-for-linus' of ↵Gravatar Linus Torvalds 1-9/+0
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: "This adds getrandom() support to the vDSO. First, it adds a new kind of mapping to mmap(2), MAP_DROPPABLE, which lets the kernel zero out pages anytime under memory pressure, which enables allocating memory that never gets swapped to disk but also doesn't count as being mlocked. Then, the vDSO implementation of getrandom() is introduced in a generic manner and hooked into random.c. Next, this is implemented on x86. (Also, though it's not ready for this pull, somebody has begun an arm64 implementation already) Finally, two vDSO selftests are added. There are also two housekeeping cleanup commits" * tag 'random-6.11-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: MAINTAINERS: add random.h headers to RNG subsection random: note that RNDGETPOOL was removed in 2.6.9-rc2 selftests/vDSO: add tests for vgetrandom x86: vdso: Wire up getrandom() vDSO implementation random: introduce generic vDSO getrandom() implementation mm: add MAP_DROPPABLE for designating always lazily freeable mappings
6 daysmm: add MAP_DROPPABLE for designating always lazily freeable mappingsGravatar Jason A. Donenfeld 1-9/+0
The vDSO getrandom() implementation works with a buffer allocated with a new system call that has certain requirements: - It shouldn't be written to core dumps. * Easy: VM_DONTDUMP. - It should be zeroed on fork. * Easy: VM_WIPEONFORK. - It shouldn't be written to swap. * Uh-oh: mlock is rlimited. * Uh-oh: mlock isn't inherited by forks. - It shouldn't reserve actual memory, but it also shouldn't crash when page faulting in memory if none is available * Uh-oh: VM_NORESERVE means segfaults. It turns out that the vDSO getrandom() function has three really nice characteristics that we can exploit to solve this problem: 1) Due to being wiped during fork(), the vDSO code is already robust to having the contents of the pages it reads zeroed out midway through the function's execution. 2) In the absolute worst case of whatever contingency we're coding for, we have the option to fallback to the getrandom() syscall, and everything is fine. 3) The buffers the function uses are only ever useful for a maximum of 60 seconds -- a sort of cache, rather than a long term allocation. These characteristics mean that we can introduce VM_DROPPABLE, which has the following semantics: a) It never is written out to swap. b) Under memory pressure, mm can just drop the pages (so that they're zero when read back again). c) It is inherited by fork. d) It doesn't count against the mlock budget, since nothing is locked. e) If there's not enough memory to service a page fault, it's not fatal, and no signal is sent. This way, allocations used by vDSO getrandom() can use: VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE And there will be no problem with OOMing, crashing on overcommitment, using memory when not in use, not wiping on fork(), coredumps, or writing out to swap. In order to let vDSO getrandom() use this, expose these via mmap(2) as MAP_DROPPABLE. Note that this involves removing the MADV_FREE special case from sort_folio(), which according to Yu Zhao is unnecessary and will simply result in an extra call to shrink_folio_list() in the worst case. The chunk removed reenables the swapbacked flag, which we don't want for VM_DROPPABLE, and we can't conditionalize it here because there isn't a vma reference available. Finally, the provided self test ensures that this is working as desired. Cc: linux-mm@kvack.org Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
8 daysmm/mglru: fix ineffective protection calculationGravatar Yu Zhao 1-44/+38
mem_cgroup_calculate_protection() is not stateless and should only be used as part of a top-down tree traversal. shrink_one() traverses the per-node memcg LRU instead of the root_mem_cgroup tree, and therefore it should not call mem_cgroup_calculate_protection(). The existing misuse in shrink_one() can cause ineffective protection of sub-trees that are grandchildren of root_mem_cgroup. Fix it by reusing lru_gen_age_node(), which already traverses the root_mem_cgroup tree, to calculate the protection. Previously lru_gen_age_node() opportunistically skips the first pass, i.e., when scan_control->priority is DEF_PRIORITY. On the second pass, lruvec_is_sizable() uses appropriate scan_control->priority, set by set_initial_priority() from lru_gen_shrink_node(), to decide whether a memcg is too small to reclaim from. Now lru_gen_age_node() unconditionally traverses the root_mem_cgroup tree. So it should call set_initial_priority() upfront, to make sure lruvec_is_sizable() uses appropriate scan_control->priority on the first pass. Otherwise, lruvec_is_reclaimable() can return false negatives and result in premature OOM kills when min_ttl_ms is used. Link: https://lkml.kernel.org/r/20240712232956.1427127-1-yuzhao@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: T.J. Mercier <tjmercier@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 daysmm/mglru: fix overshooting shrinker memoryGravatar Yu Zhao 1-2/+8
set_initial_priority() tries to jump-start global reclaim by estimating the priority based on cold/hot LRU pages. The estimation does not account for shrinker objects, and it cannot do so because their sizes can be in different units other than page. If shrinker objects are the majority, e.g., on TrueNAS SCALE 24.04.0 where ZFS ARC can use almost all system memory, set_initial_priority() can vastly underestimate how much memory ARC shrinker can evict and assign extreme low values to scan_control->priority, resulting in overshoots of shrinker objects. To reproduce the problem, using TrueNAS SCALE 24.04.0 with 32GB DRAM, a test ZFS pool and the following commands: fio --name=mglru.file --numjobs=36 --ioengine=io_uring \ --directory=/root/test-zfs-pool/ --size=1024m --buffered=1 \ --rw=randread --random_distribution=random \ --time_based --runtime=1h & for ((i = 0; i < 20; i++)) do sleep 120 fio --name=mglru.anon --numjobs=16 --ioengine=mmap \ --filename=/dev/zero --size=1024m --fadvise_hint=0 \ --rw=randrw --random_distribution=random \ --time_based --runtime=1m done To fix the problem: 1. Cap scan_control->priority at or above DEF_PRIORITY/2, to prevent the jump-start from being overly aggressive. 2. Account for the progress from mm_account_reclaimed_pages(), to prevent kswapd_shrink_node() from raising the priority unnecessarily. Link: https://lkml.kernel.org/r/20240711191957.939105-2-yuzhao@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Alexander Motin <mav@ixsystems.com> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
8 daysmm/mglru: fix div-by-zero in vmpressure_calc_level()Gravatar Yu Zhao 1-1/+0
evict_folios() uses a second pass to reclaim folios that have gone through page writeback and become clean before it finishes the first pass, since folio_rotate_reclaimable() cannot handle those folios due to the isolation. The second pass tries to avoid potential double counting by deducting scan_control->nr_scanned. However, this can result in underflow of nr_scanned, under a condition where shrink_folio_list() does not increment nr_scanned, i.e., when folio_trylock() fails. The underflow can cause the divisor, i.e., scale=scanned+reclaimed in vmpressure_calc_level(), to become zero, resulting in the following crash: [exception RIP: vmpressure_work_fn+101] process_one_work at ffffffffa3313f2b Since scan_control->nr_scanned has no established semantics, the potential double counting has minimal risks. Therefore, fix the problem by not deducting scan_control->nr_scanned in evict_folios(). Link: https://lkml.kernel.org/r/20240711191957.939105-1-yuzhao@google.com Fixes: 359a5e1416ca ("mm: multi-gen LRU: retry folios written back while isolated") Reported-by: Wei Xu <weixugc@google.com> Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Alexander Motin <mav@ixsystems.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-10mm/vmscan: drop checking if _deferred_list is empty before using TTU_SYNCGravatar Barry Song 1-1/+1
The optimization of list_empty(&folio->_deferred_list) aimed to prevent increasing the PTL duration when a large folio is partially unmapped, for example, from subpage 0 to subpage (nr - 2). But Ryan's commit 5ed890ce5147 ("mm: vmscan: avoid split during shrink_folio_list()") actually splits this kind of large folios. This makes the "optimization" useless. Additionally, the list_empty() technically required a data_race() annotation. Link: https://lkml.kernel.org/r/20240629234155.53524-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: add swappiness= arg to memory.reclaimGravatar Dan Schatzberg 1-3/+22
Allow proactive reclaimers to submit an additional swappiness=<val> argument to memory.reclaim. This overrides the global or per-memcg swappiness setting for that reclaim attempt. For example: echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim will perform reclaim on the rootcg with a swappiness setting of 0 (no swap) regardless of the vm.swappiness sysctl setting. Userspace proactive reclaimers use the memory.reclaim interface to trigger reclaim. The memory.reclaim interface does not allow for any way to effect the balance of file vs anon during proactive reclaim. The only approach is to adjust the vm.swappiness setting. However, there are a few reasons we look to control the balance of file vs anon during proactive reclaim, separately from reactive reclaim: * Swapout should be limited to manage SSD write endurance. In near-OOM situations we are fine with lots of swap-out to avoid OOMs. As these are typically rare events, they have relatively little impact on write endurance. However, proactive reclaim runs continuously and so its impact on SSD write endurance is more significant. Therefore it is desireable to control swap-out for proactive reclaim separately from reactive reclaim * Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap exhaustion. This makes sense if the swap exhaustion is triggered due to reactive reclaim but less so if it is triggered due to proactive reclaim (e.g. one could see OOMs when free memory is ample but anon is just particularly cold). Therefore, it's desireable to have proactive reclaim reduce or stop swap-out before the threshold at which OOM killing occurs. In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness before writes to memory.reclaim[2]. This has been in production for nearly two years and has addressed our needs to control proactive vs reactive reclaim behavior but is still not ideal for a number of reasons: * vm.swappiness is a global setting, adjusting it can race/interfere with other system administration that wishes to control vm.swappiness. In our case, we need to disable Senpai before adjusting vm.swappiness. * vm.swappiness is stateful - so a crash or restart of Senpai can leave a misconfigured setting. This requires some additional management to record the "desired" setting and ensure Senpai always adjusts to it. With this patch, we avoid these downsides of adjusting vm.swappiness globally. [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598 Link: https://lkml.kernel.org/r/20240103164841.2800183-3-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Suggested-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Yue Zhao <findns94@gmail.com> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: add defines for min/max swappinessGravatar Dan Schatzberg 1-6/+6
Patch series "Add swappiness argument to memory.reclaim", v6. This patch proposes augmenting the memory.reclaim interface with a swappiness=<val> argument that overrides the swappiness value for that instance of proactive reclaim. Userspace proactive reclaimers use the memory.reclaim interface to trigger reclaim. The memory.reclaim interface does not allow for any way to effect the balance of file vs anon during proactive reclaim. The only approach is to adjust the vm.swappiness setting. However, there are a few reasons we look to control the balance of file vs anon during proactive reclaim, separately from reactive reclaim: * Swapout should be limited to manage SSD write endurance. In near-OOM situations we are fine with lots of swap-out to avoid OOMs. As these are typically rare events, they have relatively little impact on write endurance. However, proactive reclaim runs continuously and so its impact on SSD write endurance is more significant. Therefore it is desireable to control swap-out for proactive reclaim separately from reactive reclaim * Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap exhaustion. This makes sense if the swap exhaustion is triggered due to reactive reclaim but less so if it is triggered due to proactive reclaim (e.g. one could see OOMs when free memory is ample but anon is just particularly cold). Therefore, it's desireable to have proactive reclaim reduce or stop swap-out before the threshold at which OOM killing occurs. In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness before writes to memory.reclaim[2]. This has been in production for nearly two years and has addressed our needs to control proactive vs reactive reclaim behavior but is still not ideal for a number of reasons: * vm.swappiness is a global setting, adjusting it can race/interfere with other system administration that wishes to control vm.swappiness. In our case, we need to disable Senpai before adjusting vm.swappiness. * vm.swappiness is stateful - so a crash or restart of Senpai can leave a misconfigured setting. This requires some additional management to record the "desired" setting and ensure Senpai always adjusts to it. With this patch, we avoid these downsides of adjusting vm.swappiness globally. Previously, this exact interface addition was proposed by Yosry[3]. In response, Roman proposed instead an interface to specify precise file/anon/slab reclaim amounts[4]. More recently Huan also proposed this as well[5] and others similarly questioned if this was the proper interface. Previous proposals sought to use this to allow proactive reclaimers to effectively perform a custom reclaim algorithm by issuing proactive reclaim with different settings to control file vs anon reclaim (e.g. to only reclaim anon from some applications). Responses argued that adjusting swappiness is a poor interface for custom reclaim. In contrast, I argue in favor of a swappiness setting not as a way to implement custom reclaim algorithms but rather to bias the balance of anon vs file due to differences of proactive vs reactive reclaim. In this context, swappiness is the existing interface for controlling this balance and this patch simply allows for it to be configured differently for proactive vs reactive reclaim. Specifying explicit amounts of anon vs file pages to reclaim feels inappropriate for this prupose. Proactive reclaimers are un-aware of the relative age of file vs anon for a cgroup which makes it difficult to manage proactive reclaim of different memory pools. A proactive reclaimer would need some amount of anon reclaim attempts separate from the amount of file reclaim attempts which seems brittle given that it's difficult to observe the impact. [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598 [3]https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/ [4]https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/ [5]https://lore.kernel.org/lkml/20231108065818.19932-1-link@vivo.com/ This patch (of 2): We use the constants 0 and 200 in a few places in the mm code when referring to the min and max swappiness. This patch adds MIN_SWAPPINESS and MAX_SWAPPINESS #defines to improve clarity. There are no functional changes. Link: https://lkml.kernel.org/r/20240103164841.2800183-1-schatzberg.dan@gmail.com Link: https://lkml.kernel.org/r/20240103164841.2800183-2-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yue Zhao <findns94@gmail.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: rename soft limit reclaim-related functionsGravatar Roman Gushchin 1-5/+5
Rename exported function related to the softlimit reclaim to have memcg1_ prefix. Link: https://lkml.kernel.org/r/20240625005906.106920-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: refactor folio_undo_large_rmappable()Gravatar Kefeng Wang 1-6/+2
Folios of order <= 1 are not in deferred list, the check of order is added into folio_undo_large_rmappable() from commit 8897277acfef ("mm: support order-1 folios in the page cache"), but there is a repeated check for small folio (order 0) during each call of the folio_undo_large_rmappable(), so only keep folio_order() check inside the function. In addition, move all the checks into header file to save a function call for non-large-rmappable or empty deferred_list folio. Link: https://lkml.kernel.org/r/20240521130315.46072-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: rename alloc_demote_folio to alloc_migrate_folioGravatar Honggyu Kim 1-2/+2
The alloc_demote_folio can also be used for general migration including both demotion and promotion so it'd be better to rename it from alloc_demote_folio to alloc_migrate_folio. Link: https://lkml.kernel.org/r/20240614030010.751-3-honggyu.kim@sk.com Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Gregory Price <gregory.price@memverge.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Hyeongtak Ji <hyeongtak.ji@sk.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: make alloc_demote_folio externally invokable for migrationGravatar Honggyu Kim 1-2/+1
Patch series "DAMON based tiered memory management for CXL memory", v6. Introduction ============ With the advent of CXL/PCIe attached DRAM, which will be called simply as CXL memory in this cover letter, some systems are becoming more heterogeneous having memory systems with different latency and bandwidth characteristics. They are usually handled as different NUMA nodes in separate memory tiers and CXL memory is used as slow tiers because of its protocol overhead compared to local DRAM. In this kind of systems, we need to be careful placing memory pages on proper NUMA nodes based on the memory access frequency. Otherwise, some frequently accessed pages might reside on slow tiers and it makes performance degradation unexpectedly. Moreover, the memory access patterns can be changed at runtime. To handle this problem, we need a way to monitor the memory access patterns and migrate pages based on their access temperature. The DAMON(Data Access MONitor) framework and its DAMOS(DAMON-based Operation Schemes) can be useful features for monitoring and migrating pages. DAMOS provides multiple actions based on DAMON monitoring results and it can be used for proactive reclaim, which means swapping cold pages out with DAMOS_PAGEOUT action, but it doesn't support migration actions such as demotion and promotion between tiered memory nodes. This series supports two new DAMOS actions; DAMOS_MIGRATE_HOT for promotion from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast tiers. This prevents hot pages from being stuck on slow tiers, which makes performance degradation and cold pages can be proactively demoted to slow tiers so that the system can increase the chance to allocate more hot pages to fast tiers. The DAMON provides various tuning knobs but we found that the proactive demotion for cold pages is especially useful when the system is running out of memory on its fast tier nodes. Our evaluation result shows that it reduces the performance slowdown compared to the default memory policy from 11% to 3~5% when the system runs under high memory pressure on its fast tier DRAM nodes. DAMON configuration =================== The specific DAMON configuration doesn't have to be in the scope of this patch series, but some rough idea is better to be shared to explain the evaluation result. The DAMON provides many knobs for fine tuning but its configuration file is generated by HMSDK[3]. It includes gen_config.py script that generates a json file with the full config of DAMON knobs and it creates multiple kdamonds for each NUMA node when the DAMON is enabled so that it can run hot/cold based migration for tiered memory. Evaluation Workload =================== The performance evaluation is done with redis[4], which is a widely used in-memory database and the memory access patterns are generated via YCSB[5]. We have measured two different workloads with zipfian and latest distributions but their configs are slightly modified to make memory usage higher and execution time longer for better evaluation. The idea of evaluation using these migrate_{hot,cold} actions covers system-wide memory management rather than partitioning hot/cold pages of a single workload. The default memory allocation policy creates pages to the fast tier DRAM node first, then allocates newly created pages to the slow tier CXL node when the DRAM node has insufficient free space. Once the page allocation is done then those pages never move between NUMA nodes. It's not true when using numa balancing, but it is not the scope of this DAMON based tiered memory management support. If the working set of redis can be fit fully into the DRAM node, then the redis will access the fast DRAM only. Since the performance of DRAM only is faster than partially accessing CXL memory in slow tiers, this environment is not useful to evaluate this patch series. To make pages of redis be distributed across fast DRAM node and slow CXL node to evaluate our migrate_{hot,cold} actions, we pre-allocate some cold memory externally using mmap and memset before launching redis-server. We assumed that there are enough amount of cold memory in datacenters as TMO[6] and TPP[7] papers mentioned. The evaluation sequence is as follows. 1. Turn on DAMON with DAMOS_MIGRATE_COLD action for DRAM node and DAMOS_MIGRATE_HOT action for CXL node. It demotes cold pages on DRAM node and promotes hot pages on CXL node in a regular interval. 2. Allocate a huge block of cold memory by calling mmap and memset at the fast tier DRAM node, then make the process sleep to make the fast tier has insufficient space for redis-server. 3. Launch redis-server and load prebaked snapshot image, dump.rdb. The redis-server consumes 52GB of anon pages and 33GB of file pages, but due to the cold memory allocated at 2, it fails allocating the entire memory of redis-server on the fast tier DRAM node so it partially allocates the remaining on the slow tier CXL node. The ratio of DRAM:CXL depends on the size of the pre-allocated cold memory. 4. Run YCSB to make zipfian or latest distribution of memory accesses to redis-server, then measure its execution time when it's completed. 5. Repeat 4 over 50 times to measure the average execution time for each run. 6. Increase the cold memory size then repeat goes to 2. For each test at 4 took about a minute so repeating it 50 times almost took about 1 hour for each test with a specific cold memory from 440GB to 500GB in 10GB increments for each evaluation. So it took about more than 10 hours for both zipfian and latest workloads to get the entire evaluation results. Repeating the same test set multiple times doesn't show much difference so I think it might be enough to make the result reliable. Evaluation Results ================== All the result values are normalized to DRAM-only execution time because the workload cannot be faster than DRAM-only unless the workload hits the peak bandwidth but our redis test doesn't go beyond the bandwidth limit. So the DRAM-only execution time is the ideal result without affected by the gap between DRAM and CXL performance difference. The NUMA node environment is as follows. node0 - local DRAM, 512GB with a CPU socket (fast tier) node1 - disabled node2 - CXL DRAM, 96GB, no CPU attached (slow tier) The following is the result of generating zipfian distribution to redis-server and the numbers are averaged by 50 times of execution. 1. YCSB zipfian distribution read only workload memory pressure with cold memory on node0 with 512GB of local DRAM. ====================+================================================+========= | cold memory occupied by mmap and memset | | 0G 440G 450G 460G 470G 480G 490G 500G | ====================+================================================+========= Execution time normalized to DRAM-only values | GEOMEAN --------------------+------------------------------------------------+--------- DRAM-only | 1.00 - - - - - - - | 1.00 CXL-only | 1.19 - - - - - - - | 1.19 default | - 1.00 1.05 1.08 1.12 1.14 1.18 1.18 | 1.11 DAMON tiered | - 1.03 1.03 1.03 1.03 1.03 1.07 *1.05 | 1.04 DAMON lazy | - 1.04 1.03 1.04 1.05 1.06 1.06 *1.06 | 1.05 ====================+================================================+========= CXL usage of redis-server in GB | AVERAGE --------------------+------------------------------------------------+--------- DRAM-only | 0.0 - - - - - - - | 0.0 CXL-only | 51.4 - - - - - - - | 51.4 default | - 0.6 10.6 20.5 30.5 40.5 47.6 50.4 | 28.7 DAMON tiered | - 0.6 0.5 0.4 0.7 0.8 7.1 5.6 | 2.2 DAMON lazy | - 0.5 3.0 4.5 5.4 6.4 9.4 9.1 | 5.5 ====================+================================================+========= Each test result is based on the execution environment as follows. DRAM-only: redis-server uses only local DRAM memory. CXL-only: redis-server uses only CXL memory. default: default memory policy(MPOL_DEFAULT). numa balancing disabled. DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM nodes and DAMOS_MIGRATE_HOT for CXL nodes. DAMON lazy: same as DAMON tiered, but turn on DAMON just before making memory access request via YCSB. The above result shows the "default" execution time goes up as the size of cold memory is increased from 440G to 500G because the more cold memory used, the more CXL memory is used for the target redis workload and this makes the execution time increase. However, "DAMON tiered" and other DAMON results show less slowdown because the DAMOS_MIGRATE_COLD action at DRAM node proactively demotes pre-allocated cold memory to CXL node and this free space at DRAM increases more chance to allocate hot or warm pages of redis-server to fast DRAM node. Moreover, DAMOS_MIGRATE_HOT action at CXL node also promotes hot pages of redis-server to DRAM node actively. As a result, it makes more memory of redis-server stay in DRAM node compared to "default" memory policy and this makes the performance improvement. Please note that the result numbers of "DAMON tiered" and "DAMON lazy" at 500G are marked with * stars, which means their test results are replaced with reproduced tests that didn't have OOM issue. That was needed because sometimes the test processes get OOM when DRAM has insufficient space. The DAMOS_MIGRATE_HOT doesn't kick reclaim but just gives up migration when there is not enough space at DRAM side. The problem happens when there is competition between normal allocation and migration and the migration is done before normal allocation, then the completely unrelated normal allocation can trigger reclaim, which incurs OOM. Because of this issue, I have also tested more cases with "demotion_enabled" flag enabled to make such reclaim doesn't trigger OOM, but just demote reclaimed pages. The following test results show more tests with "kswapd" marked. 2. YCSB zipfian distribution read only workload (with demotion_enabled true) memory pressure with cold memory on node0 with 512GB of local DRAM. ====================+================================================+========= | cold memory occupied by mmap and memset | | 0G 440G 450G 460G 470G 480G 490G 500G | ====================+================================================+========= Execution time normalized to DRAM-only values | GEOMEAN --------------------+------------------------------------------------+--------- DAMON tiered | - 1.03 1.03 1.03 1.03 1.03 1.07 1.05 | 1.04 DAMON lazy | - 1.04 1.03 1.04 1.05 1.06 1.06 1.06 | 1.05 DAMON tiered kswapd | - 1.03 1.03 1.03 1.03 1.02 1.02 1.03 | 1.03 DAMON lazy kswapd | - 1.04 1.04 1.04 1.03 1.05 1.04 1.05 | 1.04 ====================+================================================+========= CXL usage of redis-server in GB | AVERAGE --------------------+------------------------------------------------+--------- DAMON tiered | - 0.6 0.5 0.4 0.7 0.8 7.1 5.6 | 2.2 DAMON lazy | - 0.5 3.0 4.5 5.4 6.4 9.4 9.1 | 5.5 DAMON tiered kswapd | - 0.0 0.0 0.4 0.5 0.1 0.8 1.0 | 0.4 DAMON lazy kswapd | - 4.2 4.6 5.3 1.7 6.8 8.1 5.8 | 5.2 ====================+================================================+========= Each test result is based on the exeuction environment as follows. DAMON tiered: same as before DAMON lazy: same as before DAMON tiered kswapd: same as DAMON tiered, but turn on /sys/kernel/mm/numa/demotion_enabled to make kswapd or direct reclaim does demotion. DAMON lazy kswapd: same as DAMON lazy, but turn on /sys/kernel/mm/numa/demotion_enabled to make kswapd or direct reclaim does demotion. The "DAMON tiered kswapd" and "DAMON lazy kswapd" didn't trigger OOM at all unlike other tests because kswapd and direct reclaim from DRAM node can demote reclaimed pages to CXL node independently from DAMON actions and their results are slightly better than without having "demotion_enabled". In summary, the evaluation results show that DAMON memory management with DAMOS_MIGRATE_{HOT,COLD} actions reduces the performance slowdown compared to the "default" memory policy from 11% to 3~5% when the system runs with high memory pressure on its fast tier DRAM nodes. Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make tiered memory systems run more efficiently under high memory pressures. This patch (of 7): The alloc_demote_folio can be used out of vmscan.c so it'd be better to remove static keyword from it. Link: https://lkml.kernel.org/r/20240614030010.751-1-honggyu.kim@sk.com Link: https://lkml.kernel.org/r/20240614030010.751-2-honggyu.kim@sk.com Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Gregory Price <gregory.price@memverge.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Hyeongtak Ji <hyeongtak.ji@sk.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: vmscan: reset sc->priority on retryGravatar Shakeel Butt 1-0/+1
The commit 6be5e186fd65 ("mm: vmscan: restore incremental cgroup iteration") added a retry reclaim heuristic to iterate all the cgroups before returning an unsuccessful reclaim but missed to reset the sc->priority. Let's fix it. Link: https://lkml.kernel.org/r/20240529154911.3008025-1-shakeel.butt@linux.dev Fixes: 6be5e186fd65 ("mm: vmscan: restore incremental cgroup iteration") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reported-by: syzbot+17416257cb95200cba44@syzkaller.appspotmail.com Tested-by: syzbot+17416257cb95200cba44@syzkaller.appspotmail.com Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: vmscan: restore incremental cgroup iterationGravatar Johannes Weiner 1-2/+40
Currently, reclaim always walks the entire cgroup tree in order to ensure fairness between groups. While overreclaim is limited in shrink_lruvec(), many of our systems have a sizable number of active groups, and an even bigger number of idle cgroups with cache left behind by previous jobs; the mere act of walking all these cgroups can impose significant latency on direct reclaimers. In the past, we've used a save-and-restore iterator that enabled incremental tree walks over multiple reclaim invocations. This ensured fairness, while keeping the work of individual reclaimers small. However, in edge cases with a lot of reclaim concurrency, individual reclaimers would sometimes not see enough of the cgroup tree to make forward progress and (prematurely) declare OOM. Consequently we switched to comprehensive walks in 1ba6fc9af35b ("mm: vmscan: do not share cgroup iteration between reclaimers"). To address the latency problem without bringing back the premature OOM issue, reinstate the shared iteration, but with a restart condition to do the full walk in the OOM case - similar to what we do for memory.low enforcement and active page protection. In the worst case, we do one more full tree walk before declaring OOM. But the vast majority of direct reclaim scans can then finish much quicker, while fairness across the tree is maintained: - Before this patch, we observed that direct reclaim always takes more than 100us and most direct reclaim time is spent in reclaim cycles lasting between 1ms and 1 second. Almost 40% of direct reclaim time was spent on reclaim cycles exceeding 100ms. - With this patch, almost all page reclaim cycles last less than 10ms, and a good amount of direct page reclaim finishes in under 100us. No page reclaim cycles lasting over 100ms were observed anymore. The shared iterator state is maintaned inside the target cgroup, so fair and incremental walks are performed during both global reclaim and cgroup limit reclaim of complex subtrees. Link: https://lkml.kernel.org/r/20240514202641.2821494-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Rik van Riel <riel@surriel.com> Reported-by: Rik van Riel <riel@surriel.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Facebook Kernel Team <kernel-team@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-05mm: drop the 'anon_' prefix for swap-out mTHP countersGravatar Baolin Wang 1-1/+1
The mTHP swap related counters: 'anon_swpout' and 'anon_swpout_fallback' are confusing with an 'anon_' prefix, since the shmem can swap out non-anonymous pages. So drop the 'anon_' prefix to keep consistent with the old swap counter names. This is needed in 6.10-rcX to avoid having an inconsistent ABI out in the field. Link: https://lkml.kernel.org/r/7a8989c13299920d7589007a30065c3e2c19f0e0.1716431702.git.baolin.wang@linux.alibaba.com Fixes: d0f048ac39f6 ("mm: add per-order mTHP anon_swpout and anon_swpout_fallback counters") Fixes: 42248b9d34ea ("mm: add docs for per-order mTHP counters and transhuge_page ABI") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07mm/vmscan: remove ignore_references argument of reclaim_folio_list()Gravatar SeongJae Park 1-6/+4
All reclaim_folio_list() callers are passing 'true' for 'ignore_references' parameter. In other words, the parameter is not really being used. Simplify the code by removing the parameter. Link: https://lkml.kernel.org/r/20240429224451.67081-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07mm/vmscan: remove ignore_references argument of reclaim_pages()Gravatar SeongJae Park 1-3/+3
All reclaim_pages() callers are setting 'ignore_references' parameter 'true'. In other words, the parameter is not really being used. Remove the argument to make it simple. Link: https://lkml.kernel.org/r/20240429224451.67081-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: add per-order mTHP anon_swpout and anon_swpout_fallback countersGravatar Barry Song 1-0/+3
This helps to display the fragmentation situation of the swapfile, knowing the proportion of how much we haven't split large folios. So far, we only support non-split swapout for anon memory, with the possibility of expanding to shmem in the future. So, we add the "anon" prefix to the counter names. Link: https://lkml.kernel.org/r/20240412114858.407208-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: vmscan: avoid split during shrink_folio_list()Gravatar Ryan Roberts 1-10/+10
Now that swap supports storing all mTHP sizes, avoid splitting large folios before swap-out. This benefits performance of the swap-out path by eliding split_folio_to_list(), which is expensive, and also sets us up for swapping in large folios in a future series. If the folio is partially mapped, we continue to split it since we want to avoid the extra IO overhead and storage of writing out pages uneccessarily. THP_SWPOUT and THP_SWPOUT_FALLBACK counters should continue to count events only for PMD-mappable folios to avoid user confusion. THP_SWPOUT already has the appropriate guard. Add a guard for THP_SWPOUT_FALLBACK. It may be appropriate to add per-size counters in future. Link: https://lkml.kernel.org/r/20240408183946.2991168-7-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Barry Song <v-songbaohua@oppo.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: hold PTL from the first PTE while reclaiming a large folioGravatar Barry Song 1-0/+14
Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE modifications preceded by pte clear. While iterating over PTEs of a large folio, it only starts acquiring PTL from the first valid (present) PTE. PTE modifications can temporarily set PTEs to pte_none. Consequently, the initial PTEs of a large folio might be skipped in try_to_unmap_one(). For example, for an anon folio, if we skip PTE0, we may have PTE0 which is still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after try_to_unmap_one(). So folio will be still mapped, the folio fails to be reclaimed and is put back to LRU in this round. This also breaks up PTEs optimization such as CONT-PTE on this large folio and may lead to accident folio_split() afterwards. And since a part of PTEs are now swap entries, accessing those parts will introduce overhead - do_swap_page. Although the kernel can withstand all of the above issues, the situation still seems quite awkward and warrants making it more ideal. The same race also occurs with small folios, but they have only one PTE, thus, it won't be possible for them to be partially unmapped. This patch holds PTL from PTE0, allowing us to avoid reading PTE values that are in the process of being transformed. With stable PTE values, we can ensure that this large folio is either completely reclaimed or that all PTEs remain untouched in this round. A corner case is that if we hold PTL from PTE0 and most initial PTEs have been really unmapped before that, we may increase the duration of holding PTL. Thus we only apply this optimization to folios which are still entirely mapped (not in deferred_split list). [akpm@linux-foundation.org: rewrap comment, per Matthew] Link: https://lkml.kernel.org/r/20240306095219.71086-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Huang, Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: record the migration reason for struct migration_target_controlGravatar Baolin Wang 1-1/+2
Patch series "make the hugetlb migration strategy consistent", v2. As discussed in previous thread [1], there is an inconsistency when handling hugetlb migration. When handling the migration of freed hugetlb, it prevents fallback to other NUMA nodes in alloc_and_dissolve_hugetlb_folio(). However, when dealing with in-use hugetlb, it allows fallback to other NUMA nodes in alloc_hugetlb_folio_nodemask(), which can break the per-node hugetlb pool and might result in unexpected failures when node bound workloads doesn't get what is asssumed available. This patchset tries to make the hugetlb migration strategy more clear and consistent. Please find details in each patch. [1] https://lore.kernel.org/all/6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com/ This patch (of 2): To support different hugetlb allocation strategies during hugetlb migration based on various migration reasons, record the migration reason in the migration_target_control structure as a preparation. Link: https://lkml.kernel.org/r/cover.1709719720.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/7b95d4981e07211f57139fc5b1f7ce91b920cee4.1709719720.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-18Merge branch 'master' into mm-stableGravatar Andrew Morton 1-1/+4
2024-03-12mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failureGravatar Byungchul Park 1-1/+20
With cache_trim_mode on, reclaim logic doesn't bother reclaiming anon pages. However, it should be more careful to use the mode because it's going to prevent anon pages from being reclaimed even if there are a huge number of anon pages that are cold and should be reclaimed. Even worse, that leads kswapd_failures to reach MAX_RECLAIM_RETRIES and stopping kswapd from functioning until direct reclaim eventually works to resume kswapd. So kswapd needs to retry its scan priority loop with cache_trim_mode off again if the mode doesn't work for reclaim. The problematic behavior can be reproduced by: CONFIG_NUMA_BALANCING enabled sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING numa node0 (8GB local memory, 16 CPUs) numa node1 (8GB slow tier memory, no CPUs) Sequence: 1) echo 3 > /proc/sys/vm/drop_caches 2) To emulate the system with full of cold memory in local DRAM, run the following dummy program and never touch the region: mmap(0, 8 * 1024 * 1024 * 1024, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0); 3) Run any memory intensive work e.g. XSBench. 4) Check if numa balancing is working e.i. promotion/demotion. 5) Iterate 1) ~ 4) until numa balancing stops. With this, you could see that promotion/demotion are not working because kswapd has stopped due to ->kswapd_failures >= MAX_RECLAIM_RETRIES. Interesting vmstat delta's differences between before and after are like: +-----------------------+-------------------------------+ | interesting vmstat | before | after | +-----------------------+-------------------------------+ | nr_inactive_anon | 321935 | 1664772 | | nr_active_anon | 1780700 | 437834 | | nr_inactive_file | 30425 | 40882 | | nr_active_file | 14961 | 3012 | | pgpromote_success | 356 | 1293122 | | pgpromote_candidate | 21953245 | 1824148 | | pgactivate | 1844523 | 3311907 | | pgdeactivate | 50634 | 1554069 | | pgfault | 31100294 | 6518806 | | pgdemote_kswapd | 30856 | 2230821 | | pgscan_kswapd | 1861981 | 7667629 | | pgscan_anon | 1822930 | 7610583 | | pgscan_file | 39051 | 57046 | | pgsteal_anon | 386 | 2192033 | | pgsteal_file | 30470 | 38788 | | pageoutrun | 30 | 412 | | numa_hint_faults | 27418279 | 2875955 | | numa_pages_migrated | 356 | 1293122 | +-----------------------+-------------------------------+ Link: https://lkml.kernel.org/r/20240304082118.20499-1-byungchul@sk.com Signed-off-by: Byungchul Park <byungchul@sk.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-12mm: remove folio from deferred split list before uncharging itGravatar Matthew Wilcox (Oracle) 1-0/+6
When freeing a large folio, we must remove it from the deferred split list before we uncharge it as each memcg has its own deferred split list (with associated lock) and removing a folio from the deferred split list while holding the wrong lock will corrupt that list and cause various related problems. Link: https://lore.kernel.org/linux-mm/367a14f7-340e-4b29-90ae-bc3fcefdd5f4@arm.com/ Link: https://lkml.kernel.org/r/20240311191835.312162-1-willy@infradead.org Fixes: f77171d241e3 (mm: allow non-hugetlb large folios to be batch processed) Fixes: 29f3843026cf (mm: free folios directly in move_folios_to_lru()) Fixes: bc2ff4cbc329 (mm: free folios in a batch in shrink_folio_list()) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Debugged-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm: free folios directly in move_folios_to_lru()Gravatar Matthew Wilcox (Oracle) 1-20/+12
The few folios which can't be moved to the LRU list (because their refcount dropped to zero) used to be returned to the caller to dispose of. Make this simpler to call by freeing the folios directly through free_unref_folios(). Link: https://lkml.kernel.org/r/20240227174254.710559-13-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm: free folios in a batch in shrink_folio_list()Gravatar Matthew Wilcox (Oracle) 1-11/+9
Use free_unref_page_batch() to free the folios. This may increase the number of IPIs from calling try_to_unmap_flush() more often, but that's going to be very workload-dependent. It may even reduce the number of IPIs as we now batch-free large folios instead of freeing them one at a time. Link: https://lkml.kernel.org/r/20240227174254.710559-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm: madvise: pageout: ignore references rather than clearing youngGravatar Barry Song 1-5/+7
While doing MADV_PAGEOUT, the current code will clear PTE young so that vmscan won't read young flags to allow the reclamation of madvised folios to go ahead. It seems we can do it by directly ignoring references, thus we can remove tlb flush in madvise and rmap overhead in vmscan. Regarding the side effect, in the original code, if a parallel thread runs side by side to access the madvised memory with the thread doing madvise, folios will get a chance to be re-activated by vmscan (though the time gap is actually quite small since checking PTEs is done immediately after clearing PTEs young). But with this patch, they will still be reclaimed. But this behaviour doing PAGEOUT and doing access at the same time is quite silly like DoS. So probably, we don't need to care. Or ignoring the new access during the quite small time gap is even better. For DAMON's DAMOS_PAGEOUT based on physical address region, we still keep its behaviour as is since a physical address might be mapped by multiple processes. MADV_PAGEOUT based on virtual address is actually much more aggressive on reclamation. To untouch paddr's DAMOS_PAGEOUT, we simply pass ignore_references as false in reclaim_pages(). A microbench as below has shown 6% decrement on the latency of MADV_PAGEOUT, #define PGSIZE 4096 main() { int i; #define SIZE 512*1024*1024 volatile long *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); for (i = 0; i < SIZE/sizeof(long); i += PGSIZE / sizeof(long)) p[i] = 0x11; madvise(p, SIZE, MADV_PAGEOUT); } w/o patch w/ patch root@10:~# time ./a.out root@10:~# time ./a.out real 0m49.634s real 0m46.334s user 0m0.637s user 0m0.648s sys 0m47.434s sys 0m44.265s Link: https://lkml.kernel.org/r/20240226005739.24350-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: SeongJae Park <sj@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL ↵Gravatar Vlastimil Babka 1-1/+4
allocations Sven reports an infinite loop in __alloc_pages_slowpath() for costly order __GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such combination can happen in a suspend/resume context where a GFP_KERNEL allocation can have __GFP_IO masked out via gfp_allowed_mask. Quoting Sven: 1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER) with __GFP_RETRY_MAYFAIL set. 2. page alloc's __alloc_pages_slowpath tries to get a page from the freelist. This fails because there is nothing free of that costly order. 3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim, which bails out because a zone is ready to be compacted; it pretends to have made a single page of progress. 4. page alloc tries to compact, but this always bails out early because __GFP_IO is not set (it's not passed by the snd allocator, and even if it were, we are suspending so the __GFP_IO flag would be cleared anyway). 5. page alloc believes reclaim progress was made (because of the pretense in item 3) and so it checks whether it should retry compaction. The compaction retry logic thinks it should try again, because: a) reclaim is needed because of the early bail-out in item 4 b) a zonelist is suitable for compaction 6. goto 2. indefinite stall. (end quote) The immediate root cause is confusing the COMPACT_SKIPPED returned from __alloc_pages_direct_compact() (step 4) due to lack of __GFP_IO to be indicating a lack of order-0 pages, and in step 5 evaluating that in should_compact_retry() as a reason to retry, before incrementing and limiting the number of retries. There are however other places that wrongly assume that compaction can happen while we lack __GFP_IO. To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO evaluation and switch the open-coded test in try_to_compact_pages() to use it. Also use the new helper in: - compaction_ready(), which will make reclaim not bail out in step 3, so there's at least one attempt to actually reclaim, even if chances are small for a costly order - in_reclaim_compaction() which will make should_continue_reclaim() return false and we don't over-reclaim unnecessarily - in __alloc_pages_slowpath() to set a local variable can_compact, which is then used to avoid retrying reclaim/compaction for costly allocations (step 5) if we can't compact and also to skip the early compaction attempt that we do in some cases Link: https://lkml.kernel.org/r/20240221114357.13655-2-vbabka@suse.cz Fixes: 3250845d0526 ("Revert "mm, oom: prevent premature OOM killer invocation for high order request"") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Sven van Ashbrook <svenva@chromium.org> Closes: https://lore.kernel.org/all/CAG-rBihs_xMKb3wrMO1%2B-%2Bp4fowP9oy1pa_OTkfxBzPUVOZF%2Bg@mail.gmail.com/ Tested-by: Karthikeyan Ramasubramanian <kramasub@chromium.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Curtis Malainey <cujomalainey@chromium.org> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Takashi Iwai <tiwai@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/mglru: improve swappiness handlingGravatar Kinsey Ho 1-10/+10
The reclaimable number of anon pages used to set initial reclaim priority is only based on get_swappiness(). Use can_reclaim_anon_pages() to include NUMA node demotion. Also move the swappiness handling of when !__GFP_IO in try_to_shrink_lruvec() into isolate_folios(). Link: https://lkml.kernel.org/r/20240214060538.3524462-6-kinseyho@google.com Signed-off-by: Kinsey Ho <kinseyho@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Donet Tom <donettom@linux.vnet.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/mglru: improve struct lru_gen_mm_walkGravatar Kinsey Ho 1-24/+26
Rename max_seq to seq in struct lru_gen_mm_walk to keep consistent with struct lru_gen_mm_state. Note that seq is not always up to date with max_seq from lru_gen_folio. No functional changes. Link: https://lkml.kernel.org/r/20240214060538.3524462-5-kinseyho@google.com Signed-off-by: Kinsey Ho <kinseyho@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Donet Tom <donettom@linux.vnet.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/mglru: improve reset_mm_stats()Gravatar Kinsey Ho 1-20/+22
struct lruvec* is already a field of struct lru_gen_mm_walk. Remove the parameter struct lruvec* into functions that already have access to struct lru_gen_mm_walk*. Also, we do not need to handle reset histogram stats when !should_walk_mmu(). Remove the call to reset_mm_stats() in iterate_mm_list_nowalk(). Link: https://lkml.kernel.org/r/20240214060538.3524462-4-kinseyho@google.com Signed-off-by: Kinsey Ho <kinseyho@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Donet Tom <donettom@linux.vnet.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/mglru: improve should_run_aging()Gravatar Kinsey Ho 1-14/+11
scan_control *sc does not need to be passed into should_run_aging(), as it provides only the reclaim priority. This can be moved to get_nr_to_scan(). Refactor should_run_aging() and get_nr_to_scan() to improve code readability. No functional changes. Link: https://lkml.kernel.org/r/20240214060538.3524462-3-kinseyho@google.com Signed-off-by: Kinsey Ho <kinseyho@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Donet Tom <donettom@linux.vnet.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/mglru: drop unused parameterGravatar Kinsey Ho 1-5/+5
Patch series "mm/mglru: code cleanup and refactoring" This provides MGLRU code cleanup and refactoring for better readability. This patch (of 5): struct scan_control *sc is currently passed into try_to_inc_max_seq() and run_aging(). This parameter is not used. Drop the unused parameter struct scan_control *sc. No functional change. Link: https://lkml.kernel.org/r/20240214060538.3524462-1-kinseyho@google.com Link: https://lkml.kernel.org/r/20240214060538.3524462-2-kinseyho@google.com Signed-off-by: Kinsey Ho <kinseyho@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Donet Tom <donettom@linux.vnet.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/vmscan: make too_many_isolated return boolGravatar Hao Ge 1-3/+3
too_many_isolated() should return bool as does the similar too_many_isolated() in mm/compaction.c. Link: https://lkml.kernel.org/r/20240205042618.108140-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/vmscan: change the type of file from int to boolGravatar Hao Ge 1-2/+2
Change the type of file from int to bool because is_file_lru return bool Link: https://lkml.kernel.org/r/20240131103802.122920-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22kswapd: replace try_to_freeze() with kthread_freezable_should_stop()Gravatar Levi Yun 1-6/+6
Instead of using try_to_freeze, use kthread_freezable_should_stop in kswapd. By this, we can avoid unnecessary freezing when kswapd should stop. Link: https://lkml.kernel.org/r/20240126152556.58791-1-ppbuk5246@gmail.com Signed-off-by: Levi Yun <ppbuk5246@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-09Merge tag 'mm-stable-2024-01-08-15-31' of ↵Gravatar Linus Torvalds 1-130/+150
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintanance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. - Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ...
2024-01-08mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDERGravatar Kirill A. Shutemov 1-1/+1
commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has changed the definition of MAX_ORDER to be inclusive. This has caused issues with code that was not yet upstream and depended on the previous definition. To draw attention to the altered meaning of the define, rename MAX_ORDER to MAX_PAGE_ORDER. Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-05mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCINGGravatar Li Zhijian 1-4/+1
Demotion can work well without CONFIG_NUMA_BALANCING. But the commit 23e9f0138963 ("mm/vmstat: move pgdemote_* to per-node stats") wrongly hid it behind CONFIG_NUMA_BALANCING. Fix it by moving them out of CONFIG_NUMA_BALANCING. Link: https://lkml.kernel.org/r/20231229022651.3229174-1-lizhijian@fujitsu.com Fixes: 23e9f0138963 ("mm/vmstat: move pgdemote_* to per-node stats") Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-05mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGEGravatar Kinsey Ho 1-11/+1
Improve code readability by removing CONFIG_TRANSPARENT_HUGEPAGE, since the compiler should be able to automatically optimize out the code that promotes THPs during page table walks. No functional changes. Link: https://lkml.kernel.org/r/20231227141205.2200125-6-kinseyho@google.com Signed-off-by: Kinsey Ho <kinseyho@google.com> Co-developed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Tested-by: Donet Tom <donettom@linux.vnet.ibm.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-05mm/mglru: remove CONFIG_MEMCGGravatar Kinsey Ho 1-46/+21
Remove CONFIG_MEMCG in a refactoring to improve code readability at the cost of a few bytes in struct lru_gen_folio per node when CONFIG_MEMCG=n. Link: https://lkml.kernel.org/r/20231227141205.2200125-4-kinseyho@google.com Signed-off-by: Kinsey Ho <kinseyho@google.com> Co-developed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Tested-by: Donet Tom <donettom@linux.vnet.ibm.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-05mm/mglru: add CONFIG_LRU_GEN_WALKS_MMUGravatar Kinsey Ho 1-69/+123
Add CONFIG_LRU_GEN_WALKS_MMU such that if disabled, the code that walks page tables to promote pages into the youngest generation will not be built. Also improves code readability by adding two helper functions get_mm_state() and get_next_mm(). Link: https://lkml.kernel.org/r/20231227141205.2200125-3-kinseyho@google.com Signed-off-by: Kinsey Ho <kinseyho@google.com> Co-developed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Tested-by: Donet Tom <donettom@linux.vnet.ibm.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29mm/mglru: skip special VMAs in lru_gen_look_around()Gravatar Yu Zhao 1-4/+9
Special VMAs like VM_PFNMAP can contain anon pages from COW. There isn't much profit in doing lookaround on them. Besides, they can trigger the pte_special() warning in get_pte_pfn(). Skip them in lru_gen_look_around(). Link: https://lkml.kernel.org/r/20231223045647.1566043-1-yuzhao@google.com Fixes: 018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: syzbot+03fd9b3f71641f0ebf2d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/000000000000f9ff00060d14c256@google.com/ Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20mm: memcg: restore subtree stats flushingGravatar Yosry Ahmed 1-1/+1
Stats flushing for memcg currently follows the following rules: - Always flush the entire memcg hierarchy (i.e. flush the root). - Only one flusher is allowed at a time. If someone else tries to flush concurrently, they skip and return immediately. - A periodic flusher flushes all the stats every 2 seconds. The reason this approach is followed is because all flushes are serialized by a global rstat spinlock. On the memcg side, flushing is invoked from userspace reads as well as in-kernel flushers (e.g. reclaim, refault, etc). This approach aims to avoid serializing all flushers on the global lock, which can cause a significant performance hit under high concurrency. This approach has the following problems: - Occasionally a userspace read of the stats of a non-root cgroup will be too expensive as it has to flush the entire hierarchy [1]. - Sometimes the stats accuracy are compromised if there is an ongoing flush, and we skip and return before the subtree of interest is actually flushed, yielding stale stats (by up to 2s due to periodic flushing). This is more visible when reading stats from userspace, but can also affect in-kernel flushers. The latter problem is particulary a concern when userspace reads stats after an event occurs, but gets stats from before the event. Examples: - When memory usage / pressure spikes, a userspace OOM handler may look at the stats of different memcgs to select a victim based on various heuristics (e.g. how much private memory will be freed by killing this). Reading stale stats from before the usage spike in this case may cause a wrongful OOM kill. - A proactive reclaimer may read the stats after writing to memory.reclaim to measure the success of the reclaim operation. Stale stats from before reclaim may give a false negative. - Reading the stats of a parent and a child memcg may be inconsistent (child larger than parent), if the flush doesn't happen when the parent is read, but happens when the child is read. As for in-kernel flushers, they will occasionally get stale stats. No regressions are currently known from this, but if there are regressions, they would be very difficult to debug and link to the source of the problem. This patch aims to fix these problems by restoring subtree flushing, and removing the unified/coalesced flushing logic that skips flushing if there is an ongoing flush. This change would introduce a significant regression with global stats flushing thresholds. With per-memcg stats flushing thresholds, this seems to perform really well. The thresholds protect the underlying lock from unnecessary contention. This patch was tested in two ways to ensure the latency of flushing is up to par, on a machine with 384 cpus: - A synthetic test with 5000 concurrent workers in 500 cgroups doing allocations and reclaim, as well as 1000 readers for memory.stat (variation of [2]). No regressions were noticed in the total runtime. Note that significant regressions in this test are observed with global stats thresholds, but not with per-memcg thresholds. - A synthetic stress test for concurrently reading memcg stats while memory allocation/freeing workers are running in the background, provided by Wei Xu [3]. With 250k threads reading the stats every 100ms in 50k cgroups, 99.9% of reads take <= 50us. Less than 0.01% of reads take more than 1ms, and no reads take more than 100ms. [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/ [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/ [akpm@linux-foundation.org: fix mm/zswap.c] [yosryahmed@google.com: remove stats flushing mutex] Link: https://lkml.kernel.org/r/CAJD7tkZgP3m-VVPn+fF_YuvXeQYK=tZZjJHj=dzD=CcSSpp2qg@mail.gmail.com Link: https://lkml.kernel.org/r/20231129032154.3710765-6-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20sync mm-stable with mm-hotfixes-stable to pick up depended-upon changesGravatar Andrew Morton 1-31/+61
2023-12-12mm/mglru: reclaim offlined memcgs harderGravatar Yu Zhao 1-8/+16
In the effort to reduce zombie memcgs [1], it was discovered that the memcg LRU doesn't apply enough pressure on offlined memcgs. Specifically, instead of rotating them to the tail of the current generation (MEMCG_LRU_TAIL) for a second attempt, it moves them to the next generation (MEMCG_LRU_YOUNG) after the first attempt. Not applying enough pressure on offlined memcgs can cause them to build up, and this can be particularly harmful to memory-constrained systems. On Pixel 8 Pro, launching apps for 50 cycles: Before After Change Zombie memcgs 45 35 -22% [1] https://lore.kernel.org/CABdmKX2M6koq4Q0Cmp_-=wbP0Qa190HdEGGaHfxNS05gAkUtPA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20231208061407.2125867-4-yuzhao@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: T.J. Mercier <tjmercier@google.com> Tested-by: T.J. Mercier <tjmercier@google.com> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12mm/mglru: respect min_ttl_ms with memcgsGravatar Yu Zhao 1-14/+16
While investigating kswapd "consuming 100% CPU" [1] (also see "mm/mglru: try to stop at high watermarks"), it was discovered that the memcg LRU can breach the thrashing protection imposed by min_ttl_ms. Before the memcg LRU: kswapd() shrink_node_memcgs() mem_cgroup_iter() inc_max_seq() // always hit a different memcg lru_gen_age_node() mem_cgroup_iter() check the timestamp of the oldest generation After the memcg LRU: kswapd() shrink_many() restart: iterate the memcg LRU: inc_max_seq() // occasionally hit the same memcg if raced with lru_gen_rotate_memcg(): goto restart lru_gen_age_node() mem_cgroup_iter() check the timestamp of the oldest generation Specifically, when the restart happens in shrink_many(), it needs to stick with the (memcg LRU) generation it began with. In other words, it should neither re-read memcg_lru->seq nor age an lruvec of a different generation. Otherwise it can hit the same memcg multiple times without giving lru_gen_age_node() a chance to check the timestamp of that memcg's oldest generation (against min_ttl_ms). [1] https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/ Link: https://lkml.kernel.org/r/20231208061407.2125867-3-yuzhao@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Yu Zhao <yuzhao@google.com> Tested-by: T.J. Mercier <tjmercier@google.com> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12mm/mglru: try to stop at high watermarksGravatar Yu Zhao 1-8/+28
The initial MGLRU patchset didn't include the memcg LRU support, and it relied on should_abort_scan(), added by commit f76c83378851 ("mm: multi-gen LRU: optimize multiple memcgs"), to "backoff to avoid overshooting their aggregate reclaim target by too much". Later on when the memcg LRU was added, should_abort_scan() was deemed unnecessary, and the test results [1] showed no side effects after it was removed by commit a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard"). However, that test used memory.reclaim, which sets nr_to_reclaim to SWAP_CLUSTER_MAX. So it can overshoot only by SWAP_CLUSTER_MAX-1 pages, i.e., from nr_reclaimed=nr_to_reclaim-1 to nr_reclaimed=nr_to_reclaim+SWAP_CLUSTER_MAX-1. Compared with the batch size kswapd sets to nr_to_reclaim, SWAP_CLUSTER_MAX is tiny. Therefore that test isn't able to reproduce the worst case scenario, i.e., kswapd overshooting GBs on large systems and "consuming 100% CPU" (see the Closes tag). Bring back a simplified version of should_abort_scan() on top of the memcg LRU, so that kswapd stops when all eligible zones are above their respective high watermarks plus a small delta to lower the chance of KSWAPD_HIGH_WMARK_HIT_QUICKLY. Note that this only applies to order-0 reclaim, meaning compaction-induced reclaim can still run wild (which is a different problem). On Android, launching 55 apps sequentially: Before After Change pgpgin 838377172 802955040 -4% pgpgout 38037080 34336300 -10% [1] https://lore.kernel.org/20221222041905.2431096-1-yuzhao@google.com/ Link: https://lkml.kernel.org/r/20231208061407.2125867-2-yuzhao@google.com Fixes: a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Charan Teja Kalla <quic_charante@quicinc.com> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Closes: https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/ Tested-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Tested-by: Kalesh Singh <kaleshsingh@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: T.J. Mercier <tjmercier@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12mm/mglru: fix underprotected page cacheGravatar Yu Zhao 1-1/+1
Unmapped folios accessed through file descriptors can be underprotected. Those folios are added to the oldest generation based on: 1. The fact that they are less costly to reclaim (no need to walk the rmap and flush the TLB) and have less impact on performance (don't cause major PFs and can be non-blocking if needed again). 2. The observation that they are likely to be single-use. E.g., for client use cases like Android, its apps parse configuration files and store the data in heap (anon); for server use cases like MySQL, it reads from InnoDB files and holds the cached data for tables in buffer pools (anon). However, the oldest generation can be very short lived, and if so, it doesn't provide the PID controller with enough time to respond to a surge of refaults. (Note that the PID controller uses weighted refaults and those from evicted generations only take a half of the whole weight.) In other words, for a short lived generation, the moving average smooths out the spike quickly. To fix the problem: 1. For folios that are already on LRU, if they can be beyond the tracking range of tiers, i.e., five accesses through file descriptors, move them to the second oldest generation to give them more time to age. (Note that tiers are used by the PID controller to statistically determine whether folios accessed multiple times through file descriptors are worth protecting.) 2. When adding unmapped folios to LRU, adjust the placement of them so that they are not too close to the tail. The effect of this is similar to the above. On Android, launching 55 apps sequentially: Before After Change workingset_refault_anon 25641024 25598972 0% workingset_refault_file 115016834 106178438 -8% Link: https://lkml.kernel.org/r/20231208061407.2125867-1-yuzhao@google.com Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Charan Teja Kalla <quic_charante@quicinc.com> Tested-by: Kalesh Singh <kaleshsingh@google.com> Cc: T.J. Mercier <tjmercier@google.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10mm/vmstat: move pgdemote_* to per-node statsGravatar Li Zhijian 1-4/+8
Demotion will migrate pages across nodes. Previously, only the global demotion statistics were accounted for. Changed them to per-node statistics, making it easier to observe where demotion occurs on each node. This will help to identify which nodes are under pressure. This patch also make pgdemote_* behind CONFIG_NUMA_BALANCING, since demotion is not available for !CONFIG_NUMA_BALANCING With this patch, here is a sample where node0 node1 are DRAM, node3 is PMEM: Global stats: $ grep demote /proc/vmstat pgdemote_kswapd 254288 pgdemote_direct 113497 pgdemote_khugepaged 0 Per-node stats: $ grep demote /sys/devices/system/node/node0/vmstat # demotion source pgdemote_kswapd 68454 pgdemote_direct 83431 pgdemote_khugepaged 0 $ grep demote /sys/devices/system/node/node1/vmstat # demotion source pgdemote_kswapd 185834 pgdemote_direct 30066 pgdemote_khugepaged 0 $ grep demote /sys/devices/system/node/node3/vmstat # demotion target pgdemote_kswapd 0 pgdemote_direct 0 pgdemote_khugepaged 0 Link: https://lkml.kernel.org/r/20231103031450.1456523-1-lizhijian@fujitsu.com Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Acked-by: "Huang, Ying" <ying.huang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>