aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2017-02-22mm, page_alloc: don't convert pfn to idx when mergingGravatar Vlastimil Babka 3-23/+20
In __free_one_page() we do the buddy merging arithmetics on "page/buddy index", which is just the lower MAX_ORDER bits of pfn. The operations we do that affect the higher bits are bitwise AND and subtraction (in that order), where the final result will be the same with the higher bits left unmasked, as long as these bits are equal for both buddies - which must be true by the definition of a buddy. We can therefore use pfn's directly instead of "index" and skip the zeroing of >MAX_ORDER bits. This can help a bit by itself, although compiler might be smart enough already. It also helps the next patch to avoid page_to_pfn() for memory hole checks. Link: http://lkml.kernel.org/r/20161216120009.20064-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm: throttle show_mem() from warn_alloc()Gravatar Michal Hocko 1-11/+18
Tetsuo has been stressing OOM killer path with many parallel allocation requests when he has noticed that it is not all that hard to swamp kernel logs with warn_alloc messages caused by allocation stalls. Even though the allocation stall message is triggered only once in 10s there might be many different tasks hitting it roughly around the same time. A big part of the output is show_mem() which can generate a lot of output even on a small machines. There is no reason to show the state of memory counter for each allocation stall, especially when multiple of them are reported in a short time period. Chances are that not much has changed since the last report. This patch simply rate limits show_mem called from warn_alloc to only dump something once per second. This should be enough to give us a clue why an allocation might be stalling while burst of warnings will not swamp log with too much data. While we are at it, extract all the show_mem related handling (filters) into a separate function warn_alloc_show_mem. This will make the code cleaner and as a bonus point we can distinguish which part of warn_alloc got throttled due to rate limiting as ___ratelimit dumps the caller. [akpm@linux-foundation.org: reduce scope of the ratelimit_states] Link: http://lkml.kernel.org/r/20161215101510.9030-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22tmpfs: change shmem_mapping() to test shmem_aopsGravatar Hugh Dickins 1-4/+1
Callers of shmem_mapping() are interested in whether the mapping is swap backed - except for uprobes, which is interested in whether it should use shmem_read_mapping_page(). All these callers are better served by a shmem_mapping() which checks for shmem_aops, than the current version which goes through several indirections to find where the inode lives - and has the surprising effect that a private mmap of /dev/zero satisfies both vma_is_anonymous() and shmem_mapping(), when that device node is on devtmpfs. I don't think anything in the tree suffers from that surprise, but it caught me out, and is better fixed. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052148530.13021@eggly.anvils Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22slub: make sysfs directories for memcg sub-caches optionalGravatar Tejun Heo 1-2/+24
SLUB creates a per-cache directory under /sys/kernel/slab which hosts a bunch of debug files. Usually, there aren't that many caches on a system and this doesn't really matter; however, if memcg is in use, each cache can have per-cgroup sub-caches. SLUB creates the same directories for these sub-caches under /sys/kernel/slab/$CACHE/cgroup. Unfortunately, because there can be a lot of cgroups, active or draining, the product of the numbers of caches, cgroups and files in each directory can reach a very high number - hundreds of thousands is commonplace. Millions and beyond aren't difficult to reach either. What's under /sys/kernel/slab is primarily for debugging and the information and control on the a root cache already cover its sub-caches. While having a separate directory for each sub-cache can be helpful for development, it doesn't make much sense to pay this amount of overhead by default. This patch introduces a boot parameter slub_memcg_sysfs which determines whether to create sysfs directories for per-memcg sub-caches. It also adds CONFIG_SLUB_MEMCG_SYSFS_ON which determines the boot parameter's default value and defaults to 0. [akpm@linux-foundation.org: kset_unregister(NULL) is legal] Link: http://lkml.kernel.org/r/20170204145203.GB26958@mtj.duckdns.org Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22slab: use memcg_kmem_cache_wq for slab destruction operationsGravatar Tejun Heo 2-9/+9
If there's contention on slab_mutex, queueing the per-cache destruction work item on the system_wq can unnecessarily create and tie up a lot of kworkers. Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq and make it global and use that workqueue for the destruction work items too. While at it, convert the workqueue from an unbound workqueue to a per-cpu one with concurrency limited to 1. It's generally preferable to use per-cpu workqueues and concurrency limit of 1 is safe enough. This is suggested by Joonsoo Kim. Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov@tarantool.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22slab: remove slub sysfs interface files early for empty memcg cachesGravatar Tejun Heo 1-2/+23
With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. Each cache has a number of sysfs interface files under /sys/kernel/slab. On a system with a lot of memory and transient memcgs, the number of interface files which have to be removed once memory reclaim kicks in can reach millions. Link: http://lkml.kernel.org/r/20170117235411.9408-10-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22slab: remove synchronous synchronize_sched() from memcg cache deactivation pathGravatar Tejun Heo 3-4/+70
With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. slub uses synchronize_sched() to deactivate a memcg cache. synchronize_sched() is an expensive and slow operation and doesn't scale when a huge number of caches are destroyed back-to-back. While there used to be a simple batching mechanism, the batching was too restricted to be helpful. This patch implements slab_deactivate_memcg_cache_rcu_sched() which slub can use to schedule sched RCU callback instead of performing synchronize_sched() synchronously while holding cgroup_mutex. While this adds online cpus, mems and slab_mutex operations, operating on these locks back-to-back from the same kworker, which is what's gonna happen when there are many to deactivate, isn't expensive at all and this gets rid of the scalability problem completely. Link: http://lkml.kernel.org/r/20170117235411.9408-9-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22slab: introduce __kmemcg_cache_deactivate()Gravatar Tejun Heo 5-23/+36
__kmem_cache_shrink() is called with %true @deactivate only for memcg caches. Remove @deactivate from __kmem_cache_shrink() and introduce __kmemcg_cache_deactivate() instead. Each memcg-supporting allocator should implement it and it should deactivate and drain the cache. This is to allow memcg cache deactivation behavior to further deviate from simple shrinking without messing up __kmem_cache_shrink(). This is pure reorganization and doesn't introduce any observable behavior changes. v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir. Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22slab: implement slab_root_caches listGravatar Tejun Heo 3-25/+50
With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. slab_caches currently lists all caches including root and memcg ones. This is the only data structure which lists the root caches and iterating root caches can only be done by walking the list while skipping over memcg caches. As there can be a huge number of memcg caches, this can become very expensive. This also can make /proc/slabinfo behave very badly. seq_file processes reads in 4k chunks and seeks to the previous Nth position on slab_caches list to resume after each chunk. With a lot of memcg cache churns on the list, reading /proc/slabinfo can become very slow and its content often ends up with duplicate and/or missing entries. This patch adds a new list slab_root_caches which lists only the root caches. When memcg is not enabled, it becomes just an alias of slab_caches. memcg specific list operations are collected into memcg_[un]link_cache(). Link: http://lkml.kernel.org/r/20170117235411.9408-7-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov@tarantool.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22slab: link memcg kmem_caches on their associated memory cgroupGravatar Tejun Heo 3-10/+36
With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. While a memcg kmem_cache is listed on its root cache's ->children list, there is no direct way to iterate all kmem_caches which are assocaited with a memory cgroup. The only way to iterate them is walking all caches while filtering out caches which don't match, which would be most of them. This makes memcg destruction operations O(N^2) where N is the total number of slab caches which can be huge. This combined with the synchronous RCU operations can tie up a CPU and affect the whole machine for many hours when memory reclaim triggers offlining and destruction of the stale memcgs. This patch adds mem_cgroup->kmem_caches list which goes through memcg_cache_params->kmem_caches_node of all kmem_caches which are associated with the memcg. All memcg specific iterations, including stat file access, are updated to use the new list instead. Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22slab: reorganize memcg_cache_paramsGravatar Tejun Heo 2-15/+16
We're going to change how memcg caches are iterated. In preparation, clean up and reorganize memcg_cache_params. * The shared ->list is replaced by ->children in root and ->children_node in children. * ->is_root_cache is removed. Instead ->root_cache is moved out of the child union and now used by both root and children. NULL indicates root cache. Non-NULL a memcg one. This patch doesn't cause any observable behavior changes. Link: http://lkml.kernel.org/r/20170117235411.9408-5-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22slab: remove synchronous rcu_barrier() call in memcg cache release pathGravatar Tejun Heo 1-42/+60
With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. SLAB_DESTORY_BY_RCU caches need to flush all RCU operations before destruction because slab pages are freed through RCU and they need to be able to dereference the associated kmem_cache. Currently, it's done synchronously with rcu_barrier(). As rcu_barrier() is expensive time-wise, slab implements a batching mechanism so that rcu_barrier() can be done for multiple caches at the same time. Unfortunately, the rcu_barrier() is in synchronous path which is called while holding cgroup_mutex and the batching is too limited to be actually helpful. This patch updates the cache release path so that the batching is asynchronous and global. All SLAB_DESTORY_BY_RCU caches are queued globally and a work item consumes the list. The work item calls rcu_barrier() only once for all caches that are currently queued. * release_caches() is removed and shutdown_cache() now either directly release the cache or schedules a RCU callback to do that. This makes the cache inaccessible once shutdown_cache() is called and makes it impossible for shutdown_memcg_caches() to do memcg-specific cleanups afterwards. Move memcg-specific part into a helper, unlink_memcg_cache(), and make shutdown_cache() call it directly. Link: http://lkml.kernel.org/r/20170117235411.9408-4-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov@tarantool.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22slub: separate out sysfs_slab_release() from sysfs_slab_remove()Gravatar Tejun Heo 2-3/+11
Separate out slub sysfs removal and release, and call the former earlier from __kmem_cache_shutdown(). There's no reason to defer sysfs removal through RCU and this will later allow us to remove sysfs files way earlier during memory cgroup offline instead of release. Link: http://lkml.kernel.org/r/20170117235411.9408-3-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22Revert "slub: move synchronize_sched out of slab_mutex on shrink"Gravatar Tejun Heo 5-31/+23
Patch series "slab: make memcg slab destruction scalable", v3. With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. I've seen machines which end up with hundred thousands of caches and many millions of kernfs_nodes. The current code is O(N^2) on the total number of caches and has synchronous rcu_barrier() and synchronize_sched() in cgroup offline / release path which is executed while holding cgroup_mutex. Combined, this leads to very expensive and slow cache destruction operations which can easily keep running for half a day. This also messes up /proc/slabinfo along with other cache iterating operations. seq_file operates on 4k chunks and on each 4k boundary tries to seek to the last position in the list. With a huge number of caches on the list, this becomes very slow and very prone to the list content changing underneath it leading to a lot of missing and/or duplicate entries. This patchset addresses the scalability problem. * Add root and per-memcg lists. Update each user to use the appropriate list. * Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched and asynchronous. * For dying empty slub caches, remove the sysfs files after deactivation so that we don't end up with millions of sysfs files without any useful information on them. This patchset contains the following nine patches. 0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch 0002-slub-separate-out-sysfs_slab_release-from-sysfs_slab.patch 0003-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch 0004-slab-reorganize-memcg_cache_params.patch 0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch 0006-slab-implement-slab_root_caches-list.patch 0007-slab-introduce-__kmemcg_cache_deactivate.patch 0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch 0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch 0010-slab-use-memcg_kmem_cache_wq-for-slab-destruction-op.patch 0001 reverts an existing optimization to prepare for the following changes. 0002 is a prep patch. 0003 makes rcu_barrier() in release path batched and asynchronous. 0004-0006 separate out the lists. 0007-0008 replace synchronize_sched() in slub destruction path with call_rcu_sched(). 0009 removes sysfs files early for empty dying caches. 0010 makes destruction work items use a workqueue with limited concurrency. This patch (of 10): Revert 89e364db71fb5e ("slub: move synchronize_sched out of slab_mutex on shrink"). With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. Moving synchronize_sched() out of slab_mutex isn't enough as it's still inside cgroup_mutex. The whole deactivation / release path will be updated to avoid all synchronous RCU operations. Revert this insufficient optimization in preparation to ease future changes. Link: http://lkml.kernel.org/r/20170117235411.9408-2-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, slab: rename kmalloc-node cache to kmalloc-<size>Gravatar Vlastimil Babka 3-5/+9
SLAB as part of its bootstrap pre-creates one kmalloc cache that can fit the kmem_cache_node management structure, and puts it into the generic kmalloc cache array (e.g. for 128b objects). The name of this cache is "kmalloc-node", which is confusing for readers of /proc/slabinfo as the cache is used for generic allocations (and not just the kmem_cache_node struct) and it appears as the kmalloc-128 cache is missing. An easy solution is to use the kmalloc-<size> name when pre-creating the cache, which we can get from the kmalloc_info array. Example /proc/slabinfo before the patch: ... kmalloc-256 1647 1984 256 16 1 : tunables 120 60 8 : slabdata 124 124 828 kmalloc-192 1974 1974 192 21 1 : tunables 120 60 8 : slabdata 94 94 133 kmalloc-96 1332 1344 128 32 1 : tunables 120 60 8 : slabdata 42 42 219 kmalloc-64 2505 5952 64 64 1 : tunables 120 60 8 : slabdata 93 93 715 kmalloc-32 4278 4464 32 124 1 : tunables 120 60 8 : slabdata 36 36 346 kmalloc-node 1352 1376 128 32 1 : tunables 120 60 8 : slabdata 43 43 53 kmem_cache 132 147 192 21 1 : tunables 120 60 8 : slabdata 7 7 0 After the patch: ... kmalloc-256 1672 2160 256 16 1 : tunables 120 60 8 : slabdata 135 135 807 kmalloc-192 1992 2016 192 21 1 : tunables 120 60 8 : slabdata 96 96 203 kmalloc-96 1159 1184 128 32 1 : tunables 120 60 8 : slabdata 37 37 116 kmalloc-64 2561 4864 64 64 1 : tunables 120 60 8 : slabdata 76 76 785 kmalloc-32 4253 4340 32 124 1 : tunables 120 60 8 : slabdata 35 35 270 kmalloc-128 1256 1280 128 32 1 : tunables 120 60 8 : slabdata 40 40 39 kmem_cache 125 147 192 21 1 : tunables 120 60 8 : slabdata 7 7 0 [vbabka@suse.cz: export the whole kmalloc_info structure instead of just a name accessor, per Christoph Lameter] Link: http://lkml.kernel.org/r/54e80303-b814-4232-66d4-95b34d3eb9d0@suse.cz Link: http://lkml.kernel.org/r/20170203181008.24898-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm/slub: add a dump_stack() to the unexpected GFP checkGravatar Borislav Petkov 1-0/+1
We wish to know who is doing such a thing. slab.c does this. Link: http://lkml.kernel.org/r/20170116091643.15260-1-bp@alien8.de Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22slub: do not merge cache if slub_debug contains a never-merge flagGravatar Grygorii Maistrenko 1-1/+4
In case CONFIG_SLUB_DEBUG_ON=n, find_mergeable() gets debug features from commandline but never checks if there are features from the SLAB_NEVER_MERGE set. As a result selected by slub_debug caches are always mergeable if they have been created without a custom constructor set or without one of the SLAB_* debug features on. This moves the SLAB_NEVER_MERGE check below the flags update from commandline to make sure it won't merge the slab cache if one of the debug features is on. Link: http://lkml.kernel.org/r/20170101124451.GA4740@lp-laptop-d Signed-off-by: Grygorii Maistrenko <grygoriimkd@gmail.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, dax: change pmd_fault() to take only vmf parameterGravatar Dave Jiang 1-5/+4
pmd_fault() and related functions really only need the vmf parameter since the additional parameters are all included in the vmf struct. Remove the additional parameter and simplify pmd_fault() and friends. Link: http://lkml.kernel.org/r/1484085142-2297-8-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, dax: make pmd_fault() and friends be the same as fault()Gravatar Dave Jiang 1-4/+2
Instead of passing in multiple parameters in the pmd_fault() handler, a vmf can be passed in just like a fault() handler. This will simplify code and remove the need for the actual pmd fault handlers to allocate a vmf. Related functions are also modified to do the same. [dave.jiang@intel.com: fix issue with xfs_tests stall when DAX option is off] Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com Link: http://lkml.kernel.org/r/1484085142-2297-7-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22Merge tag 'arm64-upstream' of ↵Gravatar Linus Torvalds 3-19/+15
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: - Errata workarounds for Qualcomm's Falkor CPU - Qualcomm L2 Cache PMU driver - Qualcomm SMCCC firmware quirk - Support for DEBUG_VIRTUAL - CPU feature detection for userspace via MRS emulation - Preliminary work for the Statistical Profiling Extension - Misc cleanups and non-critical fixes * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (74 commits) arm64/kprobes: consistently handle MRS/MSR with XZR arm64: cpufeature: correctly handle MRS to XZR arm64: traps: correctly handle MRS/MSR with XZR arm64: ptrace: add XZR-safe regs accessors arm64: include asm/assembler.h in entry-ftrace.S arm64: fix warning about swapper_pg_dir overflow arm64: Work around Falkor erratum 1003 arm64: head.S: Enable EL1 (host) access to SPE when entered at EL2 arm64: arch_timer: document Hisilicon erratum 161010101 arm64: use is_vmalloc_addr arm64: use linux/sizes.h for constants arm64: uaccess: consistently check object sizes perf: add qcom l2 cache perf events driver arm64: remove wrong CONFIG_PROC_SYSCTL ifdef ARM: smccc: Update HVC comment to describe new quirk parameter arm64: do not trace atomic operations ACPI/IORT: Fix the error return code in iort_add_smmu_platform_device() ACPI/IORT: Fix iort_node_get_id() mapping entries indexing arm64: mm: enable CONFIG_HOLES_IN_ZONE for NUMA perf: xgene: Include module.h ...
2017-02-17Merge branch 'for-4.11/next' into for-4.11/linus-mergeGravatar Jens Axboe 2-5/+42
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-08mm/slub.c: fix random_seq offset destructionGravatar Sean Rees 1-0/+4
Commit 210e7a43fa90 ("mm: SLUB freelist randomization") broke USB hub initialisation as described in https://bugzilla.kernel.org/show_bug.cgi?id=177551. Bail out early from init_cache_random_seq if s->random_seq is already initialised. This prevents destroying the previously computed random_seq offsets later in the function. If the offsets are destroyed, then shuffle_freelist will truncate page->freelist to just the first object (orphaning the rest). Fixes: 210e7a43fa90 ("mm: SLUB freelist randomization") Link: http://lkml.kernel.org/r/20170207140707.20824-1-sean@erifax.org Signed-off-by: Sean Rees <sean@erifax.org> Reported-by: <userwithuid@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Thomas Garnier <thgarnie@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-08block: fix double-free in the failure path of cgwb_bdi_init()Gravatar Tejun Heo 1-2/+7
When !CONFIG_CGROUP_WRITEBACK, bdi has single bdi_writeback_congested at bdi->wb_congested. cgwb_bdi_init() allocates it with kzalloc() and doesn't do further initialization. This usually works fine as the reference count gets bumped to 1 by wb_init() and the put from wb_exit() releases it. However, when wb_init() fails, it puts the wb base ref automatically freeing the wb and the explicit kfree() in cgwb_bdi_init() error path ends up trying to free the same pointer the second time causing a double-free. Fix it by explicitly initilizing the refcnt to 1 and putting the base ref from cgwb_bdi_destroy(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dmitry Vyukov <dvyukov@google.com> Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-03mm, fs: check for fatal signals in do_generic_file_read()Gravatar Michal Hocko 1-0/+5
do_generic_file_read() can be told to perform a large request from userspace. If the system is under OOM and the reading task is the OOM victim then it has an access to memory reserves and finishing the full request can lead to the full memory depletion which is dangerous. Make sure we rather go with a short read and allow the killed task to terminate. Link: http://lkml.kernel.org/r/20170201092706.9966-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03base/memory, hotplug: fix a kernel oops in show_valid_zones()Gravatar Toshi Kani 1-5/+15
Reading a sysfs "memoryN/valid_zones" file leads to the following oops when the first page of a range is not backed by struct page. show_valid_zones() assumes that 'start_pfn' is always valid for page_zone(). BUG: unable to handle kernel paging request at ffffea017a000000 IP: show_valid_zones+0x6f/0x160 This issue may happen on x86-64 systems with 64GiB or more memory since their memory block size is bumped up to 2GiB. [1] An example of such systems is desribed below. 0x3240000000 is only aligned by 1GiB and this memory block starts from 0x3200000000, which is not backed by struct page. BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable Since test_pages_in_a_zone() already checks holes, fix this issue by extending this function to return 'valid_start' and 'valid_end' for a given range. show_valid_zones() then proceeds with the valid range. [1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64 systems")' Link: http://lkml.kernel.org/r/20170127222149.30893-3-toshi.kani@hpe.com Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Zhang Zhen <zhenzhang.zhang@huawei.com> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: <stable@vger.kernel.org> [4.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()Gravatar Toshi Kani 1-4/+8
Patch series "fix a kernel oops when reading sysfs valid_zones", v2. A sysfs memory file is created for each 2GiB memory block on x86-64 when the system has 64GiB or more memory. [1] When the start address of a memory block is not backed by struct page, i.e. a memory range is not aligned by 2GiB, reading its 'valid_zones' attribute file leads to a kernel oops. This issue was observed on multiple x86-64 systems with more than 64GiB of memory. This patch-set fixes this issue. Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not test the start section. Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone() to return valid [start, end). Note for stable kernels: The memory block size change was made by commit bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64 systems"), which was accepted to 3.9. However, this patch-set depends on (and fixes) the change to test_pages_in_a_zone() made by commit 5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in test_pages_in_a_zone()"), which was accepted to 4.4. So, I recommend that we backport it up to 4.4. [1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64 systems")' This patch (of 2): test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by section since 'sec_end_pfn' is set equal to 'pfn'. Since this function is called for testing the range of a sysfs memory file, 'start_pfn' is always aligned by section. Fix it by properly setting 'sec_end_pfn' to the next section pfn. Also make sure that this function returns 1 only when the range belongs to a zone. Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Andrew Banman <abanman@sgi.com> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Greg KH <greg@kroah.com> Cc: <stable@vger.kernel.org> [4.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03shmem: fix sleeping from atomic contextGravatar Kirill A. Shutemov 1-2/+9
Syzkaller fuzzer managed to trigger this: BUG: sleeping function called from invalid context at mm/shmem.c:852 in_atomic(): 1, irqs_disabled(): 0, pid: 529, name: khugepaged 3 locks held by khugepaged/529: #0: (shrinker_rwsem){++++..}, at: [<ffffffff818d7ef1>] shrink_slab.part.59+0x121/0xd30 mm/vmscan.c:451 #1: (&type->s_umount_key#29){++++..}, at: [<ffffffff81a63630>] trylock_super+0x20/0x100 fs/super.c:392 #2: (&(&sbinfo->shrinklist_lock)->rlock){+.+.-.}, at: [<ffffffff818fd83e>] spin_lock include/linux/spinlock.h:302 [inline] #2: (&(&sbinfo->shrinklist_lock)->rlock){+.+.-.}, at: [<ffffffff818fd83e>] shmem_unused_huge_shrink+0x28e/0x1490 mm/shmem.c:427 CPU: 2 PID: 529 Comm: khugepaged Not tainted 4.10.0-rc5+ #201 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: shmem_undo_range+0xb20/0x2710 mm/shmem.c:852 shmem_truncate_range+0x27/0xa0 mm/shmem.c:939 shmem_evict_inode+0x35f/0xca0 mm/shmem.c:1030 evict+0x46e/0x980 fs/inode.c:553 iput_final fs/inode.c:1515 [inline] iput+0x589/0xb20 fs/inode.c:1542 shmem_unused_huge_shrink+0xbad/0x1490 mm/shmem.c:446 shmem_unused_huge_scan+0x10c/0x170 mm/shmem.c:512 super_cache_scan+0x376/0x450 fs/super.c:106 do_shrink_slab mm/vmscan.c:378 [inline] shrink_slab.part.59+0x543/0xd30 mm/vmscan.c:481 shrink_slab mm/vmscan.c:2592 [inline] shrink_node+0x2c7/0x870 mm/vmscan.c:2592 shrink_zones mm/vmscan.c:2734 [inline] do_try_to_free_pages+0x369/0xc80 mm/vmscan.c:2776 try_to_free_pages+0x3c6/0x900 mm/vmscan.c:2982 __perform_reclaim mm/page_alloc.c:3301 [inline] __alloc_pages_direct_reclaim mm/page_alloc.c:3322 [inline] __alloc_pages_slowpath+0xa24/0x1c30 mm/page_alloc.c:3683 __alloc_pages_nodemask+0x544/0xae0 mm/page_alloc.c:3848 __alloc_pages include/linux/gfp.h:426 [inline] __alloc_pages_node include/linux/gfp.h:439 [inline] khugepaged_alloc_page+0xc2/0x1b0 mm/khugepaged.c:750 collapse_huge_page+0x182/0x1fe0 mm/khugepaged.c:955 khugepaged_scan_pmd+0xfdf/0x12a0 mm/khugepaged.c:1208 khugepaged_scan_mm_slot mm/khugepaged.c:1727 [inline] khugepaged_do_scan mm/khugepaged.c:1808 [inline] khugepaged+0xe9b/0x1590 mm/khugepaged.c:1853 kthread+0x326/0x3f0 kernel/kthread.c:227 ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 The iput() from atomic context was a bad idea: if after igrab() somebody else calls iput() and we left with the last inode reference, our iput() would lead to inode eviction and therefore sleeping. This patch should fix the situation. Link: http://lkml.kernel.org/r/20170131093141.GA15899@node.shutemov.name Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03kasan: respect /proc/sys/kernel/traceoff_on_warningGravatar Peter Zijlstra 1-0/+3
After much waiting I finally reproduced a KASAN issue, only to find my trace-buffer empty of useful information because it got spooled out :/ Make kasan_report honour the /proc/sys/kernel/traceoff_on_warning interface. Link: http://lkml.kernel.org/r/20170125164106.3514-1-aryabinin@virtuozzo.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03zswap: disable changing params if init failsGravatar Dan Streetman 1-1/+29
Add zswap_init_failed bool that prevents changing any of the module params, if init_zswap() fails, and set zswap_enabled to false. Change 'enabled' param to a callback, and check zswap_init_failed before allowing any change to 'enabled', 'zpool', or 'compressor' params. Any driver that is built-in to the kernel will not be unloaded if its init function returns error, and its module params remain accessible for users to change via sysfs. Since zswap uses param callbacks, which assume that zswap has been initialized, changing the zswap params after a failed initialization will result in WARNING due to the param callbacks expecting a pool to already exist. This prevents that by immediately exiting any of the param callbacks if initialization failed. This was reported here: https://marc.info/?l=linux-mm&m=147004228125528&w=4 And fixes this WARNING: [ 429.723476] WARNING: CPU: 0 PID: 5140 at mm/zswap.c:503 __zswap_pool_current+0x56/0x60 The warning is just noise, and not serious. However, when init fails, zswap frees all its percpu dstmem pages and its kmem cache. The kmem cache might be serious, if kmem_cache_alloc(NULL, gfp) has problems; but the percpu dstmem pages are definitely a problem, as they're used as temporary buffer for compressed pages before copying into place in the zpool. If the user does get zswap enabled after an init failure, then zswap will likely Oops on the first page it tries to compress (or worse, start corrupting memory). Fixes: 90b0fc26d5db ("zswap: change zpool/compressor at runtime") Link: http://lkml.kernel.org/r/20170124200259.16191-2-ddstreet@ieee.org Signed-off-by: Dan Streetman <dan.streetman@canonical.com> Reported-by: Marcin Miroslaw <marcin@mejor.pl> Cc: Seth Jennings <sjenning@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-02block: Dynamically allocate and refcount backing_dev_infoGravatar Jan Kara 1-1/+33
Instead of storing backing_dev_info inside struct request_queue, allocate it dynamically, reference count it, and free it when the last reference is dropped. Currently only request_queue holds the reference but in the following patch we add other users referencing backing_dev_info. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02block: Use pointer to backing_dev_info from request_queueGravatar Jan Kara 1-2/+2
We will want to have struct backing_dev_info allocated separately from struct request_queue. As the first step add pointer to backing_dev_info to request_queue and convert all users touching it. No functional changes in this patch. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-24mm, page_alloc: fix premature OOM when racing with cpuset mems updateGravatar Vlastimil Babka 1-11/+24
Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode triggers OOM killer in few seconds, despite lots of free memory. The test attempts to repeatedly fault in memory in one process in a cpuset, while changing allowed nodes of the cpuset between 0 and 1 in another process. The problem comes from insufficient protection against cpuset changes, which can cause get_page_from_freelist() to consider all zones as non-eligible due to nodemask and/or current->mems_allowed. This was masked in the past by sufficient retries, but since commit 682a3385e773 ("mm, page_alloc: inline the fast path of the zonelist iterator") we fix the preferred_zoneref once, and don't iterate over the whole zonelist in further attempts, thus the only eligible zones might be placed in the zonelist before our starting point and we always miss them. A previous patch fixed this problem for current->mems_allowed. However, cpuset changes also update the task's mempolicy nodemask. The fix has two parts. We have to repeat the preferred_zoneref search when we detect cpuset update by way of seqcount, and we have to check the seqcount before considering OOM. [akpm@linux-foundation.org: fix typo in comment] Link: http://lkml.kernel.org/r/20170120103843.24587-5-vbabka@suse.cz Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Ganapatrao Kulkarni <gpkulkarni@gmail.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24mm, page_alloc: move cpuset seqcount checking to slowpathGravatar Vlastimil Babka 1-21/+26
This is a preparation for the following patch to make review simpler. While the primary motivation is a bug fix, this also simplifies the fast path, although the moved code is only enabled when cpusets are in use. Link: http://lkml.kernel.org/r/20170120103843.24587-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Ganapatrao Kulkarni <gpkulkarni@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24mm, page_alloc: fix fast-path race with cpuset update or removalGravatar Vlastimil Babka 1-1/+9
Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode triggers OOM killer in few seconds, despite lots of free memory. The test attempts to repeatedly fault in memory in one process in a cpuset, while changing allowed nodes of the cpuset between 0 and 1 in another process. One possible cause is that in the fast path we find the preferred zoneref according to current mems_allowed, so that it points to the middle of the zonelist, skipping e.g. zones of node 1 completely. If the mems_allowed is updated to contain only node 1, we never reach it in the zonelist, and trigger OOM before checking the cpuset_mems_cookie. This patch fixes the particular case by redoing the preferred zoneref search if we switch back to the original nodemask. The condition is also slightly changed so that when the last non-root cpuset is removed, we don't miss it. Note that this is not a full fix, and more patches will follow. Link: http://lkml.kernel.org/r/20170120103843.24587-3-vbabka@suse.cz Fixes: 682a3385e773 ("mm, page_alloc: inline the fast path of the zonelist iterator") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Ganapatrao Kulkarni <gpkulkarni@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24mm, page_alloc: fix check for NULL preferred_zoneGravatar Vlastimil Babka 1-1/+1
Patch series "fix premature OOM regression in 4.7+ due to cpuset races". This is v2 of my attempt to fix the recent report based on LTP cpuset stress test [1]. The intention is to go to stable 4.9 LTSS with this, as triggering repeated OOMs is not nice. That's why the patches try to be not too intrusive. Unfortunately why investigating I found that modifying the testcase to use per-VMA policies instead of per-task policies will bring the OOM's back, but that seems to be much older and harder to fix problem. I have posted a RFC [2] but I believe that fixing the recent regressions has a higher priority. Longer-term we might try to think how to fix the cpuset mess in a better and less error prone way. I was for example very surprised to learn, that cpuset updates change not only task->mems_allowed, but also nodemask of mempolicies. Until now I expected the parameter to alloc_pages_nodemask() to be stable. I wonder why do we then treat cpusets specially in get_page_from_freelist() and distinguish HARDWALL etc, when there's unconditional intersection between mempolicy and cpuset. I would expect the nodemask adjustment for saving overhead in g_p_f(), but that clearly doesn't happen in the current form. So we have both crazy complexity and overhead, AFAICS. [1] https://lkml.kernel.org/r/CAFpQJXUq-JuEP=QPidy4p_=FN0rkH5Z-kfB4qBvsf6jMS87Edg@mail.gmail.com [2] https://lkml.kernel.org/r/7c459f26-13a6-a817-e508-b65b903a8378@suse.cz This patch (of 4): Since commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice") we have a wrong check for NULL preferred_zone, which can theoretically happen due to concurrent cpuset modification. We check the zoneref pointer which is never NULL and we should check the zone pointer. Also document this in first_zones_zonelist() comment per Michal Hocko. Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice") Link: http://lkml.kernel.org/r/20170120103843.24587-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Ganapatrao Kulkarni <gpkulkarni@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24mm/mempolicy.c: do not put mempolicy before using its nodemaskGravatar Vlastimil Babka 1-1/+1
Since commit be97a41b291e ("mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma") alloc_pages_vma() can potentially free a mempolicy by mpol_cond_put() before accessing the embedded nodemask by __alloc_pages_nodemask(). The commit log says it's so "we can use a single exit path within the function" but that's clearly wrong. We can still do that when doing mpol_cond_put() after the allocation attempt. Make sure the mempolicy is not freed prematurely, otherwise __alloc_pages_nodemask() can end up using a bogus nodemask, which could lead e.g. to premature OOM. Fixes: be97a41b291e ("mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma") Link: http://lkml.kernel.org/r/20170118141124.8345-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24mm, memcg: do not retry precharge chargesGravatar David Rientjes 1-2/+2
When memory.move_charge_at_immigrate is enabled and precharges are depleted during move, mem_cgroup_move_charge_pte_range() will attempt to increase the size of the precharge. Prevent precharges from ever looping by setting __GFP_NORETRY. This was probably the intention of the GFP_KERNEL & ~__GFP_NORETRY, which is pointless as written. Fixes: 0029e19ebf84 ("mm: memcontrol: remove explicit OOM parameter in charge path") Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701130208510.69402@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24mm: alloc_contig: re-allow CMA to compact FS pagesGravatar Lucas Stach 1-0/+1
Commit 73e64c51afc5 ("mm, compaction: allow compaction for GFP_NOFS requests") changed compation to skip FS pages if not explicitly allowed to touch them, but missed to update the CMA compact_control. This leads to a very high isolation failure rate, crippling performance of CMA even on a lightly loaded system. Re-allow CMA to compact FS pages by setting the correct GFP flags, restoring CMA behavior and performance to the kernel 4.9 level. Fixes: 73e64c51afc5 (mm, compaction: allow compaction for GFP_NOFS requests) Link: http://lkml.kernel.org/r/20170113115155.24335-1-l.stach@pengutronix.de Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24mm/slub.c: trace free objects at KERN_INFOGravatar Daniel Thompson 1-10/+13
Currently when trace is enabled (e.g. slub_debug=T,kmalloc-128 ) the trace messages are mostly output at KERN_INFO. However the trace code also calls print_section() to hexdump the head of a free object. This is hard coded to use KERN_ERR, meaning the console is deluged with trace messages even if we've asked for quiet. Fix this the obvious way but adding a level parameter to print_section(), allowing calls from the trace code to use the same trace level as other trace messages. Link: http://lkml.kernel.org/r/20170113154850.518-1-daniel.thompson@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thpGravatar Keno Fischer 1-1/+17
In commit 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()"), the mm code was changed from unsetting FOLL_WRITE after a COW was resolved to setting the (newly introduced) FOLL_COW instead. Simultaneously, the check in gup.c was updated to still allow writes with FOLL_FORCE set if FOLL_COW had also been set. However, a similar check in huge_memory.c was forgotten. As a result, remote memory writes to ro regions of memory backed by transparent huge pages cause an infinite loop in the kernel (handle_mm_fault sets FOLL_COW and returns 0 causing a retry, but follow_trans_huge_pmd bails out immidiately because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is true. While in this state the process is stil SIGKILLable, but little else works (e.g. no ptrace attach, no other signals). This is easily reproduced with the following code (assuming thp are set to always): #include <assert.h> #include <fcntl.h> #include <stdint.h> #include <stdio.h> #include <string.h> #include <sys/mman.h> #include <sys/stat.h> #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #define TEST_SIZE 5 * 1024 * 1024 int main(void) { int status; pid_t child; int fd = open("/proc/self/mem", O_RDWR); void *addr = mmap(NULL, TEST_SIZE, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); assert(addr != MAP_FAILED); pid_t parent_pid = getpid(); if ((child = fork()) == 0) { void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); assert(addr2 != MAP_FAILED); memset(addr2, 'a', TEST_SIZE); pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr); return 0; } assert(child == waitpid(child, &status, 0)); assert(WIFEXITED(status) && WEXITSTATUS(status) == 0); return 0; } Fix this by updating follow_trans_huge_pmd in huge_memory.c analogously to the update in gup.c in the original commit. The same pattern exists in follow_devmap_pmd. However, we should not be able to reach that check with FOLL_COW set, so add WARN_ONCE to make sure we notice if we ever do. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170106015025.GA38411@juliacomputing.com Signed-off-by: Keno Fischer <keno@juliacomputing.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Greg Thelen <gthelen@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24memory_hotplug: make zone_can_shift() return a boolean valueGravatar Yasuaki Ishimatsu 1-11/+17
online_{kernel|movable} is used to change the memory zone to ZONE_{NORMAL|MOVABLE} and online the memory. To check that memory zone can be changed, zone_can_shift() is used. Currently the function returns minus integer value, plus integer value and 0. When the function returns minus or plus integer value, it means that the memory zone can be changed to ZONE_{NORNAL|MOVABLE}. But when the function returns 0, there are two meanings. One of the meanings is that the memory zone does not need to be changed. For example, when memory is in ZONE_NORMAL and onlined by online_kernel the memory zone does not need to be changed. Another meaning is that the memory zone cannot be changed. When memory is in ZONE_NORMAL and onlined by online_movable, the memory zone may not be changed to ZONE_MOVALBE due to memory online limitation(see Documentation/memory-hotplug.txt). In this case, memory must not be onlined. The patch changes the return type of zone_can_shift() so that memory online operation fails when memory zone cannot be changed as follows: Before applying patch: # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 7864320 managed 7864320 # echo online_movable > memory4097/state # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 8388608 managed 8388608 online_movable operation succeeded. But memory is onlined as ZONE_NORMAL, not ZONE_MOVABLE. After applying patch: # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 7864320 managed 7864320 # echo online_movable > memory4097/state bash: echo: write error: Invalid argument # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 7864320 managed 7864320 online_movable operation failed because of failure of changing the memory zone from ZONE_NORMAL to ZONE_MOVABLE Fixes: df429ac03936 ("memory-hotplug: more general validation of zone during online") Link: http://lkml.kernel.org/r/2f9c3837-33d7-b6e5-59c0-6ca4372b2d84@gmail.com Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-12Merge branch 'aarch64/for-next/debug-virtual' into aarch64/for-next/coreGravatar Will Deacon 3-19/+15
Merge core DEBUG_VIRTUAL changes from Laura Abbott. Later arm and arm64 support depends on these. * aarch64/for-next/debug-virtual: drivers: firmware: psci: Use __pa_symbol for kernel symbol mm/usercopy: Switch to using lm_alias mm/kasan: Switch to using __pa_symbol and lm_alias kexec: Switch to __pa_symbol mm: Introduce lm_alias mm/cma: Cleanup highmem check lib/Kconfig.debug: Add ARCH_HAS_DEBUG_VIRTUAL
2017-01-11mm/usercopy: Switch to using lm_aliasGravatar Laura Abbott 1-2/+2
The usercopy checking code currently calls __va(__pa(...)) to check for aliases on symbols. Switch to using lm_alias instead. Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-01-11mm/kasan: Switch to using __pa_symbol and lm_aliasGravatar Laura Abbott 1-7/+8
__pa_symbol is the correct API to find the physical address of symbols. Switch to it to allow for debugging APIs to work correctly. Other functions such as p*d_populate may call __pa internally. Ensure that the address passed is in the linear region by calling lm_alias. Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-01-11mm/cma: Cleanup highmem checkGravatar Laura Abbott 1-10/+5
6b101e2a3ce4 ("mm/CMA: fix boot regression due to physical address of high_memory") added checks to use __pa_nodebug on x86 since CONFIG_DEBUG_VIRTUAL complains about high_memory not being linearlly mapped. arm64 is now getting support for CONFIG_DEBUG_VIRTUAL as well. Rather than add an explosion of arches to the #ifdef, switch to an alternate method to calculate the physical start of highmem using the page before highmem starts. This avoids the need for the #ifdef and extra __pa_nodebug calls. Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-01-10mm/hugetlb.c: fix reservation race when freeing surplus pagesGravatar Mike Kravetz 1-9/+28
return_unused_surplus_pages() decrements the global reservation count, and frees any unused surplus pages that were backing the reservation. Commit 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()") added a call to cond_resched_lock in the loop freeing the pages. As a result, the hugetlb_lock could be dropped, and someone else could use the pages that will be freed in subsequent iterations of the loop. This could result in inconsistent global hugetlb page state, application api failures (such as mmap) failures or application crashes. When dropping the lock in return_unused_surplus_pages, make sure that the global reservation count (resv_huge_pages) remains sufficiently large to prevent someone else from claiming pages about to be freed. Analyzed by Paul Cassella. Fixes: 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()") Link: http://lkml.kernel.org/r/1483991767-6879-1-git-send-email-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Paul Cassella <cassella@cray.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: <stable@vger.kernel.org> [3.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10mm/slab.c: fix SLAB freelist randomization duplicate entriesGravatar John Sperbeck 1-4/+4
This patch fixes a bug in the freelist randomization code. When a high random number is used, the freelist will contain duplicate entries. It will result in different allocations sharing the same chunk. It will result in odd behaviours and crashes. It should be uncommon but it depends on the machines. We saw it happening more often on some machines (every few hours of running tests). Fixes: c7ce4f60ac19 ("mm: SLAB freelist randomization") Link: http://lkml.kernel.org/r/20170103181908.143178-1-thgarnie@google.com Signed-off-by: John Sperbeck <jsperbeck@google.com> Signed-off-by: Thomas Garnier <thgarnie@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10mm: support anonymous stable pageGravatar Minchan Kim 1-1/+19
During developemnt for zram-swap asynchronous writeback, I found strange corruption of compressed page, resulting in: Modules linked in: zram(E) CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G E 4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 task: ffff88007620b840 task.stack: ffff880078090000 RIP: set_freeobj.part.43+0x1c/0x1f RSP: 0018:ffff880078093ca8 EFLAGS: 00010246 RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8 RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246 RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000 R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88 R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0 Call Trace: obj_malloc+0x22b/0x260 zs_malloc+0x1e4/0x580 zram_bvec_rw+0x4cd/0x830 [zram] page_requests_rw+0x9c/0x130 [zram] zram_thread+0xe6/0x173 [zram] kthread+0xca/0xe0 ret_from_fork+0x25/0x30 With investigation, it reveals currently stable page doesn't support anonymous page. IOW, reuse_swap_page can reuse the page without waiting writeback completion so it can overwrite page zram is compressing. Unfortunately, zram has used per-cpu stream feature from v4.7. It aims for increasing cache hit ratio of scratch buffer for compressing. Downside of that approach is that zram should ask memory space for compressed page in per-cpu context which requires stricted gfp flag which could be failed. If so, it retries to allocate memory space out of per-cpu context so it could get memory this time and compress the data again, copies it to the memory space. In this scenario, zram assumes the data should never be changed but it is not true unless stable page supports. So, If the data is changed under us, zram can make buffer overrun because second compression size could be bigger than one we got in previous trial and blindly, copy bigger size object to smaller buffer which is buffer overrun. The overrun breaks zsmalloc free object chaining so system goes crash like above. I think below is same problem. https://bugzilla.suse.com/show_bug.cgi?id=997574 Unfortunately, reuse_swap_page should be atomic so that we cannot wait on writeback in there so the approach in this patch is simply return false if we found it needs stable page. Although it increases memory footprint temporarily, it happens rarely and it should be reclaimed easily althoug it happened. Also, It would be better than waiting of IO completion, which is critial path for application latency. Fixes: da9556a2367c ("zram: user per-cpu compression streams") Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Hyeoncheol Lee <cheol.lee@lge.com> Cc: <yjay.kim@lge.com> Cc: Sangseok Lee <sangseok.lee@lge.com> Cc: <stable@vger.kernel.org> [4.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10mm: rename __page_frag functions to __page_frag_cache, drop order from drainGravatar Alexander Duyck 1-6/+7
This patch does two things. First it goes through and renames the __page_frag prefixed functions to __page_frag_cache so that we can be clear that we are draining or refilling the cache, not the frags themselves. Second we drop the order parameter from __page_frag_cache_drain since we don't actually need to pass it since all fragments are either order 0 or must be a compound page. Link: http://lkml.kernel.org/r/20170104023954.13451.5678.stgit@localhost.localdomain Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to ↵Gravatar Alexander Duyck 1-5/+5
page_frag_free Patch series "Page fragment updates", v4. This patch series takes care of a few cleanups for the page fragments API. First we do some renames so that things are much more consistent. First we move the page_frag_ portion of the name to the front of the functions names. Secondly we split out the cache specific functions from the other page fragment functions by adding the word "cache" to the name. Finally I added a bit of documentation that will hopefully help to explain some of this. I plan to revisit this later as we get things more ironed out in the near future with the changes planned for the DMA setup to support eXpress Data Path. This patch (of 3): This patch renames the page frag functions to be more consistent with other APIs. Specifically we place the name page_frag first in the name and then have either an alloc or free call name that we append as the suffix. This makes it a bit clearer in terms of naming. In addition we drop the leading double underscores since we are technically no longer a backing interface and instead the front end that is called from the networking APIs. Link: http://lkml.kernel.org/r/20170104023854.13451.67390.stgit@localhost.localdomain Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>