aboutsummaryrefslogtreecommitdiff
path: root/ipc
AgeCommit message (Collapse)AuthorFilesLines
4 hoursMerge tag 'sysctl-6.10-rc1' of ↵Gravatar Linus Torvalds 2-6/+4
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl updates from Joel Granados: - Remove sentinel elements from ctl_table structs in kernel/* Removing sentinels in ctl_table arrays reduces the build time size and runtime memory consumed by ~64 bytes per array. Removals for net/, io_uring/, mm/, ipc/ and security/ are set to go into mainline through their respective subsystems making the next release the most likely place where the final series that removes the check for proc_name == NULL will land. This adds to removals already in arch/, drivers/ and fs/. - Adjust ctl_table definitions and references to allow constification - Remove unused ctl_table function arguments - Move non-const elements from ctl_table to ctl_table_header - Make ctl_table pointers const in ctl_table_root structure Making the static ctl_table structs const will increase safety by keeping the pointers to proc_handler functions in .rodata. Though no ctl_tables where made const in this PR, the ground work for making that possible has started with these changes sent by Thomas Weißschuh. * tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: sysctl: drop now unnecessary out-of-bounds check sysctl: move sysctl type to ctl_table_header sysctl: drop sysctl_is_perm_empty_ctl_table sysctl: treewide: constify argument ctl_table_root::permissions(table) sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table) bpf: Remove the now superfluous sentinel elements from ctl_table array delayacct: Remove the now superfluous sentinel elements from ctl_table array kprobes: Remove the now superfluous sentinel elements from ctl_table array printk: Remove the now superfluous sentinel elements from ctl_table array scheduler: Remove the now superfluous sentinel elements from ctl_table array seccomp: Remove the now superfluous sentinel elements from ctl_table array timekeeping: Remove the now superfluous sentinel elements from ctl_table array ftrace: Remove the now superfluous sentinel elements from ctl_table array umh: Remove the now superfluous sentinel elements from ctl_table array kernel misc: Remove the now superfluous sentinel elements from ctl_table array
3 daysMerge tag 'net-next-6.10' of ↵Gravatar Linus Torvalds 2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Complete rework of garbage collection of AF_UNIX sockets. AF_UNIX is prone to forming reference count cycles due to fd passing functionality. New method based on Tarjan's Strongly Connected Components algorithm should be both faster and remove a lot of workarounds we accumulated over the years. - Add TCP fraglist GRO support, allowing chaining multiple TCP packets and forwarding them together. Useful for small switches / routers which lack basic checksum offload in some scenarios (e.g. PPPoE). - Support using SMP threads for handling packet backlog i.e. packet processing from software interfaces and old drivers which don't use NAPI. This helps move the processing out of the softirq jumble. - Continue work of converting from rtnl lock to RCU protection. Don't require rtnl lock when reading: IPv6 routing FIB, IPv6 address labels, netdev threaded NAPI sysfs files, bonding driver's sysfs files, MPLS devconf, IPv4 FIB rules, netns IDs, tcp metrics, TC Qdiscs, neighbor entries, ARP entries via ioctl(SIOCGARP), a lot of the link information available via rtnetlink. - Small optimizations from Eric to UDP wake up handling, memory accounting, RPS/RFS implementation, TCP packet sizing etc. - Allow direct page recycling in the bulk API used by XDP, for +2% PPS. - Support peek with an offset on TCP sockets. - Add MPTCP APIs for querying last time packets were received/sent/acked and whether MPTCP "upgrade" succeeded on a TCP socket. - Add intra-node communication shortcut to improve SMC performance. - Add IPv6 (and IPv{4,6}-over-IPv{4,6}) support to the GTP protocol driver. - Add HSR-SAN (RedBOX) mode of operation to the HSR protocol driver. - Add reset reasons for tracing what caused a TCP reset to be sent. - Introduce direction attribute for xfrm (IPSec) states. State can be used either for input or output packet processing. Things we sprinkled into general kernel code: - Add bitmap_{read,write}(), bitmap_size(), expose BYTES_TO_BITS(). This required touch-ups and renaming of a few existing users. - Add Endian-dependent __counted_by_{le,be} annotations. - Make building selftests "quieter" by printing summaries like "CC object.o" rather than full commands with all the arguments. Netfilter: - Use GFP_KERNEL to clone elements, to deal better with OOM situations and avoid failures in the .commit step. BPF: - Add eBPF JIT for ARCv2 CPUs. - Support attaching kprobe BPF programs through kprobe_multi link in a session mode, meaning, a BPF program is attached to both function entry and return, the entry program can decide if the return program gets executed and the entry program can share u64 cookie value with return program. "Session mode" is a common use-case for tetragon and bpftrace. - Add the ability to specify and retrieve BPF cookie for raw tracepoint programs in order to ease migration from classic to raw tracepoints. - Add an internal-only BPF per-CPU instruction for resolving per-CPU memory addresses and implement support in x86, ARM64 and RISC-V JITs. This allows inlining functions which need to access per-CPU state. - Optimize x86 BPF JIT's emit_mov_imm64, and add support for various atomics in bpf_arena which can be JITed as a single x86 instruction. Support BPF arena on ARM64. - Add a new bpf_wq API for deferring events and refactor process-context bpf_timer code to keep common code where possible. - Harden the BPF verifier's and/or/xor value tracking. - Introduce crypto kfuncs to let BPF programs call kernel crypto APIs. - Support bpf_tail_call_static() helper for BPF programs with GCC 13. - Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF program to have code sections where preemption is disabled. Driver API: - Skip software TC processing completely if all installed rules are marked as HW-only, instead of checking the HW-only flag rule by rule. - Add support for configuring PoE (Power over Ethernet), similar to the already existing support for PoDL (Power over Data Line) config. - Initial bits of a queue control API, for now allowing a single queue to be reset without disturbing packet flow to other queues. - Common (ethtool) statistics for hardware timestamping. Tests and tooling: - Remove the need to create a config file to run the net forwarding tests so that a naive "make run_tests" can exercise them. - Define a method of writing tests which require an external endpoint to communicate with (to send/receive data towards the test machine). Add a few such tests. - Create a shared code library for writing Python tests. Expose the YAML Netlink library from tools/ to the tests for easy Netlink access. - Move netfilter tests under net/, extend them, separate performance tests from correctness tests, and iron out issues found by running them "on every commit". - Refactor BPF selftests to use common network helpers. - Further work filling in YAML definitions of Netlink messages for: nftables, team driver, bonding interfaces, vlan interfaces, VF info, TC u32 mark, TC police action. - Teach Python YAML Netlink to decode attribute policies. - Extend the definition of the "indexed array" construct in the specs to cover arrays of scalars rather than just nests. - Add hyperlinks between definitions in generated Netlink docs. Drivers: - Make sure unsupported flower control flags are rejected by drivers, and make more drivers report errors directly to the application rather than dmesg (large number of driver changes from Asbjørn Sloth Tønnesen). - Ethernet high-speed NICs: - Broadcom (bnxt): - support multiple RSS contexts and steering traffic to them - support XDP metadata - make page pool allocations more NUMA aware - Intel (100G, ice, idpf): - extract datapath code common among Intel drivers into a library - use fewer resources in switchdev by sharing queues with the PF - add PFCP filter support - add Ethernet filter support - use a spinlock instead of HW lock in PTP clock ops - support 5 layer Tx scheduler topology - nVidia/Mellanox: - 800G link modes and 100G SerDes speeds - per-queue IRQ coalescing configuration - Marvell Octeon: - support offloading TC packet mark action - Ethernet NICs consumer, embedded and virtual: - stop lying about skb->truesize in USB Ethernet drivers, it messes up TCP memory calculations - Google cloud vNIC: - support changing ring size via ethtool - support ring reset using the queue control API - VirtIO net: - expose flow hash from RSS to XDP - per-queue statistics - add selftests - Synopsys (stmmac): - support controllers which require an RX clock signal from the MII bus to perform their hardware initialization - TI: - icssg_prueth: support ICSSG-based Ethernet on AM65x SR1.0 devices - icssg_prueth: add SW TX / RX Coalescing based on hrtimers - cpsw: minimal XDP support - Renesas (ravb): - support describing the MDIO bus - Realtek (r8169): - add support for RTL8168M - Microchip Sparx5: - matchall and flower actions mirred and redirect - Ethernet switches: - nVidia/Mellanox: - improve events processing performance - Marvell: - add support for MV88E6250 family internal PHYs - Microchip: - add DCB and DSCP mapping support for KSZ switches - vsc73xx: convert to PHYLINK - Realtek: - rtl8226b/rtl8221b: add C45 instances and SerDes switching - Many driver changes related to PHYLIB and PHYLINK deprecated API cleanup - Ethernet PHYs: - Add a new driver for Airoha EN8811H 2.5 Gigabit PHY. - micrel: lan8814: add support for PPS out and external timestamp trigger - WiFi: - Disable Wireless Extensions (WEXT) in all Wi-Fi 7 devices drivers. Modern devices can only be configured using nl80211. - mac80211/cfg80211 - handle color change per link for WiFi 7 Multi-Link Operation - Intel (iwlwifi): - don't support puncturing in 5 GHz - support monitor mode on passive channels - BZ-W device support - P2P with HE/EHT support - re-add support for firmware API 90 - provide channel survey information for Automatic Channel Selection - MediaTek (mt76): - mt7921 LED control - mt7925 EHT radiotap support - mt7920e PCI support - Qualcomm (ath11k): - P2P support for QCA6390, WCN6855 and QCA2066 - support hibernation - ieee80211-freq-limit Device Tree property support - Qualcomm (ath12k): - refactoring in preparation of multi-link support - suspend and hibernation support - ACPI support - debugfs support, including dfs_simulate_radar support - RealTek: - rtw88: RTL8723CS SDIO device support - rtw89: RTL8922AE Wi-Fi 7 PCI device support - rtw89: complete features of new WiFi 7 chip 8922AE including BT-coexistence and Wake-on-WLAN - rtw89: use BIOS ACPI settings to set TX power and channels - rtl8xxxu: enable Management Frame Protection (MFP) support - Bluetooth: - support for Intel BlazarI and Filmore Peak2 (BE201) - support for MediaTek MT7921S SDIO - initial support for Intel PCIe BT driver - remove HCI_AMP support" * tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1827 commits) selftests: netfilter: fix packetdrill conntrack testcase net: gro: fix napi_gro_cb zeroed alignment Bluetooth: btintel_pcie: Refactor and code cleanup Bluetooth: btintel_pcie: Fix warning reported by sparse Bluetooth: hci_core: Fix not handling hdev->le_num_of_adv_sets=1 Bluetooth: btintel: Fix compiler warning for multi_v7_defconfig config Bluetooth: btintel_pcie: Fix compiler warnings Bluetooth: btintel_pcie: Add *setup* function to download firmware Bluetooth: btintel_pcie: Add support for PCIe transport Bluetooth: btintel: Export few static functions Bluetooth: HCI: Remove HCI_AMP support Bluetooth: L2CAP: Fix div-by-zero in l2cap_le_flowctl_init() Bluetooth: qca: Fix error code in qca_read_fw_build_info() Bluetooth: hci_conn: Use __counted_by() and avoid -Wfamnae warning Bluetooth: btintel: Add support for Filmore Peak2 (BE201) Bluetooth: btintel: Add support for BlazarI LE Create Connection command timeout increased to 20 secs dt-bindings: net: bluetooth: Add MediaTek MT7921S SDIO Bluetooth Bluetooth: compute LE flow credits based on recvbuf space Bluetooth: hci_sync: Use cmd->num_cis instead of magic number ...
2024-04-24sysctl: treewide: constify argument ctl_table_root::permissions(table)Gravatar Thomas Weißschuh 2-2/+2
The permissions callback should not modify the ctl_table. Enforce this expectation via the typesystem. This is a step to put "struct ctl_table" into .rodata throughout the kernel. The patch was created with the following coccinelle script: @@ identifier func, head, ctl; @@ int func( struct ctl_table_header *head, - struct ctl_table *ctl) + const struct ctl_table *ctl) { ... } (insert_entry() from fs/proc/proc_sysctl.c is a false-positive) No additional occurrences of '.permissions =' were found after a tree-wide search for places missed by the conccinelle script. Reviewed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-04-24sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table)Gravatar Thomas Weißschuh 2-4/+2
Remove the 'table' argument from set_ownership as it is never used. This change is a step towards putting "struct ctl_table" into .rodata and eventually having sysctl core only use "const struct ctl_table". The patch was created with the following coccinelle script: @@ identifier func, head, table, uid, gid; @@ void func( struct ctl_table_header *head, - struct ctl_table *table, kuid_t *uid, kgid_t *gid) { ... } No additional occurrences of 'set_ownership' were found after doing a tree-wide search. Reviewed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-04-22sysctl: treewide: constify ctl_table_header::ctl_table_argGravatar Thomas Weißschuh 2-2/+2
To be able to constify instances of struct ctl_tables it is necessary to remove ways through which non-const versions are exposed from the sysctl core. One of these is the ctl_table_arg member of struct ctl_table_header. Constify this reference as a prerequisite for the full constification of struct ctl_table instances. No functional change. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-09fs: Add FOP_HUGE_PAGESGravatar Matthew Wilcox (Oracle) 1-7/+3
Instead of checking for specific file_operations, add a bit to file_operations which denotes a file that only contain hugetlb pages. This lets us make hugetlbfs_file_operations static, and removes is_file_shm_hugepages() completely. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240407201122.3783877-1-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-18Merge tag 'sysctl-6.9-rc1' of ↵Gravatar Linus Torvalds 1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl updates from Joel Granados: "No functional changes - additional testing is required for the rest of the pending changes. - New shared repo for sysctl maintenance - check-sysctl-docs adjustment for API changes by Thomas Weißschuh" * tag 'sysctl-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: scripts: check-sysctl-docs: handle per-namespace sysctls ipc: remove linebreaks from arguments of __register_sysctl_table scripts: check-sysctl-docs: adapt to new API MAINTAINERS: Update sysctl tree location
2024-02-23ipc: remove linebreaks from arguments of __register_sysctl_tableGravatar Thomas Weißschuh 1-2/+1
Calls to __register_sysctl_table will be validated by scripts/check-sysctl-docs. As this script is line-based remove the linebreak which would confuse the script. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-02-22sysctl: allow to change limits for posix messages queuesGravatar Alexey Gladkov 1-0/+36
All parameters of posix messages queues (queues_max/msg_max/msgsize_max) end up being limited by RLIMIT_MSGQUEUE. The code in mqueue_get_inode is where that limiting happens. The RLIMIT_MSGQUEUE is bound to the user namespace and is counted hierarchically. We can allow root in the user namespace to modify the posix messages queues parameters. Link: https://lkml.kernel.org/r/6ad67f23d1459a4f4339f74aa73bac0ecf3995e1.1705333426.git.legion@kernel.org Signed-off-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Link: https://lkml.kernel.org/r/7eb21211c8622e91d226e63416b1b93c079f60ee.1663756794.git.legion@kernel.org Cc: Christian Brauner <brauner@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Joel Granados <joel.granados@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22sysctl: allow change system v ipc sysctls inside ipc namespaceGravatar Alexey Gladkov 1-2/+35
Patch series "Allow to change ipc/mq sysctls inside ipc namespace", v3. Right now ipc and mq limits count as per ipc namespace, but only real root can change them. By default, the current values of these limits are such that it can only be reduced. Since only root can change the values, it is impossible to reduce these limits in the rootless container. We can allow limit changes within ipc namespace because mq parameters are limited by RLIMIT_MSGQUEUE and ipc parameters are not limited to anything other than cgroups. This patch (of 3): Rootless containers are not allowed to modify kernel IPC parameters. All default limits are set to such high values that in fact there are no limits at all. All limits are not inherited and are initialized to default values when a new ipc_namespace is created. For new ipc_namespace: size_t ipc_ns.shm_ctlmax = SHMMAX; // (ULONG_MAX - (1UL << 24)) size_t ipc_ns.shm_ctlall = SHMALL; // (ULONG_MAX - (1UL << 24)) int ipc_ns.shm_ctlmni = IPCMNI; // (1 << 15) int ipc_ns.shm_rmid_forced = 0; unsigned int ipc_ns.msg_ctlmax = MSGMAX; // 8192 unsigned int ipc_ns.msg_ctlmni = MSGMNI; // 32000 unsigned int ipc_ns.msg_ctlmnb = MSGMNB; // 16384 The shm_tot (total amount of shared pages) has also ceased to be global, it is located in ipc_namespace and is not inherited from anywhere. In such conditions, it cannot be said that these limits limit anything. The real limiter for them is cgroups. If we allow rootless containers to change these parameters, then it can only be reduced. Link: https://lkml.kernel.org/r/cover.1705333426.git.legion@kernel.org Link: https://lkml.kernel.org/r/d2f4603305cbfed58a24755aa61d027314b73a45.1705333426.git.legion@kernel.org Signed-off-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Link: https://lkml.kernel.org/r/e2d84d3ec0172cfff759e6065da84ce0cc2736f8.1663756794.git.legion@kernel.org Cc: Christian Brauner <brauner@kernel.org> Cc: Joel Granados <joel.granados@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-20shm: Slim down dependenciesGravatar Kent Overstreet 1-0/+1
list_head is in types.h, not list.h., and the uapi header wasn't needed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-20sched.h: move pid helpers to pid.hGravatar Kent Overstreet 1-0/+1
This is needed for killing the sched.h dependency on rcupdate.h, and pid.h is a better place for this code anyways. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-02Merge tag 'mm-stable-2023-11-01-14-33' of ↵Gravatar Linus Torvalds 1-13/+8
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2023-10-25mempolicy: alloc_pages_mpol() for NUMA policy without vmaGravatar Hugh Dickins 1-13/+8
Shrink shmem's stack usage by eliminating the pseudo-vma from its folio allocation. alloc_pages_mpol(gfp, order, pol, ilx, nid) becomes the principal actor for passing mempolicy choice down to __alloc_pages(), rather than vma_alloc_folio(gfp, order, vma, addr, hugepage). vma_alloc_folio() and alloc_pages() remain, but as wrappers around alloc_pages_mpol(). alloc_pages_bulk_*() untouched, except to provide the additional args to policy_nodemask(), which subsumes policy_node(). Cleanup throughout, cutting out some unhelpful "helpers". It would all be much simpler without MPOL_INTERLEAVE, but that adds a dynamic to the constant mpol: complicated by v3.6 commit 09c231cb8bfd ("tmpfs: distribute interleave better across nodes"), which added ino bias to the interleave, hidden from mm/mempolicy.c until this commit. Hence "ilx" throughout, the "interleave index". Originally I thought it could be done just with nid, but that's wrong: the nodemask may come from the shared policy layer below a shmem vma, or it may come from the task layer above a shmem vma; and without the final nodemask then nodeid cannot be decided. And how ilx is applied depends also on page order. The interleave index is almost always irrelevant unless MPOL_INTERLEAVE: with one exception in alloc_pages_mpol(), where the NO_INTERLEAVE_INDEX passed down from vma-less alloc_pages() is also used as hint not to use THP-style hugepage allocation - to avoid the overhead of a hugepage arg (though I don't understand why we never just added a GFP bit for THP - if it actually needs a different allocation strategy from other pages of the same order). vma_alloc_folio() still carries its hugepage arg here, but it is not used, and should be removed when agreed. get_vma_policy() no longer allows a NULL vma: over time I believe we've eradicated all the places which used to need it e.g. swapoff and madvise used to pass NULL vma to read_swap_cache_async(), but now know the vma. [hughd@google.com: handle NULL mpol being passed to __read_swap_cache_async()] Link: https://lkml.kernel.org/r/ea419956-4751-0102-21f7-9c93cb957892@google.com Link: https://lkml.kernel.org/r/74e34633-6060-f5e3-aee-7040d43f2e93@google.com Link: https://lkml.kernel.org/r/1738368e-bac0-fd11-ed7f-b87142a939fe@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Domenico Cerasuolo <mimmocerasuolo@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18ipc: convert to new timestamp accessorsGravatar Jeff Layton 1-9/+10
Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-78-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-31Merge tag 'x86_shstk_for_6.6-rc1' of ↵Gravatar Linus Torvalds 1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 shadow stack support from Dave Hansen: "This is the long awaited x86 shadow stack support, part of Intel's Control-flow Enforcement Technology (CET). CET consists of two related security features: shadow stacks and indirect branch tracking. This series implements just the shadow stack part of this feature, and just for userspace. The main use case for shadow stack is providing protection against return oriented programming attacks. It works by maintaining a secondary (shadow) stack using a special memory type that has protections against modification. When executing a CALL instruction, the processor pushes the return address to both the normal stack and to the special permission shadow stack. Upon RET, the processor pops the shadow stack copy and compares it to the normal stack copy. For more information, refer to the links below for the earlier versions of this patch set" Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/ Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/ * tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits) x86/shstk: Change order of __user in type x86/ibt: Convert IBT selftest to asm x86/shstk: Don't retry vm_munmap() on -EINTR x86/kbuild: Fix Documentation/ reference x86/shstk: Move arch detail comment out of core mm x86/shstk: Add ARCH_SHSTK_STATUS x86/shstk: Add ARCH_SHSTK_UNLOCK x86: Add PTRACE interface for shadow stack selftests/x86: Add shadow stack test x86/cpufeatures: Enable CET CR4 bit for shadow stack x86/shstk: Wire in shadow stack interface x86: Expose thread features in /proc/$PID/status x86/shstk: Support WRSS for userspace x86/shstk: Introduce map_shadow_stack syscall x86/shstk: Check that signal frame is shadow stack mem x86/shstk: Check that SSP is aligned on sigreturn x86/shstk: Handle signals for shadow stack x86/shstk: Introduce routines modifying shstk x86/shstk: Handle thread shadow stack x86/shstk: Add user-mode shadow stack support ...
2023-08-29Merge tag 'sysctl-6.6-rc1' of ↵Gravatar Linus Torvalds 2-2/+6
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "Long ago we set out to remove the kitchen sink on kernel/sysctl.c arrays and placings sysctls to their own sybsystem or file to help avoid merge conflicts. Matthew Wilcox pointed out though that if we're going to do that we might as well also *save* space while at it and try to remove the extra last sysctl entry added at the end of each array, a sentintel, instead of bloating the kernel by adding a new sentinel with each array moved. Doing that was not so trivial, and has required slowing down the moves of kernel/sysctl.c arrays and measuring the impact on size by each new move. The complex part of the effort to help reduce the size of each sysctl is being done by the patient work of el señor Don Joel Granados. A lot of this is truly painful code refactoring and testing and then trying to measure the savings of each move and removing the sentinels. Although Joel already has code which does most of this work, experience with sysctl moves in the past shows is we need to be careful due to the slew of odd build failures that are possible due to the amount of random Kconfig options sysctls use. To that end Joel's work is split by first addressing the major housekeeping needed to remove the sentinels, which is part of this merge request. The rest of the work to actually remove the sentinels will be done later in future kernel releases. The preliminary math is showing this will all help reduce the overall build time size of the kernel and run time memory consumed by the kernel by about ~64 bytes per array where we are able to remove each sentinel in the future. That also means there is no more bloating the kernel with the extra ~64 bytes per array moved as no new sentinels are created" * tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: sysctl: Use ctl_table_size as stopping criteria for list macro sysctl: SIZE_MAX->ARRAY_SIZE in register_net_sysctl vrf: Update to register_net_sysctl_sz networking: Update to register_net_sysctl_sz netfilter: Update to register_net_sysctl_sz ax.25: Update to register_net_sysctl_sz sysctl: Add size to register_net_sysctl function sysctl: Add size arg to __register_sysctl_init sysctl: Add size to register_sysctl sysctl: Add a size arg to __register_sysctl_table sysctl: Add size argument to init_header sysctl: Add ctl_table_size to ctl_table_header sysctl: Use ctl_table_header in list_for_each_table_entry sysctl: Prefer ctl_table_header in proc_sysctl
2023-08-29Merge tag 'mm-nonmm-stable-2023-08-28-22-48' of ↵Gravatar Linus Torvalds 1-4/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - An extensive rework of kexec and crash Kconfig from Eric DeVolder ("refactor Kconfig to consolidate KEXEC and CRASH options") - kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a couple of macros to args.h") - gdb feature work from Kuan-Ying Lee ("Add GDB memory helper commands") - vsprintf inclusion rationalization from Andy Shevchenko ("lib/vsprintf: Rework header inclusions") - Switch the handling of kdump from a udev scheme to in-kernel handling, by Eric DeVolder ("crash: Kernel handling of CPU and memory hot un/plug") - Many singleton patches to various parts of the tree * tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (81 commits) document while_each_thread(), change first_tid() to use for_each_thread() drivers/char/mem.c: shrink character device's devlist[] array x86/crash: optimize CPU changes crash: change crash_prepare_elf64_headers() to for_each_possible_cpu() crash: hotplug support for kexec_load() x86/crash: add x86 crash hotplug support crash: memory and CPU hotplug sysfs attributes kexec: exclude elfcorehdr from the segment digest crash: add generic infrastructure for crash hotplug support crash: move a few code bits to setup support of crash hotplug kstrtox: consistently use _tolower() kill do_each_thread() nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse scripts/bloat-o-meter: count weak symbol sizes treewide: drop CONFIG_EMBEDDED lockdep: fix static memory detection even more lib/vsprintf: declare no_hash_pointers in sprintf.h lib/vsprintf: split out sprintf() and friends kernel/fork: stop playing lockless games for exe_file replacement adfs: delete unused "union adfs_dirtail" definition ...
2023-08-18ipc/sem: use flexible array in 'struct sem_undo'Gravatar Christophe JAILLET 1-4/+2
Turn 'semadj' in 'struct sem_undo' into a flexible array. The advantages are: - save the size of a pointer when the new undo structure is allocated - avoid some always ugly pointer arithmetic to get the address of semadj - avoid an indirection when the array is accessed While at it, use struct_size() to compute the size of the new undo structure. Link: https://lkml.kernel.org/r/1ba993d443ad7e16ac2b1902adab1f05ebdfa454.1688918791.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-15sysctl: Add a size arg to __register_sysctl_tableGravatar Joel Granados 2-2/+6
We make these changes in order to prepare __register_sysctl_table and its callers for when we remove the sentinel element (empty element at the end of ctl_table arrays). We don't actually remove any sentinels in this commit, but we *do* make sure to use ARRAY_SIZE so the table_size is available when the removal occurs. We add a table_size argument to __register_sysctl_table and adjust callers, all of which pass ctl_table pointers and need an explicit call to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in order to forward the size of the array pointer received from the network register calls. The new table_size argument does not yet have any effect in the init_header call which is still dependent on the sentinel's presence. table_size *does* however drive the `kzalloc` allocation in __register_sysctl_table with no adverse effects as the allocated memory is either one element greater than the calculated ctl_table array (for the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size of the calculated ctl_table array (for the call from sysctl_net.c and register_sysctl). This approach will allows us to "just" remove the sentinel without further changes to __register_sysctl_table as table_size will represent the exact size for all the callers at that point. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-07-24mqueue: convert to ctime accessor functionsGravatar Jeff Layton 1-12/+11
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-83-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-11mm: Re-introduce vm_flags to do_mmap()Gravatar Yu-cheng Yu 1-1/+1
There was no more caller passing vm_flags to do_mmap(), and vm_flags was removed from the function's input by: commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()"). There is a new user now. Shadow stack allocation passes VM_SHADOW_STACK to do_mmap(). Thus, re-introduce vm_flags to do_mmap(). Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/all/20230613001108.3040476-5-rick.p.edgecombe%40intel.com
2023-02-24Merge branch 'work.namespace' of ↵Gravatar Linus Torvalds 3-16/+26
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull ipc namespace update from Al Viro: "Rik's patches reducing the amount of synchronize_rcu() triggered by ipc namespace destruction. I've some pending stuff reducing that on the normal umount side, but it's nowhere near ready and Rik's stuff shouldn't be held back due to conflicts - I'll just redo the parts of my series that stray into ipc/*" * 'work.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ipc,namespace: batch free ipc_namespace structures ipc,namespace: make ipc namespace allocation wait for pending free
2023-02-23Merge tag 'mm-stable-2023-02-20-13-37' of ↵Gravatar Linus Torvalds 1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-02-09ipc/shm: introduce new do_vma_munmap() to munmapGravatar Liam R. Howlett 1-5/+6
The shm already has the vma iterator in position for a write. do_vmi_munmap() searches for the correct position and aligns the write, so it is not the right function to use in this case. The shm VMA tree modification is similar to the brk munmap situation, the vma iterator is in position and the VMA is already known. This patch generalizes the brk munmap function do_brk_munmap() to be used for any other callers with the vma iterator already in position to munmap a VMA. Link: https://lkml.kernel.org/r/20230126212049.980501-1-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Sven Schnelle <svens@linux.ibm.com> Link: https://lore.kernel.org/linux-mm/yt9dh6wec21a.fsf@linux.ibm.com/ Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09ipc/shm: use the vma iterator for munmap callsGravatar Liam R. Howlett 1-6/+5
Pass through the vma iterator to do_vmi_munmap() to handle the iterator state internally Link: https://lkml.kernel.org/r/20230120162650.984577-16-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-27ipc,namespace: batch free ipc_namespace structuresGravatar Rik van Riel 3-10/+10
Instead of waiting for an RCU grace period between each ipc_namespace structure that is being freed, wait an RCU grace period for every batch of ipc_namespace structures. Thanks to Al Viro for the suggestion of the helper function. This speeds up the run time of the test case that allocates ipc_namespaces in a loop from 6 minutes, to a little over 1 second: real 0m1.192s user 0m0.038s sys 0m1.152s Signed-off-by: Rik van Riel <riel@surriel.com> Reported-by: Chris Mason <clm@meta.com> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-01-27ipc,namespace: make ipc namespace allocation wait for pending freeGravatar Rik van Riel 1-6/+16
Currently the ipc namespace allocation will fail when there are ipc_namespace structures pending to be freed. This results in the simple test case below, as well as some real world workloads, to get allocation failures even when the number of ipc namespaces in actual use is way below the limit. int main() { int i; for (i = 0; i < 100000; i++) { if (unshare(CLONE_NEWIPC) < 0) error(EXIT_FAILURE, errno, "unshare"); } } Make the allocation of an ipc_namespace wait for pending frees, so it will succeed. real 6m19.197s user 0m0.041s sys 0m1.019s Signed-off-by: Rik van Riel <riel@surriel.com> Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-01-19fs: port ->permission() to pass mnt_idmapGravatar Christian Brauner 1-1/+1
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19fs: port ->create() to pass mnt_idmapGravatar Christian Brauner 1-1/+1
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-18fs: port vfs_*() helpers to struct mnt_idmapGravatar Christian Brauner 1-1/+1
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-12-12Merge tag 'mm-nonmm-stable-2022-12-12' of ↵Gravatar Linus Torvalds 1-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - A ptrace API cleanup series from Sergey Shtylyov - Fixes and cleanups for kexec from ye xingchen - nilfs2 updates from Ryusuke Konishi - squashfs feature work from Xiaoming Ni: permit configuration of the filesystem's compression concurrency from the mount command line - A series from Akinobu Mita which addresses bound checking errors when writing to debugfs files - A series from Yang Yingliang to address rapidio memory leaks - A series from Zheng Yejian to address possible overflow errors in encode_comp_t() - And a whole shower of singleton patches all over the place * tag 'mm-nonmm-stable-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (79 commits) ipc: fix memory leak in init_mqueue_fs() hfsplus: fix bug causing custom uid and gid being unable to be assigned with mount rapidio: devices: fix missing put_device in mport_cdev_open kcov: fix spelling typos in comments hfs: Fix OOB Write in hfs_asc2mac hfs: fix OOB Read in __hfs_brec_find relay: fix type mismatch when allocating memory in relay_create_buf() ocfs2: always read both high and low parts of dinode link count io-mapping: move some code within the include guarded section kernel: kcsan: kcsan_test: build without structleak plugin mailmap: update email for Iskren Chernev eventfd: change int to __u64 in eventfd_signal() ifndef CONFIG_EVENTFD rapidio: fix possible UAF when kfifo_alloc() fails relay: use strscpy() is more robust and safer cpumask: limit visibility of FORCE_NR_CPUS acct: fix potential integer overflow in encode_comp_t() acct: fix accuracy loss for input value of encode_comp_t() linux/init.h: include <linux/build_bug.h> and <linux/stringify.h> rapidio: rio: fix possible name leak in rio_register_mport() rapidio: fix possible name leaks when rio_add_device() fails ...
2022-12-11ipc: fix memory leak in init_mqueue_fs()Gravatar Zhengchao Shao 1-2/+4
When setup_mq_sysctls() failed in init_mqueue_fs(), mqueue_inode_cachep is not released. In order to fix this issue, the release path is reordered. Link: https://lkml.kernel.org/r/20221209092929.1978875-1-shaozhengchao@huawei.com Fixes: dc55e35f9e81 ("ipc: Store mqueue sysctls in the ipc namespace") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Cc: Alexey Gladkov <legion@kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jingyu Wang <jingyuwang_vip@163.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Wei Yongjun <weiyongjun1@huawei.com> Cc: YueHaibing <yuehaibing@huawei.com> Cc: Yu Zhe <yuzhe@nfschina.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-05ipc/sem: Fix dangling sem_array access in semtimedop raceGravatar Jann Horn 1-1/+2
When __do_semtimedop() goes to sleep because it has to wait for a semaphore value becoming zero or becoming bigger than some threshold, it links the on-stack sem_queue to the sem_array, then goes to sleep without holding a reference on the sem_array. When __do_semtimedop() comes back out of sleep, one of two things must happen: a) We prove that the on-stack sem_queue has been disconnected from the (possibly freed) sem_array, making it safe to return from the stack frame that the sem_queue exists in. b) We stabilize our reference to the sem_array, lock the sem_array, and detach the sem_queue from the sem_array ourselves. sem_array has RCU lifetime, so for case (b), the reference can be stabilized inside an RCU read-side critical section by locklessly checking whether the sem_queue is still connected to the sem_array. However, the current code does the lockless check on sem_queue before starting an RCU read-side critical section, so the result of the lockless check immediately becomes useless. Fix it by doing rcu_read_lock() before the lockless check. Now RCU ensures that if we observe the object being on our queue, the object can't be freed until rcu_read_unlock(). This bug is only hittable on kernel builds with full preemption support (either CONFIG_PREEMPT or PREEMPT_DYNAMIC with preempt=full). Fixes: 370b262c896e ("ipc/sem: avoid idr tree lookup for interrupted semop") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-11-22ipc/shm: call underlying open/close vm_opsGravatar Mike Kravetz 1-9/+25
Shared memory segments can be created that are backed by hugetlb pages. When this happens, the vmas associated with any mappings (shmat) are marked VM_HUGETLB, yet the vm_ops for such mappings are provided by ipc/shm (shm_vm_ops). There is a mechanism to call the underlying hugetlb vm_ops, and this is done for most operations. However, it is not done for open and close. This was not an issue until the introduction of the hugetlb vma_lock. This lock structure is pointed to by vm_private_data and the open/close vm_ops help maintain this structure. The special hugetlb routine called at fork took care of structure updates at fork time. However, vma_splitting is not properly handled for ipc shared memory mappings backed by hugetlb pages. This can result in a "kernel NULL pointer dereference" BUG or use after free as two vmas point to the same lock structure. Update the shm open and close routines to always call the underlying open and close routines. Link: https://lkml.kernel.org/r/20221114210018.49346-1-mike.kravetz@oracle.com Fixes: 8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Doug Nelson <doug.nelson@intel.com> Reported-by: <syzbot+83b4134621b7c326d950@syzkaller.appspotmail.com> Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-28ipc/msg.c: fix percpu_counter use after freeGravatar Andrew Morton 1-2/+2
These percpu counters are referenced in free_ipcs->freeque, so destroy them later. Fixes: 72d1e611082e ("ipc/msg: mitigate the lock contention with percpu counter") Reported-by: syzbot+96e659d35b9d6b541152@syzkaller.appspotmail.com Tested-by: Mark Rutland <mark.rutland@arm.com> Cc: Jiebin Sun <jiebin.sun@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12Merge tag 'mm-nonmm-stable-2022-10-11' of ↵Gravatar Linus Torvalds 5-40/+73
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - hfs and hfsplus kmap API modernization (Fabio Francesco) - make crash-kexec work properly when invoked from an NMI-time panic (Valentin Schneider) - ntfs bugfixes (Hawkins Jiawei) - improve IPC msg scalability by replacing atomic_t's with percpu counters (Jiebin Sun) - nilfs2 cleanups (Minghao Chi) - lots of other single patches all over the tree! * tag 'mm-nonmm-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits) include/linux/entry-common.h: remove has_signal comment of arch_do_signal_or_restart() prototype proc: test how it holds up with mapping'less process mailmap: update Frank Rowand email address ia64: mca: use strscpy() is more robust and safer init/Kconfig: fix unmet direct dependencies ia64: update config files nilfs2: replace WARN_ONs by nilfs_error for checkpoint acquisition failure fork: remove duplicate included header files init/main.c: remove unnecessary (void*) conversions proc: mark more files as permanent nilfs2: remove the unneeded result variable nilfs2: delete unnecessary checks before brelse() checkpatch: warn for non-standard fixes tag style usr/gen_init_cpio.c: remove unnecessary -1 values from int file ipc/msg: mitigate the lock contention with percpu counter percpu: add percpu_counter_add_local and percpu_counter_sub_local fs/ocfs2: fix repeated words in comments relay: use kvcalloc to alloc page array in relay_alloc_page_array proc: make config PROC_CHILDREN depend on PROC_FS fs: uninline inode_maybe_inc_iversion() ...
2022-10-10Merge tag 'mm-stable-2022-10-08' of ↵Gravatar Linus Torvalds 1-11/+10
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ...
2022-10-09Merge tag 'retire_mq_sysctls-for-v5.19' of ↵Gravatar Linus Torvalds 1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull mqueue fix from Eric Biederman: "A fix for an unlikely but possible memory leak" * tag 'retire_mq_sysctls-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ipc: mqueue: fix possible memory leak in init_mqueue_fs()
2022-10-03ipc/msg: mitigate the lock contention with percpu counterGravatar Jiebin Sun 3-17/+40
The msg_bytes and msg_hdrs atomic counters are frequently updated when IPC msg queue is in heavy use, causing heavy cache bounce and overhead. Change them to percpu_counter greatly improve the performance. Since there is one percpu struct per namespace, additional memory cost is minimal. Reading of the count done in msgctl call, which is infrequent. So the need to sum up the counts in each CPU is infrequent. Apply the patch and test the pts/stress-ng-1.4.0 -- system v message passing (160 threads). Score gain: 3.99x CPU: ICX 8380 x 2 sockets Core number: 40 x 2 physical cores Benchmark: pts/stress-ng-1.4.0 -- system v message passing (160 threads) [akpm@linux-foundation.org: coding-style cleanups] [jiebin.sun@intel.com: avoid negative value by overflow in msginfo] Link: https://lkml.kernel.org/r/20220920150809.4014944-1-jiebin.sun@intel.com [akpm@linux-foundation.org: fix min() warnings] Link: https://lkml.kernel.org/r/20220913192538.3023708-3-jiebin.sun@intel.com Signed-off-by: Jiebin Sun <jiebin.sun@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> Cc: Alexey Gladkov <legion@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Averin <vasily.averin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03ipc: mqueue: remove unnecessary conditionalsGravatar Jingyu Wang 1-2/+1
iput() already handles null and non-null parameters, so there is no need to use if(). Link: https://lkml.kernel.org/r/20220908185452.76590-1-jingyuwang_vip@163.com Signed-off-by: Jingyu Wang <jingyuwang_vip@163.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26ipc/shm: use VMA iterator instead of linked listGravatar Liam R. Howlett 1-11/+10
The VMA iterator is faster than the linked llist, and it can be walked even when VMAs are being removed from the address space, so there's no need to keep track of 'next'. Link: https://lkml.kernel.org/r/20220906194824.2110408-46-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11ipc/util.c: cleanup and improve sysvipc_find_ipc()Gravatar Manfred Spraul 1-21/+32
sysvipc_find_ipc() can be simplified further: - It uses a for() loop to locate the next entry in the idr. This can be replaced with idr_get_next(). - It receives two parameters (pos - which is actually an idr index and not a position, and new_pos, which is really a position). One parameter is sufficient. Link: https://lore.kernel.org/all/20210903052020.3265-3-manfred@colorfullife.com/ Link: https://lkml.kernel.org/r/20220805115733.104763-1-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Waiman Long <longman@redhat.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: <1vier1@web.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-07Merge tag 'mm-nonmm-stable-2022-08-06-2' of ↵Gravatar Linus Torvalds 1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc updates from Andrew Morton: "Updates to various subsystems which I help look after. lib, ocfs2, fatfs, autofs, squashfs, procfs, etc. A relatively small amount of material this time" * tag 'mm-nonmm-stable-2022-08-06-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits) scripts/gdb: ensure the absolute path is generated on initial source MAINTAINERS: kunit: add David Gow as a maintainer of KUnit mailmap: add linux.dev alias for Brendan Higgins mailmap: update Kirill's email profile: setup_profiling_timer() is moslty not implemented ocfs2: fix a typo in a comment ocfs2: use the bitmap API to simplify code ocfs2: remove some useless functions lib/mpi: fix typo 'the the' in comment proc: add some (hopefully) insightful comments bdi: remove enum wb_congested_state kernel/hung_task: fix address space of proc_dohung_task_timeout_secs lib/lzo/lzo1x_compress.c: replace ternary operator with min() and min_t() squashfs: support reading fragments in readahead call squashfs: implement readahead squashfs: always build "file direct" version of page actor Revert "squashfs: provide backing_dev_info in order to disable read-ahead" fs/ocfs2: Fix spelling typo in comment ia64: old_rr4 added under CONFIG_HUGETLB_PAGE proc: fix test for "vsyscall=xonly" boot option ...
2022-07-19ipc: mqueue: fix possible memory leak in init_mqueue_fs()Gravatar Hangyu Hua 1-0/+1
commit db7cfc380900 ("ipc: Free mq_sysctls if ipc namespace creation failed") Here's a similar memory leak to the one fixed by the patch above. retire_mq_sysctls need to be called when init_mqueue_fs fails after setup_mq_sysctls. Fixes: dc55e35f9e81 ("ipc: Store mqueue sysctls in the ipc namespace") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Link: https://lkml.kernel.org/r/20220715062301.19311-1-hbh25y@gmail.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2022-07-17ipc/mqueue: remove unnecessary (void*) conversionGravatar Yu Zhe 1-1/+1
Remove unnecessary void* type casting. Link: https://lkml.kernel.org/r/20220628021251.17197-1-yuzhe@nfschina.com Signed-off-by: Yu Zhe <yuzhe@nfschina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-22ipc: Free mq_sysctls if ipc namespace creation failedGravatar Alexey Gladkov 1-1/+4
The problem that Dmitry Vyukov pointed out is that if setup_ipc_sysctls fails, mq_sysctls must be freed before return. executing program BUG: memory leak unreferenced object 0xffff888112fc9200 (size 512): comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s) hex dump (first 32 bytes): ef d3 60 85 ff ff ff ff 0c 9b d2 12 81 88 ff ff ..`............. 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff814b6eb3>] kmemdup+0x23/0x50 mm/util.c:129 [<ffffffff82219a9b>] kmemdup include/linux/fortify-string.h:456 [inline] [<ffffffff82219a9b>] setup_mq_sysctls+0x4b/0x1c0 ipc/mq_sysctl.c:89 [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline] [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91 [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90 [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226 [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165 [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline] [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline] [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234 [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 BUG: memory leak unreferenced object 0xffff888112fd5f00 (size 256): comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s) hex dump (first 32 bytes): 00 92 fc 12 81 88 ff ff 00 00 00 00 01 00 00 00 ................ 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff816fea1b>] kmalloc include/linux/slab.h:605 [inline] [<ffffffff816fea1b>] kzalloc include/linux/slab.h:733 [inline] [<ffffffff816fea1b>] __register_sysctl_table+0x7b/0x7f0 fs/proc/proc_sysctl.c:1344 [<ffffffff82219b7a>] setup_mq_sysctls+0x12a/0x1c0 ipc/mq_sysctl.c:112 [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline] [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91 [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90 [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226 [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165 [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline] [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline] [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234 [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 BUG: memory leak unreferenced object 0xffff888112fbba00 (size 256): comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s) hex dump (first 32 bytes): 78 ba fb 12 81 88 ff ff 00 00 00 00 01 00 00 00 x............... 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff816fef49>] kmalloc include/linux/slab.h:605 [inline] [<ffffffff816fef49>] kzalloc include/linux/slab.h:733 [inline] [<ffffffff816fef49>] new_dir fs/proc/proc_sysctl.c:978 [inline] [<ffffffff816fef49>] get_subdir fs/proc/proc_sysctl.c:1022 [inline] [<ffffffff816fef49>] __register_sysctl_table+0x5a9/0x7f0 fs/proc/proc_sysctl.c:1373 [<ffffffff82219b7a>] setup_mq_sysctls+0x12a/0x1c0 ipc/mq_sysctl.c:112 [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline] [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91 [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90 [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226 [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165 [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline] [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline] [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234 [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 BUG: memory leak unreferenced object 0xffff888112fbb900 (size 256): comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s) hex dump (first 32 bytes): 78 b9 fb 12 81 88 ff ff 00 00 00 00 01 00 00 00 x............... 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff816fef49>] kmalloc include/linux/slab.h:605 [inline] [<ffffffff816fef49>] kzalloc include/linux/slab.h:733 [inline] [<ffffffff816fef49>] new_dir fs/proc/proc_sysctl.c:978 [inline] [<ffffffff816fef49>] get_subdir fs/proc/proc_sysctl.c:1022 [inline] [<ffffffff816fef49>] __register_sysctl_table+0x5a9/0x7f0 fs/proc/proc_sysctl.c:1373 [<ffffffff82219b7a>] setup_mq_sysctls+0x12a/0x1c0 ipc/mq_sysctl.c:112 [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline] [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91 [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90 [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226 [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165 [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline] [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline] [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234 [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 Reported-by: syzbot+b4b0d1b35442afbf6fd2@syzkaller.appspotmail.com Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/000000000000f5004705e1db8bad@google.com Link: https://lkml.kernel.org/r/20220622200729.2639663-1-legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2022-06-03Merge tag 'per-namespace-ipc-sysctls-for-v5.19' of ↵Gravatar Linus Torvalds 4-141/+205
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ipc sysctl namespace updates from Eric Biederman: "This updates the ipc sysctls so that they are fundamentally per ipc namespace. Previously these sysctls depended upon a hack to simulate being per ipc namespace by looking up the ipc namespace in read or write. With this set of changes the ipc sysctls are registered per ipc namespace and open looks up the ipc namespace. Not only does this series of changes ensure the traditional binding at open time happens, but it sets a foundation for being able to relax the permission checks to allow a user namspace root to change the ipc sysctls for an ipc namespace that the user namespace root requires. To do this requires the ipc namespace to be known at open time" * tag 'per-namespace-ipc-sysctls-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ipc: Remove extra braces ipc: Check permissions for checkpoint_restart sysctls at open time ipc: Remove extra1 field abuse to pass ipc namespace ipc: Use the same namespace to modify and validate ipc: Store ipc sysctls in the ipc namespace ipc: Store mqueue sysctls in the ipc namespace
2022-05-09ipc/mqueue: use get_tree_nodev() in mqueue_get_tree()Gravatar Waiman Long 1-0/+14
When running the stress-ng clone benchmark with multiple testing threads, it was found that there were significant spinlock contention in sget_fc(). The contended spinlock was the sb_lock. It is under heavy contention because the following code in the critcal section of sget_fc(): hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) { if (test(old, fc)) goto share_extant_sb; } After testing with added instrumentation code, it was found that the benchmark could generate thousands of ipc namespaces with the corresponding number of entries in the mqueue's fs_supers list where the namespaces are the key for the search. This leads to excessive time in scanning the list for a match. Looking back at the mqueue calling sequence leading to sget_fc(): mq_init_ns() => mq_create_mount() => fc_mount() => vfs_get_tree() => mqueue_get_tree() => get_tree_keyed() => vfs_get_super() => sget_fc() Currently, mq_init_ns() is the only mqueue function that will indirectly call mqueue_get_tree() with a newly allocated ipc namespace as the key for searching. As a result, there will never be a match with the exising ipc namespaces stored in the mqueue's fs_supers list. So using get_tree_keyed() to do an existing ipc namespace search is just a waste of time. Instead, we could use get_tree_nodev() to eliminate the useless search. By doing so, we can greatly reduce the sb_lock hold time and avoid the spinlock contention problem in case a large number of ipc namespaces are present. Of course, if the code is modified in the future to allow mqueue_get_tree() to be called with an existing ipc namespace instead of a new one, we will have to use get_tree_keyed() in this case. The following stress-ng clone benchmark command was run on a 2-socket 48-core Intel system: ./stress-ng --clone 32 --verbose --oomable --metrics-brief -t 20 The "bogo ops/s" increased from 5948.45 before patch to 9137.06 after patch. This is an increase of 54% in performance. Link: https://lkml.kernel.org/r/20220121172315.19652-1-longman@redhat.com Fixes: 935c6912b198 ("ipc: Convert mqueue fs to fs_context") Signed-off-by: Waiman Long <longman@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-09ipc: update semtimedop() to use hrtimerGravatar Prakash Sangappa 1-12/+11
semtimedop() should be converted to use hrtimer like it has been done for most of the system calls with timeouts. This system call already takes a struct timespec as an argument and can therefore provide finer granularity timed wait. Link: https://lkml.kernel.org/r/1651187881-2858-1-git-send-email-prakash.sangappa@oracle.com Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>