aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2022-11-30ext4: convert move_extent_per_page() to use foliosGravatar Vishal Moola (Oracle) 1-21/+31
Patch series "Removing the try_to_release_page() wrapper", v3. This patchset replaces the remaining calls of try_to_release_page() with the folio equivalent: filemap_release_folio(). This allows us to remove the wrapper. This patch (of 4): Convert move_extent_per_page() to use folios. This change removes 5 calls to compound_head() and is in preparation for the removal of the try_to_release_page() wrapper. Link: https://lkml.kernel.org/r/20221118073055.55694-1-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20221118073055.55694-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30hugetlbfs: inode: remove unnecessary (void*) conversionsGravatar Li zeming 1-1/+1
The ei pointer does not need to cast the type. Link: https://lkml.kernel.org/r/20221107015659.3221-1-zeming@nfschina.com Signed-off-by: Li zeming <zeming@nfschina.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: anonymous shared memory namingGravatar Pasha Tatashin 1-4/+11
Since commit 9a10064f5625 ("mm: add a field to store names for private anonymous memory"), name for private anonymous memory, but not shared anonymous, can be set. However, naming shared anonymous memory just as useful for tracking purposes. Extend the functionality to be able to set names for shared anon. There are two ways to create anonymous shared memory, using memfd or directly via mmap(): 1. fd = memfd_create(...) mem = mmap(..., MAP_SHARED, fd, ...) 2. mem = mmap(..., MAP_SHARED | MAP_ANONYMOUS, -1, ...) In both cases the anonymous shared memory is created the same way by mapping an unlinked file on tmpfs. The memfd way allows to give a name for anonymous shared memory, but not useful when parts of shared memory require to have distinct names. Example use case: The VMM maps VM memory as anonymous shared memory (not private because VMM is sandboxed and drivers are running in their own processes). However, the VM tells back to the VMM how parts of the memory are actually used by the guest, how each of the segments should be backed (i.e. 4K pages, 2M pages), and some other information about the segments. The naming allows us to monitor the effective memory footprint for each of these segments from the host without looking inside the guest. Sample output: /* Create shared anonymous segmenet */ anon_shmem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); /* Name the segment: "MY-NAME" */ rv = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, anon_shmem, SIZE, "MY-NAME"); cat /proc/<pid>/maps (and smaps): 7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 [anon_shmem:MY-NAME] If the segment is not named, the output is: 7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 /dev/zero (deleted) Link: https://lkml.kernel.org/r/20221115020602.804224-1-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Colin Cross <ccross@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Vincent Whitchurch <vincent.whitchurch@axis.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: xu xin <cgel.zte@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30Merge branch 'mm-hotfixes-stable' into mm-stableGravatar Andrew Morton 9-32/+49
2022-11-30nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry()Gravatar ZhangPeng 1-0/+7
Syzbot reported a null-ptr-deref bug: NILFS (loop0): segctord starting. Construction interval = 5 seconds, CP frequency < 30 seconds general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] CPU: 1 PID: 3603 Comm: segctord Not tainted 6.1.0-rc2-syzkaller-00105-gb229b6ca5abb #0 Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 RIP: 0010:nilfs_palloc_commit_free_entry+0xe5/0x6b0 fs/nilfs2/alloc.c:608 Code: 00 00 00 00 fc ff df 80 3c 02 00 0f 85 cd 05 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b 73 08 49 8d 7e 10 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 26 05 00 00 49 8b 46 10 be a6 00 00 00 48 c7 c7 RSP: 0018:ffffc90003dff830 EFLAGS: 00010212 RAX: dffffc0000000000 RBX: ffff88802594e218 RCX: 000000000000000d RDX: 0000000000000002 RSI: 0000000000002000 RDI: 0000000000000010 RBP: ffff888071880222 R08: 0000000000000005 R09: 000000000000003f R10: 000000000000000d R11: 0000000000000000 R12: ffff888071880158 R13: ffff88802594e220 R14: 0000000000000000 R15: 0000000000000004 FS: 0000000000000000(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb1c08316a8 CR3: 0000000018560000 CR4: 0000000000350ee0 Call Trace: <TASK> nilfs_dat_commit_free fs/nilfs2/dat.c:114 [inline] nilfs_dat_commit_end+0x464/0x5f0 fs/nilfs2/dat.c:193 nilfs_dat_commit_update+0x26/0x40 fs/nilfs2/dat.c:236 nilfs_btree_commit_update_v+0x87/0x4a0 fs/nilfs2/btree.c:1940 nilfs_btree_commit_propagate_v fs/nilfs2/btree.c:2016 [inline] nilfs_btree_propagate_v fs/nilfs2/btree.c:2046 [inline] nilfs_btree_propagate+0xa00/0xd60 fs/nilfs2/btree.c:2088 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_file_data+0x45/0xd0 fs/nilfs2/segment.c:568 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1018 nilfs_segctor_scan_file+0x3f4/0x6f0 fs/nilfs2/segment.c:1067 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1197 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1503 [inline] nilfs_segctor_do_construct+0x12fc/0x6af0 fs/nilfs2/segment.c:2045 nilfs_segctor_construct+0x8e3/0xb30 fs/nilfs2/segment.c:2379 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2487 [inline] nilfs_segctor_thread+0x3c3/0xf30 fs/nilfs2/segment.c:2570 kthread+0x2e4/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 </TASK> ... If DAT metadata file is corrupted on disk, there is a case where req->pr_desc_bh is NULL and blocknr is 0 at nilfs_dat_commit_end() during a b-tree operation that cascadingly updates ancestor nodes of the b-tree, because nilfs_dat_commit_alloc() for a lower level block can initialize the blocknr on the same DAT entry between nilfs_dat_prepare_end() and nilfs_dat_commit_end(). If this happens, nilfs_dat_commit_end() calls nilfs_dat_commit_free() without valid buffer heads in req->pr_desc_bh and req->pr_bitmap_bh, and causes the NULL pointer dereference above in nilfs_palloc_commit_free_entry() function, which leads to a crash. Fix this by adding a NULL check on req->pr_desc_bh and req->pr_bitmap_bh before nilfs_palloc_commit_free_entry() in nilfs_dat_commit_free(). This also calls nilfs_error() in that case to notify that there is a fatal flaw in the filesystem metadata and prevent further operations. Link: https://lkml.kernel.org/r/00000000000097c20205ebaea3d6@google.com Link: https://lkml.kernel.org/r/20221114040441.1649940-1-zhangpeng362@huawei.com Link: https://lkml.kernel.org/r/20221119120542.17204-1-konishi.ryusuke@gmail.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+ebe05ee8e98f755f61d0@syzkaller.appspotmail.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirtyGravatar Chen Zhongjin 1-0/+8
When extending segments, nilfs_sufile_alloc() is called to get an unassigned segment, then mark it as dirty to avoid accidentally allocating the same segment in the future. But for some special cases such as a corrupted image it can be unreliable. If such corruption of the dirty state of the segment occurs, nilfs2 may reallocate a segment that is in use and pick the same segment for writing twice at the same time. This will cause the problem reported by syzkaller: https://syzkaller.appspot.com/bug?id=c7c4748e11ffcc367cef04f76e02e931833cbd24 This case started with segbuf1.segnum = 3, nextnum = 4 when constructed. It supposed segment 4 has already been allocated and marked as dirty. However the dirty state was corrupted and segment 4 usage was not dirty. For the first time nilfs_segctor_extend_segments() segment 4 was allocated again, which made segbuf2 and next segbuf3 had same segment 4. sb_getblk() will get same bh for segbuf2 and segbuf3, and this bh is added to both buffer lists of two segbuf. It makes the lists broken which causes NULL pointer dereference. Fix the problem by setting usage as dirty every time in nilfs_sufile_mark_dirty(), which is called during constructing current segment to be written out and before allocating next segment. [chenzhongjin@huawei.com: add lock protection per Ryusuke] Link: https://lkml.kernel.org/r/20221121091141.214703-1-chenzhongjin@huawei.com Link: https://lkml.kernel.org/r/20221118063304.140187-1-chenzhongjin@huawei.com Fixes: 9ff05123e3bf ("nilfs2: segment constructor") Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Reported-by: <syzbot+77e4f0...@syzkaller.appspotmail.com> Reported-by: Liu Shixin <liushixin2@huawei.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22proc/meminfo: fix spacing in SecPageTablesGravatar Yosry Ahmed 1-1/+1
SecPageTables has a tab after it instead of a space, this can break fragile parsers that depend on spaces after the stat names. Link: https://lkml.kernel.org/r/20221117043247.133294-1-yosryahmed@google.com Fixes: ebc97a52b5d6cd5f ("mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.") Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08hugetlbfs: fix null-ptr-deref in hugetlbfs_parse_param()Gravatar Hawkins Jiawei 1-3/+3
Syzkaller reports a null-ptr-deref bug as follows: ====================================================== KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:hugetlbfs_parse_param+0x1dd/0x8e0 fs/hugetlbfs/inode.c:1380 [...] Call Trace: <TASK> vfs_parse_fs_param fs/fs_context.c:148 [inline] vfs_parse_fs_param+0x1f9/0x3c0 fs/fs_context.c:129 vfs_parse_fs_string+0xdb/0x170 fs/fs_context.c:191 generic_parse_monolithic+0x16f/0x1f0 fs/fs_context.c:231 do_new_mount fs/namespace.c:3036 [inline] path_mount+0x12de/0x1e20 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __x64_sys_mount+0x27f/0x300 fs/namespace.c:3568 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd [...] </TASK> ====================================================== According to commit "vfs: parse: deal with zero length string value", kernel will set the param->string to null pointer in vfs_parse_fs_string() if fs string has zero length. Yet the problem is that, hugetlbfs_parse_param() will dereference the param->string, without checking whether it is a null pointer. To be more specific, if hugetlbfs_parse_param() parses an illegal mount parameter, such as "size=,", kernel will constructs struct fs_parameter with null pointer in vfs_parse_fs_string(), then passes this struct fs_parameter to hugetlbfs_parse_param(), which triggers the above null-ptr-deref bug. This patch solves it by adding sanity check on param->string in hugetlbfs_parse_param(). Link: https://lkml.kernel.org/r/20221020231609.4810-1-yin31149@gmail.com Reported-by: syzbot+a3e6acd85ded5c16a709@syzkaller.appspotmail.com Tested-by: syzbot+a3e6acd85ded5c16a709@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/0000000000005ad00405eb7148c6@google.com/ Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hawkins Jiawei <yin31149@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08mm: remove kern_addr_valid() completelyGravatar Kefeng Wang 1-17/+9
Most architectures (except arm64/x86/sparc) simply return 1 for kern_addr_valid(), which is only used in read_kcore(), and it calls copy_from_kernel_nofault() which could check whether the address is a valid kernel address. So as there is no need for kern_addr_valid(), let's remove it. Link: https://lkml.kernel.org/r/20221018074014.185687-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Acked-by: Heiko Carstens <hca@linux.ibm.com> [s390] Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Helge Deller <deller@gmx.de> [parisc] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Guo Ren <guoren@kernel.org> [csky] Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: <aou@eecs.berkeley.edu> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Chris Zankel <chris@zankel.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Xuerui Wang <kernel@xen0n.name> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08memory: move hotplug memory notifier priority to same file for easy sortingGravatar Liu Shixin 1-1/+1
The priority of hotplug memory callback is defined in a different file. And there are some callers using numbers directly. Collect them together into include/linux/memory.h for easy reading. This allows us to sort their priorities more intuitively without additional comments. Link: https://lkml.kernel.org/r/20220923033347.3935160-9-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Waiman Long <longman@redhat.com> Cc: zefan li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08fs/proc/kcore.c: use hotplug_memory_notifier() directlyGravatar Liu Shixin 1-6/+1
Commit 76ae847497bc52 ("Documentation: raise minimum supported version of GCC to 5.1") updated the minimum gcc version to 5.1. So the problem mentioned in f02c69680088 ("include/linux/memory.h: implement register_hotmemory_notifier()") no longer exist. So we can now switch to use hotplug_memory_notifier() directly rather than register_hotmemory_notifier(). Link: https://lkml.kernel.org/r/20220923033347.3935160-3-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Waiman Long <longman@redhat.com> Cc: zefan li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08hugetlbfs: convert hugetlb_delete_from_page_cache() to use foliosGravatar Sidhartha Kumar 1-6/+6
Remove the last caller of delete_from_page_cache() by converting the code to its folio equivalent. Link: https://lkml.kernel.org/r/20220922154207.1575343-5-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Colin Cross <ccross@google.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08mm/hugetlb: add hugetlb_folio_subpool() helpersGravatar Sidhartha Kumar 1-4/+4
Allow hugetlbfs_migrate_folio to check and read subpool information by passing in a folio. Link: https://lkml.kernel.org/r/20220922154207.1575343-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Colin Cross <ccross@google.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Hugh Dickins <hughd@google.com> Cc: kernel test robot <lkp@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08fs: fix leaked psi pressure stateGravatar Johannes Weiner 2-13/+19
When psi annotations were added to to btrfs compression reads, the psi state tracking over add_ra_bio_pages and btrfs_submit_compressed_read was faulty. A pressure state, once entered, is never left. This results in incorrectly elevated pressure, which triggers OOM kills. pflags record the *previous* memstall state when we enter a new one. The code tried to initialize pflags to 1, and then optimize the leave call when we either didn't enter a memstall, or were already inside a nested stall. However, there can be multiple PageWorkingset pages in the bio, at which point it's that path itself that enters repeatedly and overwrites pflags. This causes us to miss the exit. Enter the stall only once if needed, then unwind correctly. erofs has the same problem, fix that up too. And move the memstall exit past submit_bio() to restore submit accounting originally added by b8e24a9300b0 ("block: annotate refault stalls from IO submission"). Link: https://lkml.kernel.org/r/Y2UHRqthNUwuIQGS@cmpxchg.org Fixes: 4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads") Fixes: 99486c511f68 ("erofs: add manual PSI accounting for the compressed address space") Fixes: 118f3663fbc6 ("block: remove PSI accounting from the bio layer") Link: https://lore.kernel.org/r/d20a0a85-e415-cf78-27f9-77dd7a94bc8d@leemhuis.info/ Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Thorsten Leemhuis <linux@leemhuis.info> Tested-by: Thorsten Leemhuis <linux@leemhuis.info> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Sterba <dsterba@suse.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08nilfs2: fix use-after-free bug of ns_writer on remountGravatar Ryusuke Konishi 2-9/+8
If a nilfs2 filesystem is downgraded to read-only due to metadata corruption on disk and is remounted read/write, or if emergency read-only remount is performed, detaching a log writer and synchronizing the filesystem can be done at the same time. In these cases, use-after-free of the log writer (hereinafter nilfs->ns_writer) can happen as shown in the scenario below: Task1 Task2 -------------------------------- ------------------------------ nilfs_construct_segment nilfs_segctor_sync init_wait init_waitqueue_entry add_wait_queue schedule nilfs_remount (R/W remount case) nilfs_attach_log_writer nilfs_detach_log_writer nilfs_segctor_destroy kfree finish_wait _raw_spin_lock_irqsave __raw_spin_lock_irqsave do_raw_spin_lock debug_spin_lock_before <-- use-after-free While Task1 is sleeping, nilfs->ns_writer is freed by Task2. After Task1 waked up, Task1 accesses nilfs->ns_writer which is already freed. This scenario diagram is based on the Shigeru Yoshida's post [1]. This patch fixes the issue by not detaching nilfs->ns_writer on remount so that this UAF race doesn't happen. Along with this change, this patch also inserts a few necessary read-only checks with superblock instance where only the ns_writer pointer was used to check if the filesystem is read-only. Link: https://syzkaller.appspot.com/bug?id=79a4c002e960419ca173d55e863bd09e8112df8b Link: https://lkml.kernel.org/r/20221103141759.1836312-1-syoshida@redhat.com [1] Link: https://lkml.kernel.org/r/20221104142959.28296-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+f816fa82f8783f7a02bb@syzkaller.appspotmail.com Reported-by: Shigeru Yoshida <syoshida@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08nilfs2: fix deadlock in nilfs_count_free_blocks()Gravatar Ryusuke Konishi 1-2/+0
A semaphore deadlock can occur if nilfs_get_block() detects metadata corruption while locating data blocks and a superblock writeback occurs at the same time: task 1 task 2 ------ ------ * A file operation * nilfs_truncate() nilfs_get_block() down_read(rwsem A) <-- nilfs_bmap_lookup_contig() ... generic_shutdown_super() nilfs_put_super() * Prepare to write superblock * down_write(rwsem B) <-- nilfs_cleanup_super() * Detect b-tree corruption * nilfs_set_log_cursor() nilfs_bmap_convert_error() nilfs_count_free_blocks() __nilfs_error() down_read(rwsem A) <-- nilfs_set_error() down_write(rwsem B) <-- *** DEADLOCK *** Here, nilfs_get_block() readlocks rwsem A (= NILFS_MDT(dat_inode)->mi_sem) and then calls nilfs_bmap_lookup_contig(), but if it fails due to metadata corruption, __nilfs_error() is called from nilfs_bmap_convert_error() inside the lock section. Since __nilfs_error() calls nilfs_set_error() unless the filesystem is read-only and nilfs_set_error() attempts to writelock rwsem B (= nilfs->ns_sem) to write back superblock exclusively, hierarchical lock acquisition occurs in the order rwsem A -> rwsem B. Now, if another task starts updating the superblock, it may writelock rwsem B during the lock sequence above, and can deadlock trying to readlock rwsem A in nilfs_count_free_blocks(). However, there is actually no need to take rwsem A in nilfs_count_free_blocks() because it, within the lock section, only reads a single integer data on a shared struct with nilfs_sufile_get_ncleansegs(). This has been the case after commit aa474a220180 ("nilfs2: add local variable to cache the number of clean segments"), that is, even before this bug was introduced. So, this resolves the deadlock problem by just not taking the semaphore in nilfs_count_free_blocks(). Link: https://lkml.kernel.org/r/20221029044912.9139-1-konishi.ryusuke@gmail.com Fixes: e828949e5b42 ("nilfs2: call nilfs_error inside bmap routines") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+45d6ce7b7ad7ef455d03@syzkaller.appspotmail.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> [2.6.38+ Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08hugetlbfs: don't delete error page from pagecacheGravatar James Houghton 1-7/+6
This change is very similar to the change that was made for shmem [1], and it solves the same problem but for HugeTLBFS instead. Currently, when poison is found in a HugeTLB page, the page is removed from the page cache. That means that attempting to map or read that hugepage in the future will result in a new hugepage being allocated instead of notifying the user that the page was poisoned. As [1] states, this is effectively memory corruption. The fix is to leave the page in the page cache. If the user attempts to use a poisoned HugeTLB page with a syscall, the syscall will fail with EIO, the same error code that shmem uses. For attempts to map the page, the thread will get a BUS_MCEERR_AR SIGBUS. [1]: commit a76054266661 ("mm: shmem: don't truncate page if memory failure happens") Link: https://lkml.kernel.org/r/20221018200125.848471-1-jthoughton@google.com Signed-off-by: James Houghton <jthoughton@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-06Merge tag 'ext4_for_linus_stable' of ↵Gravatar Linus Torvalds 6-7/+21
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix a number of bugs, including some regressions, the most serious of which was one which would cause online resizes to fail with file systems with metadata checksums enabled. Also fix a warning caused by the newly added fortify string checker, plus some bugs that were found using fuzzed file systems" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix fortify warning in fs/ext4/fast_commit.c:1551 ext4: fix wrong return err in ext4_load_and_init_journal() ext4: fix warning in 'ext4_da_release_space' ext4: fix BUG_ON() when directory entry has invalid rec_len ext4: update the backup superblock's at the end of the online resize
2022-11-06Merge tag '6.1-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Gravatar Linus Torvalds 6-62/+105
Pull cifs fixes from Steve French: "One symlink handling fix and two fixes foir multichannel issues with iterating channels, including for oplock breaks when leases are disabled" * tag '6.1-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix use-after-free on the link name cifs: avoid unnecessary iteration of tcp sessions cifs: always iterate smb sessions using primary channel
2022-11-06ext4: fix fortify warning in fs/ext4/fast_commit.c:1551Gravatar Theodore Ts'o 1-2/+3
With the new fortify string system, rework the memcpy to avoid this warning: memcpy: detected field-spanning write (size 60) of single field "&raw_inode->i_generation" at fs/ext4/fast_commit.c:1551 (size 4) Cc: stable@kernel.org Fixes: 54d9469bc515 ("fortify: Add run-time WARN for cross-field memcpy()") Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-06ext4: fix wrong return err in ext4_load_and_init_journal()Gravatar Jason Yan 1-1/+1
The return value is wrong in ext4_load_and_init_journal(). The local variable 'err' need to be initialized before goto out. The original code in __ext4_fill_super() is fine because it has two return values 'ret' and 'err' and 'ret' is initialized as -EINVAL. After we factor out ext4_load_and_init_journal(), this code is broken. So fix it by directly returning -EINVAL in the error handler path. Cc: stable@kernel.org Fixes: 9c1dd22d7422 ("ext4: factor out ext4_load_and_init_journal()") Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221025040206.3134773-1-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-06ext4: fix warning in 'ext4_da_release_space'Gravatar Ye Bin 1-1/+2
Syzkaller report issue as follows: EXT4-fs (loop0): Free/Dirty block details EXT4-fs (loop0): free_blocks=0 EXT4-fs (loop0): dirty_blocks=0 EXT4-fs (loop0): Block reservation details EXT4-fs (loop0): i_reserved_data_blocks=0 EXT4-fs warning (device loop0): ext4_da_release_space:1527: ext4_da_release_space: ino 18, to_free 1 with only 0 reserved data blocks ------------[ cut here ]------------ WARNING: CPU: 0 PID: 92 at fs/ext4/inode.c:1528 ext4_da_release_space+0x25e/0x370 fs/ext4/inode.c:1524 Modules linked in: CPU: 0 PID: 92 Comm: kworker/u4:4 Not tainted 6.0.0-syzkaller-09423-g493ffd6605b2 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022 Workqueue: writeback wb_workfn (flush-7:0) RIP: 0010:ext4_da_release_space+0x25e/0x370 fs/ext4/inode.c:1528 RSP: 0018:ffffc900015f6c90 EFLAGS: 00010296 RAX: 42215896cd52ea00 RBX: 0000000000000000 RCX: 42215896cd52ea00 RDX: 0000000000000000 RSI: 0000000080000001 RDI: 0000000000000000 RBP: 1ffff1100e907d96 R08: ffffffff816aa79d R09: fffff520002bece5 R10: fffff520002bece5 R11: 1ffff920002bece4 R12: ffff888021fd2000 R13: ffff88807483ecb0 R14: 0000000000000001 R15: ffff88807483e740 FS: 0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005555569ba628 CR3: 000000000c88e000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ext4_es_remove_extent+0x1ab/0x260 fs/ext4/extents_status.c:1461 mpage_release_unused_pages+0x24d/0xef0 fs/ext4/inode.c:1589 ext4_writepages+0x12eb/0x3be0 fs/ext4/inode.c:2852 do_writepages+0x3c3/0x680 mm/page-writeback.c:2469 __writeback_single_inode+0xd1/0x670 fs/fs-writeback.c:1587 writeback_sb_inodes+0xb3b/0x18f0 fs/fs-writeback.c:1870 wb_writeback+0x41f/0x7b0 fs/fs-writeback.c:2044 wb_do_writeback fs/fs-writeback.c:2187 [inline] wb_workfn+0x3cb/0xef0 fs/fs-writeback.c:2227 process_one_work+0x877/0xdb0 kernel/workqueue.c:2289 worker_thread+0xb14/0x1330 kernel/workqueue.c:2436 kthread+0x266/0x300 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 </TASK> Above issue may happens as follows: ext4_da_write_begin ext4_create_inline_data ext4_clear_inode_flag(inode, EXT4_INODE_EXTENTS); ext4_set_inode_flag(inode, EXT4_INODE_INLINE_DATA); __ext4_ioctl ext4_ext_migrate -> will lead to eh->eh_entries not zero, and set extent flag ext4_da_write_begin ext4_da_convert_inline_data_to_extent ext4_da_write_inline_data_begin ext4_da_map_blocks ext4_insert_delayed_block if (!ext4_es_scan_clu(inode, &ext4_es_is_delonly, lblk)) if (!ext4_es_scan_clu(inode, &ext4_es_is_mapped, lblk)) ext4_clu_mapped(inode, EXT4_B2C(sbi, lblk)); -> will return 1 allocated = true; ext4_es_insert_delayed_block(inode, lblk, allocated); ext4_writepages mpage_map_and_submit_extent(handle, &mpd, &give_up_on_write); -> return -ENOSPC mpage_release_unused_pages(&mpd, give_up_on_write); -> give_up_on_write == 1 ext4_es_remove_extent ext4_da_release_space(inode, reserved); if (unlikely(to_free > ei->i_reserved_data_blocks)) -> to_free == 1 but ei->i_reserved_data_blocks == 0 -> then trigger warning as above To solve above issue, forbid inode do migrate which has inline data. Cc: stable@kernel.org Reported-by: syzbot+c740bb18df70ad00952e@syzkaller.appspotmail.com Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221018022701.683489-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-06ext4: fix BUG_ON() when directory entry has invalid rec_lenGravatar Luís Henriques 1-1/+9
The rec_len field in the directory entry has to be a multiple of 4. A corrupted filesystem image can be used to hit a BUG() in ext4_rec_len_to_disk(), called from make_indexed_dir(). ------------[ cut here ]------------ kernel BUG at fs/ext4/ext4.h:2413! ... RIP: 0010:make_indexed_dir+0x53f/0x5f0 ... Call Trace: <TASK> ? add_dirent_to_buf+0x1b2/0x200 ext4_add_entry+0x36e/0x480 ext4_add_nondir+0x2b/0xc0 ext4_create+0x163/0x200 path_openat+0x635/0xe90 do_filp_open+0xb4/0x160 ? __create_object.isra.0+0x1de/0x3b0 ? _raw_spin_unlock+0x12/0x30 do_sys_openat2+0x91/0x150 __x64_sys_open+0x6c/0xa0 do_syscall_64+0x3c/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 The fix simply adds a call to ext4_check_dir_entry() to validate the directory entry, returning -EFSCORRUPTED if the entry is invalid. CC: stable@kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=216540 Signed-off-by: Luís Henriques <lhenriques@suse.de> Link: https://lore.kernel.org/r/20221012131330.32456-1-lhenriques@suse.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-04cifs: fix use-after-free on the link nameGravatar ChenXiaoSong 2-6/+25
xfstests generic/011 reported use-after-free bug as follows: BUG: KASAN: use-after-free in __d_alloc+0x269/0x859 Read of size 15 at addr ffff8880078933a0 by task dirstress/952 CPU: 1 PID: 952 Comm: dirstress Not tainted 6.1.0-rc3+ #77 Call Trace: __dump_stack+0x23/0x29 dump_stack_lvl+0x51/0x73 print_address_description+0x67/0x27f print_report+0x3e/0x5c kasan_report+0x7b/0xa8 kasan_check_range+0x1b2/0x1c1 memcpy+0x22/0x5d __d_alloc+0x269/0x859 d_alloc+0x45/0x20c d_alloc_parallel+0xb2/0x8b2 lookup_open+0x3b8/0x9f9 open_last_lookups+0x63d/0xc26 path_openat+0x11a/0x261 do_filp_open+0xcc/0x168 do_sys_openat2+0x13b/0x3f7 do_sys_open+0x10f/0x146 __se_sys_creat+0x27/0x2e __x64_sys_creat+0x55/0x6a do_syscall_64+0x40/0x96 entry_SYSCALL_64_after_hwframe+0x63/0xcd Allocated by task 952: kasan_save_stack+0x1f/0x42 kasan_set_track+0x21/0x2a kasan_save_alloc_info+0x17/0x1d __kasan_kmalloc+0x7e/0x87 __kmalloc_node_track_caller+0x59/0x155 kstrndup+0x60/0xe6 parse_mf_symlink+0x215/0x30b check_mf_symlink+0x260/0x36a cifs_get_inode_info+0x14e1/0x1690 cifs_revalidate_dentry_attr+0x70d/0x964 cifs_revalidate_dentry+0x36/0x62 cifs_d_revalidate+0x162/0x446 lookup_open+0x36f/0x9f9 open_last_lookups+0x63d/0xc26 path_openat+0x11a/0x261 do_filp_open+0xcc/0x168 do_sys_openat2+0x13b/0x3f7 do_sys_open+0x10f/0x146 __se_sys_creat+0x27/0x2e __x64_sys_creat+0x55/0x6a do_syscall_64+0x40/0x96 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 950: kasan_save_stack+0x1f/0x42 kasan_set_track+0x21/0x2a kasan_save_free_info+0x1c/0x34 ____kasan_slab_free+0x1c1/0x1d5 __kasan_slab_free+0xe/0x13 __kmem_cache_free+0x29a/0x387 kfree+0xd3/0x10e cifs_fattr_to_inode+0xb6a/0xc8c cifs_get_inode_info+0x3cb/0x1690 cifs_revalidate_dentry_attr+0x70d/0x964 cifs_revalidate_dentry+0x36/0x62 cifs_d_revalidate+0x162/0x446 lookup_open+0x36f/0x9f9 open_last_lookups+0x63d/0xc26 path_openat+0x11a/0x261 do_filp_open+0xcc/0x168 do_sys_openat2+0x13b/0x3f7 do_sys_open+0x10f/0x146 __se_sys_creat+0x27/0x2e __x64_sys_creat+0x55/0x6a do_syscall_64+0x40/0x96 entry_SYSCALL_64_after_hwframe+0x63/0xcd When opened a symlink, link name is from 'inode->i_link', but it may be reset to a new value when revalidate the dentry. If some processes get the link name on the race scenario, then UAF will happen on link name. Fix this by implementing 'get_link' interface to duplicate the link name. Fixes: 76894f3e2f71 ("cifs: improve symlink handling for smb2+") Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-11-04cifs: avoid unnecessary iteration of tcp sessionsGravatar Shyam Prasad N 3-51/+55
In a few places, we do unnecessary iterations of tcp sessions, even when the server struct is provided. The change avoids it and uses the server struct provided. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-11-04cifs: always iterate smb sessions using primary channelGravatar Shyam Prasad N 4-5/+25
smb sessions and tcons currently hang off primary channel only. Secondary channels have the lists as empty. Whenever there's a need to iterate sessions or tcons, we should use the list in the corresponding primary channel. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-11-04Merge tag 'xfs-6.1-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxGravatar Linus Torvalds 29-386/+670
Pull xfs fixes from Darrick Wong: "Dave and I had thought that this would be a very quiet cycle, but we thought wrong. At first there were the usual trickle of minor bugfixes, but then Zorro pulled -rc1 and noticed complaints about the stronger memcpy checks w.r.t. flex arrays. Analyzing how to fix that revealed a bunch of validation gaps in validating ondisk log items during recovery, and then a customer hit an infinite loop in the refcounting code on a corrupt filesystem. So. This largeish batch of fixes addresses all those problems, I hope. Summary: - Fix a UAF bug during log recovery - Fix memory leaks when mount fails - Detect corrupt bestfree information in a directory block - Fix incorrect return value type for the dax page fault handlers - Fix fortify complaints about memcpy of xfs log item objects - Strengthen inadequate validation of recovered log items - Fix incorrectly declared flex array in EFI log item structs - Log corrupt log items for debugging purposes - Fix infinite loop problems in the refcount code if the refcount btree node block keys are corrupt - Fix infinite loop problems in the refcount code if the refcount btree records suffer MSB bitflips - Add more sanity checking to continued defer ops to prevent overflows from one AG to the next or off EOFS" * tag 'xfs-6.1-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits) xfs: rename XFS_REFC_COW_START to _COWFLAG xfs: fix uninitialized list head in struct xfs_refcount_recovery xfs: fix agblocks check in the cow leftover recovery function xfs: check record domain when accessing refcount records xfs: remove XFS_FIND_RCEXT_SHARED and _COW xfs: refactor domain and refcount checking xfs: report refcount domain in tracepoints xfs: track cow/shared record domains explicitly in xfs_refcount_irec xfs: refactor refcount record usage in xchk_refcountbt_rec xfs: dump corrupt recovered log intent items to dmesg consistently xfs: move _irec structs to xfs_types.h xfs: actually abort log recovery on corrupt intent-done log items xfs: check deferred refcount op continuation parameters xfs: refactor all the EFI/EFD log item sizeof logic xfs: create a predicate to verify per-AG extents xfs: fix memcpy fortify errors in EFI log format copying xfs: make sure aglen never goes negative in xfs_refcount_adjust_extents xfs: fix memcpy fortify errors in RUI log format copying xfs: fix memcpy fortify errors in CUI log format copying xfs: fix memcpy fortify errors in BUI log format copying ...
2022-11-03Merge tag 'fuse-fixes-6.1-rc4' of ↵Gravatar Linus Torvalds 2-1/+13
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "Fix two rarely triggered but long-standing issues" * tag 'fuse-fixes-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: add file_modified() to fallocate fuse: fix readdir cache race
2022-11-03Merge tag 'for-6.1-rc3-tag' of ↵Gravatar Linus Torvalds 5-49/+91
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A batch of error handling fixes for resource leaks, fixes for nowait mode in combination with direct and buffered IO: - direct IO + dsync + nowait could miss a sync of the file after write, add handling for this combination - buffered IO + nowait should not fail with ENOSPC, only blocking IO could determine that - error handling fixes: - fix inode reserve space leak due to nowait buffered write - check the correct variable after allocation (direct IO submit) - fix inode list leak during backref walking - fix ulist freeing in self tests" * tag 'for-6.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix inode reserve space leak due to nowait buffered write btrfs: fix nowait buffered write returning -ENOSPC btrfs: remove pointless and double ulist frees in error paths of qgroup tests btrfs: fix ulist leaks in error paths of qgroup self tests btrfs: fix inode list leak during backref walking at find_parent_nodes() btrfs: fix inode list leak during backref walking at resolve_indirect_refs() btrfs: fix lost file sync on direct IO write with nowait and dsync iocb btrfs: fix a memory allocation failure test in btrfs_submit_direct
2022-11-02Merge tag 'nfs-for-6.1-2' of git://git.linux-nfs.org/projects/anna/linux-nfsGravatar Linus Torvalds 18-75/+80
Pull NFS client bugfixes from Anna Schumaker: - Fix some coccicheck warnings - Avoid memcpy() run-time warning - Fix up various state reclaim / RECLAIM_COMPLETE errors - Fix a null pointer dereference in sysfs - Fix LOCK races - Fix gss_unwrap_resp_integ() crasher - Fix zero length clones - Fix memleak when allocate slot fails * tag 'nfs-for-6.1-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: nfs4: Fix kmemleak when allocate slot failed NFSv4.2: Fixup CLONE dest file size for zero-length count SUNRPC: Fix crasher in gss_unwrap_resp_integ() NFSv4: Retry LOCK on OLD_STATEID during delegation return SUNRPC: Fix null-ptr-deref when xps sysfs alloc failed NFSv4.1: We must always send RECLAIM_COMPLETE after a reboot NFSv4.1: Handle RECLAIM_COMPLETE trunking errors NFSv4: Fix a potential state reclaim deadlock NFS: Avoid memcpy() run-time warning for struct sockaddr overflows nfs: Remove redundant null checks before kfree
2022-11-02btrfs: fix inode reserve space leak due to nowait buffered writeGravatar Filipe Manana 1-1/+3
During a nowait buffered write, if we fail to balance dirty pages we exit btrfs_buffered_write() without releasing the delalloc space reserved for an extent, resulting in leaking space from the inode's block reserve. So fix that by releasing the delalloc space for the extent when balancing dirty pages fails. Reported-by: kernel test robot <yujie.liu@intel.com> Link: https://lore.kernel.org/all/202210111304.d369bc32-yujie.liu@intel.com Fixes: 965f47aeb5de ("btrfs: make btrfs_buffered_write nowait compatible") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02btrfs: fix nowait buffered write returning -ENOSPCGravatar Filipe Manana 1-0/+3
If we are doing a buffered write in NOWAIT context and we can't reserve metadata space due to -ENOSPC, then we should return -EAGAIN so that we retry the write in a context allowed to block and do metadata reservation with flushing, which might succeed this time due to the allowed flushing. Returning -ENOSPC while in NOWAIT context simply makes some writes fail with -ENOSPC when they would likely succeed after switching from NOWAIT context to blocking context. That is unexpected behaviour and even fio complains about it with a warning like this: fio: io_u error on file /mnt/sdi/task_0.0.0: No space left on device: write offset=1535705088, buflen=65536 fio: pid=592630, err=28/file:io_u.c:1846, func=io_u error, error=No space left on device The fio's job config is this: [global] bs=64K ioengine=io_uring iodepth=1 size=2236962133 nr_files=1 filesize=2236962133 direct=0 runtime=10 fallocate=posix io_size=2236962133 group_reporting time_based [task_0] rw=randwrite directory=/mnt/sdi numjobs=4 So fix this by returning -EAGAIN if we are in NOWAIT context and the metadata reservation failed with -ENOSPC. Fixes: 304e45acdb8f ("btrfs: plumb NOWAIT through the write path") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02btrfs: remove pointless and double ulist frees in error paths of qgroup testsGravatar Filipe Manana 1-12/+4
Several places in the qgroup self tests follow the pattern of freeing the ulist pointer they passed to btrfs_find_all_roots() if the call to that function returned an error. That is pointless because that function always frees the ulist in case it returns an error. Also In some places like at test_multiple_refs(), after a call to btrfs_qgroup_account_extent() we also leave "old_roots" and "new_roots" pointing to ulists that were freed, because btrfs_qgroup_account_extent() has freed those ulists, and if after that the next call to btrfs_find_all_roots() fails, we call ulist_free() on the "old_roots" ulist again, resulting in a double free. So remove those calls to reduce the code size and avoid double ulist free in case of an error. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02btrfs: fix ulist leaks in error paths of qgroup self testsGravatar Filipe Manana 1-5/+15
In the test_no_shared_qgroup() and test_multiple_refs() qgroup self tests, if we fail to add the tree ref, remove the extent item or remove the extent ref, we are returning from the test function without freeing the "old_roots" ulist that was allocated by the previous calls to btrfs_find_all_roots(). Fix that by calling ulist_free() before returning. Fixes: 442244c96332 ("btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02btrfs: fix inode list leak during backref walking at find_parent_nodes()Gravatar Filipe Manana 1-1/+17
During backref walking, at find_parent_nodes(), if we are dealing with a data extent and we get an error while resolving the indirect backrefs, at resolve_indirect_refs(), or in the while loop that iterates over the refs in the direct refs rbtree, we end up leaking the inode lists attached to the direct refs we have in the direct refs rbtree that were not yet added to the refs ulist passed as argument to find_parent_nodes(). Since they were not yet added to the refs ulist and prelim_release() does not free the lists, on error the caller can only free the lists attached to the refs that were added to the refs ulist, all the remaining refs get their inode lists never freed, therefore leaking their memory. Fix this by having prelim_release() always free any attached inode list to each ref found in the rbtree, and have find_parent_nodes() set the ref's inode list to NULL once it transfers ownership of the inode list to a ref added to the refs ulist passed to find_parent_nodes(). Fixes: 86d5f9944252 ("btrfs: convert prelimary reference tracking to use rbtrees") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-02btrfs: fix inode list leak during backref walking at resolve_indirect_refs()Gravatar Filipe Manana 1-19/+17
During backref walking, at resolve_indirect_refs(), if we get an error we jump to the 'out' label and call ulist_free() on the 'parents' ulist, which frees all the elements in the ulist - however that does not free any inode lists that may be attached to elements, through the 'aux' field of a ulist node, so we end up leaking lists if we have any attached to the unodes. Fix this by calling free_leaf_list() instead of ulist_free() when we exit from resolve_indirect_refs(). The static function free_leaf_list() is moved up for this to be possible and it's slightly simplified by removing unnecessary code. Fixes: 3301958b7c1d ("Btrfs: add inodes before dropping the extent lock in find_all_leafs") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-01Merge tag 'nfsd-6.1-3' of ↵Gravatar Linus Torvalds 1-3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fix from Chuck Lever: - Fix a loop that occurs when using multiple net namespaces * tag 'nfsd-6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: fix net-namespace logic in __nfsd_file_cache_purge
2022-11-01nfsd: fix net-namespace logic in __nfsd_file_cache_purgeGravatar Jeff Layton 1-3/+2
If the namespace doesn't match the one in "net", then we'll continue, but that doesn't cause another rhashtable_walk_next call, so it will loop infinitely. Fixes: ce502f81ba88 ("NFSD: Convert the filecache to use rhashtable") Reported-by: Petr Vorel <pvorel@suse.cz> Link: https://lore.kernel.org/ltp/Y1%2FP8gDAcWC%2F+VR3@pevik/ Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-10-31Merge tag 'for-6.1-rc3-tag' of ↵Gravatar Linus Torvalds 10-40/+73
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes and regression fixes: - fix a corner case when handling tree-mod-log chagnes in reallocated notes - fix crash on raid0 filesystems created with <5.4 mkfs.btrfs that could lead to division by zero - add missing super block checksum verification after thawing filesystem - handle one more case in send when dealing with orphan files - fix parameter type mismatch for generation when reading dentry - improved error handling in raid56 code - better struct bio packing after recent cleanups" * tag 'for-6.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: don't use btrfs_chunk::sub_stripes from disk btrfs: fix type of parameter generation in btrfs_get_dentry btrfs: send: fix send failure of a subcase of orphan inodes btrfs: make thaw time super block check to also verify checksum btrfs: fix tree mod log mishandling of reallocated nodes btrfs: reorder btrfs_bio for better packing btrfs: raid56: avoid double freeing for rbio if full_stripe_write() failed btrfs: raid56: properly handle the error when unable to find the missing stripe
2022-10-31xfs: rename XFS_REFC_COW_START to _COWFLAGGravatar Darrick J. Wong 3-6/+6
We've been (ab)using XFS_REFC_COW_START as both an integer quantity and a bit flag, even though it's *only* a bit flag. Rename the variable to reflect its nature and update the cast target since we're not supposed to be comparing it to xfs_agblock_t now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: fix uninitialized list head in struct xfs_refcount_recoveryGravatar Darrick J. Wong 1-4/+6
We're supposed to initialize the list head of an object before adding it to another list. Fix that, and stop using the kmem_{alloc,free} calls from the Irix days. Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: fix agblocks check in the cow leftover recovery functionGravatar Darrick J. Wong 1-1/+3
As we've seen, refcount records use the upper bit of the rc_startblock field to ensure that all the refcount records are at the right side of the refcount btree. This works because an AG is never allowed to have more than (1U << 31) blocks in it. If we ever encounter a filesystem claiming to have that many blocks, we absolutely do not want reflink touching it at all. However, this test at the start of xfs_refcount_recover_cow_leftovers is slightly incorrect -- it /should/ be checking that agblocks isn't larger than the XFS_MAX_CRC_AG_BLOCKS constant, and it should check that the constant is never large enough to conflict with that CoW flag. Note that the V5 superblock verifier has not historically rejected filesystems where agblocks >= XFS_MAX_CRC_AG_BLOCKS, which is why this ended up in the COW recovery routine. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: check record domain when accessing refcount recordsGravatar Darrick J. Wong 2-14/+43
Now that we've separated the startblock and CoW/shared extent domain in the incore refcount record structure, check the domain whenever we retrieve a record to ensure that it's still in the domain that we want. Depending on the circumstances, a change in domain either means we're done processing or that we've found a corruption and need to fail out. The refcount check in xchk_xref_is_cow_staging is redundant since _get_rec has done that for a long time now, so we can get rid of it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: remove XFS_FIND_RCEXT_SHARED and _COWGravatar Darrick J. Wong 1-31/+17
Now that we have an explicit enum for shared and CoW staging extents, we can get rid of the old FIND_RCEXT flags. Omit a couple of conversions that disappear in the next patches. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: refactor domain and refcount checkingGravatar Darrick J. Wong 3-10/+17
Create a helper function to ensure that CoW staging extent records have a single refcount and that shared extent records have more than 1 refcount. We'll put this to more use in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: report refcount domain in tracepointsGravatar Darrick J. Wong 2-9/+43
Now that we've broken out the startblock and shared/cow domain in the incore refcount extent record structure, update the tracepoints to report the domain. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: track cow/shared record domains explicitly in xfs_refcount_irecGravatar Darrick J. Wong 5-67/+151
Just prior to committing the reflink code into upstream, the xfs maintainer at the time requested that I find a way to shard the refcount records into two domains -- one for records tracking shared extents, and a second for tracking CoW staging extents. The idea here was to minimize mount time CoW reclamation by pushing all the CoW records to the right edge of the keyspace, and it was accomplished by setting the upper bit in rc_startblock. We don't allow AGs to have more than 2^31 blocks, so the bit was free. Unfortunately, this was a very late addition to the codebase, so most of the refcount record processing code still treats rc_startblock as a u32 and pays no attention to whether or not the upper bit (the cow flag) is set. This is a weakness is theoretically exploitable, since we're not fully validating the incoming metadata records. Fuzzing demonstrates practical exploits of this weakness. If the cow flag of a node block key record is corrupted, a lookup operation can go to the wrong record block and start returning records from the wrong cow/shared domain. This causes the math to go all wrong (since cow domain is still implicit in the upper bit of rc_startblock) and we can crash the kernel by tricking xfs into jumping into a nonexistent AG and tripping over xfs_perag_get(mp, <nonexistent AG>) returning NULL. To fix this, start tracking the domain as an explicit part of struct xfs_refcount_irec, adjust all refcount functions to check the domain of a returned record, and alter the function definitions to accept them where necessary. Found by fuzzing keys[2].cowflag = add in xfs/464. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: refactor refcount record usage in xchk_refcountbt_recGravatar Darrick J. Wong 1-30/+24
Consolidate the open-coded xfs_refcount_irec fields into an actual struct and use the existing _btrec_to_irec to decode the ondisk record. This will reduce code churn in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: move _irec structs to xfs_types.hGravatar Darrick J. Wong 2-20/+20
Structure definitions for incore objects do not belong in the ondisk format header. Move them to the incore types header where they belong. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: check deferred refcount op continuation parametersGravatar Darrick J. Wong 1-2/+36
If we're in the middle of a deferred refcount operation and decide to roll the transaction to avoid overflowing the transaction space, we need to check the new agbno/aglen parameters that we're about to record in the new intent. Specifically, we need to check that the new extent is completely within the filesystem, and that continuation does not put us into a different AG. If the keys of a node block are wrong, the lookup to resume an xfs_refcount_adjust_extents operation can put us into the wrong record block. If this happens, we might not find that we run out of aglen at an exact record boundary, which will cause the loop control to do the wrong thing. The previous patch should take care of that problem, but let's add this extra sanity check to stop corruption problems sooner than later. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>