aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2024-03-04btrfs: change BUG_ON to assertion in reset_balance_state()Gravatar David Sterba 1-1/+1
The balance state machine is complex so it's good to verify the assumptions in helpers, however reset_balance_state() is used at the end of balance and fs_info::balance_ctl is properly set up before and protected by the exclusive op ownership in btrfs_balance(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: change BUG_ON to assertion when verifying root in ↵Gravatar David Sterba 1-1/+1
btrfs_alloc_reserved_file_extent() The file extents are normally reserved in subvolume roots but could be also in the data reloc tree. Change the BUG_ON to assertions as this verifies the usage assumptions. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: change BUG_ON to assertion when verifying lockdep class setupGravatar David Sterba 1-1/+1
The BUG_ON in btrfs_set_buffer_lockdep_class() is a sanity check of the level which is verified in callers, e.g. when initializing an extent buffer or reading from an eb header. Change it to an assertion as this would not happen unless things are really bad and would fail elsewhere too. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: change BUG_ON to assertion in btrfs_read_roots()Gravatar David Sterba 1-1/+1
There's one caller of btrfs_read_roots() and that already uses the tree_root pointer, it's pointless to BUG_ON on it. As it's an assumption of the initialization helpers make it an assert instead. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: defrag: change BUG_ON to assertion in btrfs_defrag_leaves()Gravatar David Sterba 1-1/+1
The BUG_ON verifies a condition that should be guaranteed by the correct use of the path search (with keep_locks and lowest_level set), an assertion is the suitable check. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: change BUG_ON to assertion when checking for delayed_node rootGravatar David Sterba 1-1/+1
The pointer to root is initialized in btrfs_init_delayed_node(), no need to check for it again. Change the BUG_ON to assertion. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: delayed-inode: drop pointless BUG_ON in __btrfs_remove_delayed_item()Gravatar David Sterba 1-2/+0
There's a BUG_ON checking for a valid pointer of fs_info::delayed_root but it is valid since init_mount_fs_info() and has the same lifetime as fs_info. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: export: handle invalid inode or root reference in btrfs_get_parent()Gravatar David Sterba 1-1/+8
The get_parent handler looks up a parent of a given dentry, this can be either a subvolume or a directory. The search is set up with offset -1 but it's never expected to find such item, as it would break allowed range of inode number or a root id. This means it's a corruption (ext4 also returns this error code). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: handle invalid extent item reference found in check_committed_ref()Gravatar David Sterba 1-1/+8
The check_committed_ref() helper looks up an extent item by a key, allowing to do an inexact search when key->offset is -1. It's never expected to find such item, as it would break the allowed range of a extent item offset. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: handle chunk tree lookup error in btrfs_relocate_sys_chunks()Gravatar David Sterba 1-1/+11
The unhandled case in btrfs_relocate_sys_chunks() loop is a corruption, as it could be caused only by two impossible conditions: - at first the search key is set up to look for a chunk tree item, with offset -1, this is an inexact search and the key->offset will contain the correct offset upon a successful search, a valid chunk tree item cannot have an offset -1 - after first successful search, the found_key corresponds to a chunk item, the offset is decremented by 1 before the next loop, it's impossible to find a chunk item there due to alignment and size constraints Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: handle invalid root reference found in btrfs_init_root_free_objectid()Gravatar David Sterba 1-1/+8
The btrfs_init_root_free_objectid() looks up a root by a key, allowing to do an inexact search when key->offset is -1. It's never expected to find such item, as it would break the allowed range of a root id. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: handle invalid root reference found in btrfs_find_root()Gravatar David Sterba 1-1/+8
The btrfs_find_root() looks up a root by a key, allowing to do an inexact search when key->offset is -1. It's never expected to find such item, as it would break allowed the range of a root id. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: handle root deletion lookup error in btrfs_del_root()Gravatar David Sterba 1-2/+5
We're deleting a root and looking it up by key does not succeed, this is an inconsistent state and we can't do anything. All callers handle errors and abort a transaction. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: handle block group lookup error when it's being removedGravatar David Sterba 1-1/+3
The unlikely case of lookup error in btrfs_remove_block_group() can be handled properly, in its caller this would lead to a transaction abort. We can't do anything else, a block group must have been loaded first. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: handle invalid range and start in merge_extent_mapping()Gravatar David Sterba 1-4/+5
Turn a BUG_ON to a properly handled error and update the error message in the caller. It is expected that @em_in and @start passed to btrfs_add_extent_mapping() overlap. Besides tests, the only caller btrfs_get_extent() makes sure this is true. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: handle directory and dentry mismatch in btrfs_may_delete()Gravatar David Sterba 1-1/+3
The helper btrfs_may_delete() is a copy of generic fs/namei.c:may_delete() to verify various conditions before deletion. There's a BUG_ON added before linux.git started, we can turn it to a proper error handling at least in our local helper. A mistmatch between directory and the deleted dentry is clearly invalid. This won't be probably ever hit due to the way how the parameters are set from the caller btrfs_ioctl_snap_destroy(), using a VFS helper lookup_one(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: use READ/WRITE_ONCE for fs_devices->read_policyGravatar Naohiro Aota 2-8/+9
Since we can read/modify the value from the sysfs interface concurrently, it would be better to protect it from compiler optimizations. Currently, there is only one read policy BTRFS_READ_POLICY_PID available, so no actual problem can happen now. This is a preparation for the future expansion. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: preallocate temporary extent buffer for inode logging when neededGravatar Filipe Manana 3-36/+94
When logging an inode and we require to copy items from subvolume leaves to the log tree, we clone each subvolume leaf and than use that clone to copy items to the log tree. This is required to avoid possible deadlocks as stated in commit 796787c978ef ("btrfs: do not modify log tree while holding a leaf from fs tree locked"). The cloning requires allocating an extent buffer (struct extent_buffer) and then allocating pages (folios) to attach to the extent buffer. This may be slow in case we are under memory pressure, and since we are doing the cloning while holding a read lock on a subvolume leaf, it means we can be blocking other operations on that leaf for significant periods of time, which can increase latency on operations like creating other files, renaming files, etc. Similarly because we're under a log transaction, we may also cause extra delay on other tasks doing an fsync, because syncing the log requires waiting for tasks that joined a log transaction to exit the transaction. So to improve this, for any inode logging operation that needs to copy items from a subvolume leaf ("full sync" or "copy everything" bit set in the inode), preallocate a dummy extent buffer before locking any extent buffer from the subvolume tree, and even before joining a log transaction, add it to the log context and then use it when we need to copy items from a subvolume leaf to the log tree. This avoids making other operations get extra latency when waiting to lock a subvolume leaf that is used during inode logging and we are under heavy memory pressure. The following test script with bonnie++ was used to test this: $ cat test.sh #!/bin/bash DEV=/dev/sdh MNT=/mnt/sdh MOUNT_OPTIONS="-o ssd" MEMTOTAL_BYTES=`free -b | grep Mem: | awk '{ print $2 }'` NR_DIRECTORIES=20 NR_FILES=20480 DATASET_SIZE=$((MEMTOTAL_BYTES * 2 / 1048576)) DIRECTORY_SIZE=$((MEMTOTAL_BYTES * 2 / NR_FILES)) NR_FILES=$((NR_FILES / 1024)) echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT bonnie++ -u root -d $MNT \ -n $NR_FILES:$DIRECTORY_SIZE:$DIRECTORY_SIZE:$NR_DIRECTORIES \ -r 0 -s $DATASET_SIZE -b umount $MNT The results of this test on a 8G VM running a non-debug kernel (Debian's default kernel config), were the following. Before this change: Version 2.00a ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP debian0 7501M 376k 99 1.4g 96 117m 14 1510k 99 2.5g 95 +++++ +++ Latency 35068us 24976us 2944ms 30725us 71770us 26152us Version 2.00a ------Sequential Create------ --------Random Create-------- debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 20:384100:384100/20 20480 32 20480 58 20480 48 20480 39 20480 56 20480 61 Latency 411ms 11914us 119ms 617ms 10296us 110ms After this change: Version 2.00a ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP debian0 7501M 375k 99 1.4g 97 117m 14 1546k 99 2.3g 98 +++++ +++ Latency 35975us 20945us 2144ms 10297us 2217us 6004us Version 2.00a ------Sequential Create------ --------Random Create-------- debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 20:384100:384100/20 20480 35 20480 58 20480 48 20480 40 20480 57 20480 59 Latency 320ms 11237us 77779us 518ms 6470us 86389us Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: add comment about list_is_singular() use at btrfs_delete_unused_bgs()Gravatar Filipe Manana 1-0/+7
At btrfs_delete_unused_bgs(), the use of the list_is_singular() check on a block group may not be immediately obvious. It is there to prevent losing raid profile information for a block group type (data, metadata or system), as that information is removed from fs_info->avail_[data|metadata|system]_alloc_bits when the last block group of a given type is deleted. So deleting the block group would later result in creating block groups of that type with a single profile (because fs_info->avail_*_alloc_bits would have a value of 0). This check was added in commit aefbe9a633b5 ("btrfs: Fix lost-data-profile caused by auto removing bg"). So add a comment mentioning the need for the check. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: document what the spinlock unused_bgs_lock protectsGravatar Filipe Manana 1-0/+3
Add some comments to struct btrfs_fs_info to explicitly document which members are protected by the spinlock unused_bgs_lock. It is currently used to protect two linked lists, the reclaim_bgs and unused_bgs lists. So add an explicit comment on top of each list to mention its protected by unused_bgs_lock, as well as comment on top of unused_bgs_lock to mention the lists it protects. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: make btrfs_error_unpin_extent_range() return voidGravatar David Sterba 2-9/+7
This helper is used in transaction abort or cleanup context and the callers cannot handle all errors, only do best effort. btrfs_cleanup_one_transaction btrfs_destroy_delayed_refs btrfs_error_unpin_extent_range btrfs_destroy_pinned_extent btrfs_error_unpin_extent_range Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: return errors from unpin_extent_range()Gravatar David Sterba 2-5/+16
Handle the lookup failure of the block group to unpin, this is a logic error as the block group must exist at this point. If not, something else must have freed it, like clean_pinned_extents() would do without locking the unused_bg_unpin_mutex. Push the errors to the callers, proper handling will be done in followup patches. Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: handle errors returned from unpin_extent_cache()Gravatar David Sterba 2-3/+16
We've had numerous attempts to let function unpin_extent_cache() return void as it only returns 0. There are still error cases to handle so do that, in addition to the verbose messages. The only caller btrfs_finish_one_ordered() will now abort the transaction, previously it let it continue which could lead to further problems. Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: zlib: Fix spelling mistake "infalte" -> "inflate"Gravatar Colin Ian King 1-1/+1
There is a spelling mistake in a warning message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: zstd: fix and simplify the inline extent decompression (v2)Gravatar Qu Wenruo 2-54/+24
Note: this is a fixed version that was previously reverted as e01a83e12604 ("Revert "btrfs: zstd: fix and simplify the inline extent decompression""), with fixed parameters to memzero_page(). [BUG] If we have a filesystem with 4k sectorsize, and an inlined compressed extent created like this: item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160 generation 8 transid 8 size 4096 nbytes 4096 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24 index 2 namelen 14 name: source_inlined item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69 generation 8 type 0 (inline) inline extent data size 48 ram_bytes 4096 compression 3 (zstd) Then trying to reflink that extent in an aarch64 system with 64K page size, the reflink would just fail: # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest XFS_IOC_CLONE_RANGE: Input/output error [CAUSE] In zstd_decompress(), we didn't treat @start_byte as just a page offset, but also use it as an indicator on whether we should error out, without any proper explanation (this is copied from other decompression code). In reality, for subpage cases, although @start_byte can be non-zero, we should never switch input/output buffer nor error out, since the whole input/output buffer should never exceed one sector, thus we should not need to do any buffer switch. Thus the current code using @start_byte as a condition to switch input/output buffer or finish the decompression is completely incorrect. [FIX] The fix involves several modification: - Rename @start_byte to @dest_pgoff to properly express its meaning - Use @sectorsize other than PAGE_SIZE to properly initialize the output buffer size - Use correct destination offset inside the destination page - Simplify the main loop Since the input/output buffer should never switch, we only need one zstd_decompress_stream() call. - Consider early end as an error After the fix, even on 64K page sized aarch64, above reflink now works as expected: # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest linked 4096/4096 bytes at offset 61440 And results the correct file layout: item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160 generation 10 transid 10 size 65536 nbytes 4096 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14 index 3 namelen 4 name: dest item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83 location key (0 UNKNOWN.0 0) type XATTR transid 10 data_len 37 name_len 16 name: security.selinux data unconfined_u:object_r:unlabeled_t:s0 item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53 generation 10 type 1 (regular) extent data disk byte 13631488 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression 0 (none) Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove unused included headersGravatar David Sterba 42-65/+6
With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: replace i_blocksize by fs_info::sectorsizeGravatar David Sterba 1-2/+2
The block size calculated by i_blocksize from inode is the same as what we have in fs_info, initalized in inode_init_always(). Unify that to use the fs_info value everywhere. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: replace sb::s_blocksize by fs_info::sectorsizeGravatar David Sterba 7-9/+11
The block size stored in the super block is used by subsystems outside of btrfs and it's a copy of fs_info::sectorsize. Unify that to always use our sectorsize, with the exception of mount where we first need to use fixed values (4K) until we read the super block and can set the sectorsize. Replace all uses, in most cases it's fewer pointer indirections. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove duplicate recording of physical addressGravatar Johannes Thumshirn 1-2/+0
Remove the duplicate physical recording of the original write physical address in case of a single device write. This duplicated code is most likely present due to a rebase error. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: page to folio conversion in btrfs_truncate_block()Gravatar Goldwyn Rodrigues 1-22/+24
Convert use of struct page to struct folio inside btrfs_truncate_block(). The only page based function is set_page_extent_mapped(). All other functions have folio equivalents. Had to use __filemap_get_folio() because filemap_grab_folio() does not allow passing allocation mask as a parameter. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: use a folio array throughout the defrag processGravatar Matthew Wilcox (Oracle) 1-23/+21
Remove more hidden calls to compound_head() by using an array of folios instead of pages. Also neaten the error path in defrag_one_range() by adjusting the length of the array instead of checking for NULL. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: convert defrag_prepare_one_page() to use a folioGravatar Matthew Wilcox (Oracle) 1-26/+27
Use a folio throughout defrag_prepare_one_page() to remove dozens of hidden calls to compound_head(). There is no support here for large folios; indeed, turn the existing check for PageCompound into a check for large folios. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: add set_folio_extent_mapped() helperGravatar Matthew Wilcox (Oracle) 2-4/+9
Turn set_page_extent_mapped() into a wrapper around this version. Saves a call to compound_head() for callers who already have a folio and removes a couple of users of page->mapping. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: WARN_ON_ONCE() in our leak detection codeGravatar Josef Bacik 3-0/+3
fstests looks for WARN_ON's in dmesg. Add WARN_ON_ONCE() to our leak detection code (enabled only in debug builds) so that fstests will fail if these things trip at all. This will allow us to easily catch problems with our reference counting that may otherwise go unnoticed. Reviewed-by: Neal Gompa <neal@gompa.dev> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove extent_map_tree forward declaration at extent_io.hGravatar Filipe Manana 1-2/+0
There's no need to do a forward declaration of struct extent_map_tree at extent_io.h, as there are no function prototypes, inline functions or data structures that refer to struct extent_map_tree. So remove that forward declaration, which is not needed since commit 477a30ba5f8d ("btrfs: Sink extent_tree arguments in try_release_extent_mapping"). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: cache folio size and shift in extent_bufferGravatar Qu Wenruo 5-28/+42
After the conversion to folio interfaces (but without the patch to enable larger folio allocation), there is an LTP report about observable performance drop on metadata heavy operations. https://lore.kernel.org/linux-btrfs/202312221750.571925bd-oliver.sang@intel.com/ This drop is caused by the extra code of calculating the folio_size()/folio_shift(), instead of the old hard coded PAGE_SIZE/PAGE_SHIFT. To slightly reduce the overhead, just cache both folio_size and folio_shift in extent_buffer. The two new members (u32 folio_size and u8 folio_shift) are stored inside the holes of extent_buffer. folio_size is shared with len, which is reduced to u32. The size of eb does not change. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove unused variable bio_offset from end_bbio_data_read()Gravatar Qu Wenruo 1-9/+0
The variable @bio_offset was introduced in commit 7ffd27e378d2 ("btrfs: pass bio_offset to check_data_csum() directly"), when we are still using the same endio function for both data and metadata. Later we had several changes to data and metadata endio functions: - Data verification is handled by btrfs bio layer - Split data and metadata endio paths Now for data path we no longer do any verification in end_bbio_data_read(), as the verification is handled by btrfs bio layer already. Thus there is no need for such bio_offset variable. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove the pg_offset parameter from btrfs_get_extent()Gravatar Qu Wenruo 5-44/+36
The parameter @pg_offset of btrfs_get_extent() is only utilized for inlined extent, and we already have an ASSERT() and tree-checker, to make sure we can only get inline extent at file offset 0. Any invalid inline extent with non-zero file offset would be rejected by tree-checker in the first place. Thus the @pg_offset parameter is not really necessary, just remove it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-02Merge tag 'xfs-6.8-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxGravatar Linus Torvalds 1-1/+0
Pull xfs fix from Chandan Babu: "Drop experimental warning message when mounting an xfs filesystem on an fsdax device. We now consider xfs on fsdax to be stable" * tag 'xfs-6.8-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: drop experimental warning for FSDAX
2024-03-01Merge tag 'ceph-for-6.8-rc7' of https://github.com/ceph/ceph-clientGravatar Linus Torvalds 2-4/+9
Pull ceph fix from Ilya Dryomov: "Catch up with mdsmap encoding rectification which ended up being necessary after all to enable cluster upgrades from problematic v18.2.0 and v18.2.1 releases" * tag 'ceph-for-6.8-rc7' of https://github.com/ceph/ceph-client: ceph: switch to corrected encoding of max_xattr_size in mdsmap
2024-03-01Merge tag 'for-6.8-rc6-tag' of ↵Gravatar Linus Torvalds 6-35/+139
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix freeing allocated id for anon dev when snapshot creation fails - fiemap fixes: - followup for a recent deadlock fix, ranges that fiemap can access can still race with ordered extent completion - make sure fiemap with SYNC flag does not race with writes * tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix double free of anonymous device after snapshot creation failure btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given btrfs: fix race between ordered extent completion and fiemap
2024-03-01Merge tag 'exfat-for-6.8-rc7' of ↵Gravatar Linus Torvalds 1-15/+22
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat fix from Namjae Jeon: - Fix ftruncate failure when allocating non-contiguous clusters * tag 'exfat-for-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: fix appending discontinuous clusters to empty file
2024-03-01Merge tag 'vfs-6.8-rc7.fixes' of ↵Gravatar Linus Torvalds 2-17/+14
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "Two small fixes: - Fix an endless loop during afs directory iteration caused by not skipping silly-rename files correctly. - Fix reporting of completion events for aio causing leaks in userspace. This is based on the fix last week as it's now possible to recognize aio events submitted through the old aio interface" * tag 'vfs-6.8-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs/aio: Make io_cancel() generate completions again afs: Fix endless loop in directory parsing
2024-03-01Merge tag 'efi-fixes-for-v6.8-2' of ↵Gravatar Linus Torvalds 3-17/+15
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: "Only the EFI variable name size change is significant, and will be backported once it lands. The others are cleanup. - Fix phys_addr_t size confusion in 32-bit capsule loader - Reduce maximum EFI variable name size to 512 to work around buggy firmware - Drop some redundant code from efivarfs while at it" * tag 'efi-fixes-for-v6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efivarfs: Drop 'duplicates' bool parameter on efivar_init() efivarfs: Drop redundant cleanup on fill_super() failure efivarfs: Request at most 512 bytes for variable names efi/capsule-loader: fix incorrect allocation size
2024-02-29btrfs: fix double free of anonymous device after snapshot creation failureGravatar Filipe Manana 4-14/+14
When creating a snapshot we may do a double free of an anonymous device in case there's an error committing the transaction. The second free may result in freeing an anonymous device number that was allocated by some other subsystem in the kernel or another btrfs filesystem. The steps that lead to this: 1) At ioctl.c:create_snapshot() we allocate an anonymous device number and assign it to pending_snapshot->anon_dev; 2) Then we call btrfs_commit_transaction() and end up at transaction.c:create_pending_snapshot(); 3) There we call btrfs_get_new_fs_root() and pass it the anonymous device number stored in pending_snapshot->anon_dev; 4) btrfs_get_new_fs_root() frees that anonymous device number because btrfs_lookup_fs_root() returned a root - someone else did a lookup of the new root already, which could some task doing backref walking; 5) After that some error happens in the transaction commit path, and at ioctl.c:create_snapshot() we jump to the 'fail' label, and after that we free again the same anonymous device number, which in the meanwhile may have been reallocated somewhere else, because pending_snapshot->anon_dev still has the same value as in step 1. Recently syzbot ran into this and reported the following trace: ------------[ cut here ]------------ ida_free called for id=51 which is not allocated. WARNING: CPU: 1 PID: 31038 at lib/idr.c:525 ida_free+0x370/0x420 lib/idr.c:525 Modules linked in: CPU: 1 PID: 31038 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00410-gc02197fc9076 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 RIP: 0010:ida_free+0x370/0x420 lib/idr.c:525 Code: 10 42 80 3c 28 (...) RSP: 0018:ffffc90015a67300 EFLAGS: 00010246 RAX: be5130472f5dd000 RBX: 0000000000000033 RCX: 0000000000040000 RDX: ffffc90009a7a000 RSI: 000000000003ffff RDI: 0000000000040000 RBP: ffffc90015a673f0 R08: ffffffff81577992 R09: 1ffff92002b4cdb4 R10: dffffc0000000000 R11: fffff52002b4cdb5 R12: 0000000000000246 R13: dffffc0000000000 R14: ffffffff8e256b80 R15: 0000000000000246 FS: 00007fca3f4b46c0(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f167a17b978 CR3: 000000001ed26000 CR4: 0000000000350ef0 Call Trace: <TASK> btrfs_get_root_ref+0xa48/0xaf0 fs/btrfs/disk-io.c:1346 create_pending_snapshot+0xff2/0x2bc0 fs/btrfs/transaction.c:1837 create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1931 btrfs_commit_transaction+0xf1c/0x3740 fs/btrfs/transaction.c:2404 create_snapshot+0x507/0x880 fs/btrfs/ioctl.c:848 btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:998 btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1044 __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1306 btrfs_ioctl_snap_create_v2+0x1ca/0x400 fs/btrfs/ioctl.c:1393 btrfs_ioctl+0xa74/0xd40 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl+0xfe/0x170 fs/ioctl.c:857 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7fca3e67dda9 Code: 28 00 00 00 (...) RSP: 002b:00007fca3f4b40c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fca3e7abf80 RCX: 00007fca3e67dda9 RDX: 00000000200005c0 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 00007fca3e6ca47a R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007fca3e7abf80 R15: 00007fff6bf95658 </TASK> Where we get an explicit message where we attempt to free an anonymous device number that is not currently allocated. It happens in a different code path from the example below, at btrfs_get_root_ref(), so this change may not fix the case triggered by syzbot. To fix at least the code path from the example above, change btrfs_get_root_ref() and its callers to receive a dev_t pointer argument for the anonymous device number, so that in case it frees the number, it also resets it to 0, so that up in the call chain we don't attempt to do the double free. CC: stable@vger.kernel.org # 5.10+ Link: https://lore.kernel.org/linux-btrfs/000000000000f673a1061202f630@google.com/ Fixes: e03ee2fe873e ("btrfs: do not ASSERT() if the newly created subvolume already got read") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-29btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is givenGravatar Filipe Manana 2-14/+29
When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that are no concurrent writes and we get a stable view of the inode's extent layout. When the flag is given we flush all IO (and wait for ordered extents to complete) and then lock the inode in shared mode, however that leaves open the possibility that a write might happen right after the flushing and before locking the inode. So fix this by flushing again after locking the inode - we leave the initial flushing before locking the inode to avoid holding the lock and blocking other RO operations while waiting for IO and ordered extents to complete. The second flushing while holding the inode's lock will most of the time do nothing or very little since the time window for new writes to have happened is small. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-29btrfs: fix race between ordered extent completion and fiemapGravatar Filipe Manana 1-7/+96
For fiemap we recently stopped locking the target extent range for the whole duration of the fiemap call, in order to avoid a deadlock in a scenario where the fiemap buffer happens to be a memory mapped range of the same file. This use case is very unlikely to be useful in practice but it may be triggered by fuzz testing (syzbot, etc). However by not locking the target extent range for the whole duration of the fiemap call we can race with an ordered extent. This happens like this: 1) The fiemap task finishes processing a file extent item that covers the file range [512K, 1M[, and that file extent item is the last item in the leaf currently being processed; 2) And ordered extent for the file range [768K, 2M[, in COW mode, completes (btrfs_finish_one_ordered()) and the file extent item covering the range [512K, 1M[ is trimmed to cover the range [512K, 768K[ and then a new file extent item for the range [768K, 2M[ is inserted in the inode's subvolume tree; 3) The fiemap task calls fiemap_next_leaf_item(), which then calls btrfs_next_leaf() to find the next leaf / item. This finds that the the next key following the one we previously processed (its type is BTRFS_EXTENT_DATA_KEY and its offset is 512K), is the key corresponding to the new file extent item inserted by the ordered extent, which has a type of BTRFS_EXTENT_DATA_KEY and an offset of 768K; 4) Later the fiemap code ends up at emit_fiemap_extent() and triggers the warning: if (cache->offset + cache->len > offset) { WARN_ON(1); return -EINVAL; } Since we get 1M > 768K, because the previously emitted entry for the old extent covering the file range [512K, 1M[ ends at an offset that is greater than the new extent's start offset (768K). This makes fiemap fail with -EINVAL besides triggering the warning that produces a stack trace like the following: [1621.677651] ------------[ cut here ]------------ [1621.677656] WARNING: CPU: 1 PID: 204366 at fs/btrfs/extent_io.c:2492 emit_fiemap_extent+0x84/0x90 [btrfs] [1621.677899] Modules linked in: btrfs blake2b_generic (...) [1621.677951] CPU: 1 PID: 204366 Comm: pool Not tainted 6.8.0-rc5-btrfs-next-151+ #1 [1621.677954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [1621.677956] RIP: 0010:emit_fiemap_extent+0x84/0x90 [btrfs] [1621.678033] Code: 2b 4c 89 63 (...) [1621.678035] RSP: 0018:ffffab16089ffd20 EFLAGS: 00010206 [1621.678037] RAX: 00000000004fa000 RBX: ffffab16089ffe08 RCX: 0000000000009000 [1621.678039] RDX: 00000000004f9000 RSI: 00000000004f1000 RDI: ffffab16089ffe90 [1621.678040] RBP: 00000000004f9000 R08: 0000000000001000 R09: 0000000000000000 [1621.678041] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000041d78000 [1621.678043] R13: 0000000000001000 R14: 0000000000000000 R15: ffff9434f0b17850 [1621.678044] FS: 00007fa6e20006c0(0000) GS:ffff943bdfa40000(0000) knlGS:0000000000000000 [1621.678046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1621.678048] CR2: 00007fa6b0801000 CR3: 000000012d404002 CR4: 0000000000370ef0 [1621.678053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1621.678055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1621.678056] Call Trace: [1621.678074] <TASK> [1621.678076] ? __warn+0x80/0x130 [1621.678082] ? emit_fiemap_extent+0x84/0x90 [btrfs] [1621.678159] ? report_bug+0x1f4/0x200 [1621.678164] ? handle_bug+0x42/0x70 [1621.678167] ? exc_invalid_op+0x14/0x70 [1621.678170] ? asm_exc_invalid_op+0x16/0x20 [1621.678178] ? emit_fiemap_extent+0x84/0x90 [btrfs] [1621.678253] extent_fiemap+0x766/0xa30 [btrfs] [1621.678339] btrfs_fiemap+0x45/0x80 [btrfs] [1621.678420] do_vfs_ioctl+0x1e4/0x870 [1621.678431] __x64_sys_ioctl+0x6a/0xc0 [1621.678434] do_syscall_64+0x52/0x120 [1621.678445] entry_SYSCALL_64_after_hwframe+0x6e/0x76 There's also another case where before calling btrfs_next_leaf() we are processing a hole or a prealloc extent and we had several delalloc ranges within that hole or prealloc extent. In that case if the ordered extents complete before we find the next key, we may end up finding an extent item with an offset smaller than (or equals to) the offset in cache->offset. So fix this by changing emit_fiemap_extent() to address these three scenarios like this: 1) For the first case, steps listed above, adjust the length of the previously cached extent so that it does not overlap with the current extent, emit the previous one and cache the current file extent item; 2) For the second case where he had a hole or prealloc extent with multiple delalloc ranges inside the hole or prealloc extent's range, and the current file extent item has an offset that matches the offset in the fiemap cache, just discard what we have in the fiemap cache and assign the current file extent item to the cache, since it's more up to date; 3) For the third case where he had a hole or prealloc extent with multiple delalloc ranges inside the hole or prealloc extent's range and the offset of the file extent item we just found is smaller than what we have in the cache, just skip the current file extent item if its range end at or behind the cached extent's end, because we may have emitted (to the fiemap user space buffer) delalloc ranges that overlap with the current file extent item's range. If the file extent item's range goes beyond the end offset of the cached extent, just emit the cached extent and cache a subrange of the file extent item, that goes from the end offset of the cached extent to the end offset of the file extent item. Dealing with those cases in those ways makes everything consistent by reflecting the current state of file extent items in the btree and without emitting extents that have overlapping ranges (which would be confusing and violating expectations). This issue could be triggered often with test case generic/561, and was also hit and reported by Wang Yugui. Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20240223104619.701F.409509F4@e16-tech.com/ Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-27fs/aio: Make io_cancel() generate completions againGravatar Bart Van Assche 1-16/+11
The following patch accidentally removed the code for delivering completions for cancelled reads and writes to user space: "[PATCH 04/33] aio: remove retry-based AIO" (https://lore.kernel.org/all/1363883754-27966-5-git-send-email-koverstreet@google.com/) >From that patch: - if (kiocbIsCancelled(iocb)) { - ret = -EINTR; - aio_complete(iocb, ret, 0); - /* must not access the iocb after this */ - goto out; - } This leads to a leak in user space of a struct iocb. Hence this patch that restores the code that reports to user space that a read or write has been cancelled successfully. Fixes: 41003a7bcfed ("aio: remove retry-based AIO") Cc: Christoph Hellwig <hch@lst.de> Cc: Avi Kivity <avi@scylladb.com> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240215204739.2677806-3-bvanassche@acm.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-27afs: Fix endless loop in directory parsingGravatar David Howells 1-1/+3
If a directory has a block with only ".__afsXXXX" files in it (from uncompleted silly-rename), these .__afsXXXX files are skipped but without advancing the file position in the dir_context. This leads to afs_dir_iterate() repeating the block again and again. Fix this by making the code that skips the .__afsXXXX file also manually advance the file position. The symptoms are a soft lookup: watchdog: BUG: soft lockup - CPU#3 stuck for 52s! [check:5737] ... RIP: 0010:afs_dir_iterate_block+0x39/0x1fd ... ? watchdog_timer_fn+0x1a6/0x213 ... ? asm_sysvec_apic_timer_interrupt+0x16/0x20 ? afs_dir_iterate_block+0x39/0x1fd afs_dir_iterate+0x10a/0x148 afs_readdir+0x30/0x4a iterate_dir+0x93/0xd3 __do_sys_getdents64+0x6b/0xd4 This is almost certainly the actual fix for: https://bugzilla.kernel.org/show_bug.cgi?id=218496 Fixes: 57e9d49c5452 ("afs: Hide silly-rename files from userspace") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/786185.1708694102@warthog.procyon.org.uk Reviewed-by: Marc Dionne <marc.dionne@auristor.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Markus Suvanto <markus.suvanto@gmail.com> cc: linux-afs@lists.infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-27xfs: drop experimental warning for FSDAXGravatar Shiyang Ruan 1-1/+0
FSDAX and reflink can work together now, let's drop this warning. Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>