linux-net

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/ synced 2026-04-18 06:33:43 -04:00

Author	SHA1	Message	Date
Sun YangKai	19eff93dc7	btrfs: fix periodic reclaim condition Problems with current implementation: 1. reclaimable_bytes is signed while chunk_sz is unsigned, causing negative reclaimable_bytes to trigger reclaim unexpectedly 2. The "space must be freed between scans" assumption breaks the two-scan requirement: first scan marks block groups, second scan reclaims them. Without the second scan, no reclamation occurs. Instead, track actual reclaim progress: pause reclaim when block groups will be reclaimed, and resume only when progress is made. This ensures reclaim continues until no further progress can be made. And resume periodic reclaim when there's enough free space. And we take care if reclaim is making any progress now, so it's unnecessary to set periodic_reclaim_ready to false when failed to reclaim a block group. Fixes: `813d4c6422` ("btrfs: prevent pathological periodic reclaim loops") CC: stable@vger.kernel.org # 6.12+ Suggested-by: Boris Burkov <boris@bur.io> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:37 +01:00
Filipe Manana	4ac81c3811	btrfs: use the btrfs_block_group_end() helper everywhere We have a helper to calculate a block group's exclusive end offset, but we only use it in some places. Update every site that open codes the calculation to use the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:36 +01:00
Mark Harmstone	2aef934b56	btrfs: populate fully_remapped_bgs_list on mount Add a function btrfs_populate_fully_remapped_bgs_list() which gets called on mount, which looks for fully remapped block groups (i.e. identity_remap_count == 0) which haven't yet had their chunk stripes and device extents removed. This happens when a filesystem is unmounted while async discard has not yet finished, as otherwise the data range occupied by the chunk stripes would be permanently unusable. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:36 +01:00
Mark Harmstone	7cddbb4339	btrfs: handle discarding fully-remapped block groups Discard normally works by iterating over the free-space entries of a block group. This doesn't work for fully-remapped block groups, as we removed their free-space entries when we started relocation. For sync discard, call btrfs_discard_extent() when we commit the transaction in which the last identity remap was removed. For async discard, add a new function btrfs_trim_fully_remapped_block_group() to be called by the discard worker, which iterates over the block group's range using the normal async discard rules. Once we reach the end, remove the chunk's stripes and device extents to get back its free space. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:36 +01:00
Mark Harmstone	b56f35560b	btrfs: handle setting up relocation of block group with remap-tree Handle the preliminary work for relocating a block group in a filesystem with the remap-tree flag set. If the block group is SYSTEM btrfs_relocate_block_group() proceeds as it does already, as bootstrapping issues mean that these block groups have to be processed the existing way. Similarly with METADATA_REMAP blocks, which are dealt with in a later patch. Otherwise we walk the free-space tree for the block group in question, recording any holes. These get converted into identity remaps and placed in the remap tree, and the block group's REMAPPED flag is set. From now on no new allocations are possible within this block group, and any I/O to it will be funnelled through btrfs_translate_remap(). We store the number of identity remaps in `identity_remap_count`, so that we know when we've removed the last one and the block group is fully remapped. The change in btrfs_read_roots() is because data relocations no longer rely on the data reloc tree as a hidden subvolume in which to do snapshots. (Thanks to Sun YangKai for his suggestions.) Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:35 +01:00
Mark Harmstone	979e1dc3d6	btrfs: handle deletions from remapped block group Handle the case where we free an extent from a block group that has the REMAPPED flag set. Because the remap tree is orthogonal to the extent tree, for data this may be within any number of identity remaps or actual remaps. If we're freeing a metadata node, this will be wholly inside one or the other. btrfs_remove_extent_from_remap_tree() searches the remap tree for the remaps that cover the range in question, then calls remove_range_from_remap_tree() for each one, to punch a hole in the remap and adjust the free-space tree. For an identity remap, remove_range_from_remap_tree() will adjust the block group's `identity_remap_count` if this changes. If it reaches zero we mark the block group as fully remapped. For an identity remap, remove_range_from_remap_tree() will adjust the block group's `identity_remap_count` if this changes. If it reaches zero we mark the block group as fully remapped. Fully remapped block groups have their chunk stripes removed and their device extents freed, which makes the disk space available again to the chunk allocator. This happens asynchronously: in the cleaner thread for sync discard and nodiscard, and (in a later patch) in the discard worker for async discard. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:35 +01:00
Mark Harmstone	7977011460	btrfs: add extended version of struct block_group_item Add a struct btrfs_block_group_item_v2, which is used in the block group tree if the remap-tree incompat flag is set. This adds two new fields to the block group item: `remap_bytes` and `identity_remap_count`. `remap_bytes` records the amount of data that's physically within this block group, but nominally in another, remapped block group. This is necessary because this data will need to be moved first if this block group is itself relocated. If `remap_bytes` > 0, this is an indicator to the relocation thread that it will need to search the remap-tree for backrefs. A block group must also have `remap_bytes` == 0 before it can be dropped. `identity_remap_count` records how many identity remap items are located in the remap tree for this block group. When relocation is begun for this block group, this is set to the number of holes in the free-space tree for this range. As identity remaps are converted into actual remaps by the relocation process, this number is decreased. Once it reaches 0, either because of relocation or because extents have been deleted, the block group has been fully remapped and its chunk's device extents are removed. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:34 +01:00
Mark Harmstone	bf8ff4b9f0	btrfs: rename struct btrfs_block_group field commit_used to last_used Rename the field commit_used in struct btrfs_block_group to last_used, for clarity and consistency with the similar fields we're about to add. It's not obvious that commit_flags means "flags as of the last commit" rather than "flags related to a commit". Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:34 +01:00
Mark Harmstone	76377db18a	btrfs: remove remapped block groups from the free-space-tree No new allocations can be done from block groups that have the REMAPPED flag set, so there's no value in their having entries in the free-space tree. Prevent a search through the free-space tree being scheduled for such a block group, and prevent any additions to the in-memory free-space tree. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:54:34 +01:00
Filipe Manana	c208aa0ef6	btrfs: add and use helper to compute the available space for a block group We have currently three places that compute how much available space a block group has. Add a helper function for this and use it in those places. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:44 +01:00
David Sterba	e582f22030	btrfs: split btrfs_fs_closing() and change return type to bool There are two tests in btrfs_fs_closing() but checking the BTRFS_FS_CLOSING_DONE bit is done only in one place load_extent_tree_free(). As this is an inline we can reduce size of the generated code. The types can be also changed to bool as this becomes a simple condition. text data bss dec hex filename 1674006 146704 15560 1836270 1c04ee pre/btrfs.ko 1673772 146704 15560 1836036 1c0404 post/btrfs.ko DELTA: -234 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:51:42 +01:00
Johannes Thumshirn	a464ed9834	btrfs: rename btrfs_create_block_group_cache to btrfs_create_block_group struct btrfs_block_group used to be called struct btrfs_block_group_cache but got renamed to btrfs_block_group with commit `32da5386d9` ("btrfs: rename btrfs_block_group_cache"). Rename btrfs_create_block_group_cache() to btrfs_create_block_group() to reflect that change. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 07:48:44 +01:00
David Sterba	4b117be65f	btrfs: merge setting ret and return ret In many places we have pattern: ret = ...; return ret; This can be simplified to a direct return, removing 'ret' if not otherwise needed. The places in self tests are not converted so we can add more test cases without changing surrounding code (extent-map-tests.c:test_case_4()). Signed-off-by: David Sterba <dsterba@suse.com>	2026-02-03 06:38:33 +01:00
Linus Torvalds	7696286034	Merge tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Features: - shutdown ioctl support (needs CONFIG_BTRFS_EXPERIMENTAL for now): - set filesystem state as being shut down (also named going down in other filesystems), where all active operations return EIO and this cannot be changed until unmount - pending operations are attempted to be finished but error messages may still show up depending on where exactly the shutdown happened - scrub (and device replace) vs suspend/hibernate: - a running scrub will prevent suspend, which can be annoying as suspend is an immediate request and scrub is not critical - filesystem freezing before suspend was not sufficient as the problem was in process freezing - behaviour change: on suspend scrub and device replace are cancelled, where scrub can record the last state and continue from there; the device replace has to be restarted from the beginning - zone stats exported in sysfs, from the perspective of the filesystem this includes active, reclaimable, relocation etc zones Performance: - improvements when processing space reservation tickets by optimizing locking and shrinking critical sections, cumulative improvements in lockstat numbers show +15% Notable fixes: - use vmalloc fallback when allocating bios as high order allocations can happen with wide checksums (like sha256) - scrub will always track the last position of progress so it's not starting from zero after an error Core: - under experimental config, checksum calculations are offloaded to process context, simplifies locking and allows to remove compression write worker kthread(s): - speed improvement in direct IO throughput with buffered IO fallback is +15% when not offloaded but this is more related to internal crypto subsystem improvements - this will be probably default in the future removing the sysfs tunable - (experimental) block size > page size updates: - support more operations when not using large folios (encoded read/write and send) - raid56 - more preparations for fscrypt support Other: - more conversions to auto-cleaned variables - parameter cleanups and removals - extended warning fixes - improved printing of structured values like keys - lots of other cleanups and refactoring" * tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits) btrfs: remove unnecessary inode key in btrfs_log_all_parents() btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root() btrfs: remaining BTRFS_PATH_AUTO_FREE conversions btrfs: send: do not allocate memory for xattr data when checking it exists btrfs: send: add unlikely to all unexpected overflow checks btrfs: reduce arguments to btrfs_del_inode_ref_in_log() btrfs: remove root argument from btrfs_del_dir_entries_in_log() btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref() btrfs: don't search back for dir inode item in INO_LOOKUP_USER btrfs: don't rewrite ret from inode_permission btrfs: add orig_logical to btrfs_bio for encryption btrfs: disable verity on encrypted inodes btrfs: disable various operations on encrypted inodes btrfs: remove redundant level reset in btrfs_del_items() btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf() btrfs: optimize balance_level() path reference handling btrfs: factor out root promotion logic into promote_child_to_root() btrfs: raid56: remove the "_step" infix btrfs: raid56: enable bs > ps support btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases ...	2025-12-03 20:03:46 -08:00
David Sterba	10934c131f	btrfs: remaining BTRFS_PATH_AUTO_FREE conversions Do the remaining btrfs_path conversion to the auto cleaning, this seems to be the last one. Most of the conversions are trivial, only adding the declaration and removing the freeing, or changing the goto patterns to return. There are some functions with many changes, like __btrfs_free_extent(), btrfs_remove_from_free_space_tree() or btrfs_add_to_free_space_tree() but it still follows the same pattern. Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-25 01:53:33 +01:00
Filipe Manana	e21756fc4a	btrfs: use booleans for delalloc arguments and struct find_free_extent_ctl The struct find_free_extent_ctl uses an int for the 'delalloc' field but it's always used as a boolean, and its value is used to be passed to several functions to signal if we are dealing with delalloc. The same goes for the 'is_data' argument from btrfs_reserve_extent(). So change the type from int to bool and move the field definition in the find_free_extent_ctl structure so that it's close to other bool fields and reduces the size of the structure from 144 down to 136 bytes (at the moment it's only declared in the stack of btrfs_reserve_extent(), never allocated otherwise). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:42:26 +01:00
Filipe Manana	d7fe41044b	btrfs: use bool type for btrfs_path members used as booleans Many fields of struct btrfs_path are used as booleans but their type is an unsigned int (of one 1 bit width to save space). Change the type to bool keeping the :1 suffix so that they combine with the previous u8 fields in order to save space. This makes the code more clear by using explicit true/false and more in line with the preferred style, preserving the size of the structure. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:42:25 +01:00
Filipe Manana	a270cb420c	btrfs: reduce block group critical section in btrfs_add_reserved_bytes() We are doing some things inside the block group's critical section that are relevant only to the space_info: updating the space_info counters bytes_reserved and bytes_may_use as well as trying to grant tickets (calling btrfs_try_granting_tickets()), and this later can take some time. So move all those updates to outside the block group's critical section and still inside the space_info's critical section. Like this we keep the block group's critical section only for block group updates and can help reduce contention on a block group's lock. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:14:40 +01:00
Filipe Manana	8b6fa164ab	btrfs: reduce block group critical section in btrfs_free_reserved_bytes() There's no need to update the space_info fields (bytes_reserved, max_extent_size, bytes_readonly, bytes_zone_unusable) while holding the block group's spinlock. So move those updates to happen after we unlock the block group (and while holding the space_info locked of course), so that all we do under the block group's critical section is to update the block group itself. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:14:06 +01:00
Filipe Manana	f7a32dd2a6	btrfs: reduce space_info critical section in btrfs_chunk_alloc() There's no need to update local variables while holding the space_info's spinlock, since the update isn't using anything from the space_info. So move these updates outside the critical section to shorten it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:13:25 +01:00
Filipe Manana	a232ff90d1	btrfs: remove fs_info argument from btrfs_zoned_activate_one_bg() We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 22:02:57 +01:00
Filipe Manana	e96059c9d7	btrfs: remove fs_info argument from btrfs_dump_space_info() We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:10 +01:00
Filipe Manana	78a77f4da4	btrfs: remove fs_info argument from btrfs_can_overcommit() We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:10 +01:00
Filipe Manana	e3df6408b1	btrfs: remove fs_info argument from btrfs_try_granting_tickets() We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:59:09 +01:00
Boris Burkov	38e818718c	btrfs: fix racy bitfield write in btrfs_clear_space_info_full() From the memory-barriers.txt document regarding memory barrier ordering guarantees: () These guarantees do not apply to bitfields, because compilers often generate code to modify these using non-atomic read-modify-write sequences. Do not attempt to use bitfields to synchronize parallel algorithms. () Even in cases where bitfields are protected by locks, all fields in a given bitfield must be protected by one lock. If two fields in a given bitfield are protected by different locks, the compiler's non-atomic read-modify-write sequences can cause an update to one field to corrupt the value of an adjacent field. btrfs_space_info has a bitfield sharing an underlying word consisting of the fields full, chunk_alloc, and flush: struct btrfs_space_info { struct btrfs_fs_info * fs_info; /* 0 8 / struct btrfs_space_info parent; /* 8 8 / ... int clamp; / 172 4 / unsigned int full:1; / 176: 0 4 / unsigned int chunk_alloc:1; / 176: 1 4 / unsigned int flush:1; / 176: 2 4 */ ... Therefore, to be safe from parallel read-modify-writes losing a write to one of the bitfield members protected by a lock, all writes to all the bitfields must use the lock. They almost universally do, except for btrfs_clear_space_info_full() which iterates over the space_infos and writes out found->full = 0 without a lock. Imagine that we have one thread completing a transaction in which we finished deleting a block_group and are thus calling btrfs_clear_space_info_full() while simultaneously the data reclaim ticket infrastructure is running do_async_reclaim_data_space(): T1 T2 btrfs_commit_transaction btrfs_clear_space_info_full data_sinfo->full = 0 READ: full:0, chunk_alloc:0, flush:1 do_async_reclaim_data_space(data_sinfo) spin_lock(&space_info->lock); if(list_empty(tickets)) space_info->flush = 0; READ: full: 0, chunk_alloc:0, flush:1 MOD/WRITE: full: 0, chunk_alloc:0, flush:0 spin_unlock(&space_info->lock); return; MOD/WRITE: full:0, chunk_alloc:0, flush:1 and now data_sinfo->flush is 1 but the reclaim worker has exited. This breaks the invariant that flush is 0 iff there is no work queued or running. Once this invariant is violated, future allocations that go into __reserve_bytes() will add tickets to space_info->tickets but will see space_info->flush is set to 1 and not queue the work. After this, they will block forever on the resulting ticket, as it is now impossible to kick the worker again. I also confirmed by looking at the assembly of the affected kernel that it is doing RMW operations. For example, to set the flush (3rd) bit to 0, the assembly is: andb $0xfb,0x60(%rbx) and similarly for setting the full (1st) bit to 0: andb $0xfe,-0x20(%rax) So I think this is really a bug on practical systems. I have observed a number of systems in this exact state, but am currently unable to reproduce it. Rather than leaving this footgun lying around for the future, take advantage of the fact that there is room in the struct anyway, and that it is already quite large and simply change the three bitfield members to bools. This avoids writes to space_info->full having any effect on writes to space_info->flush, regardless of locking. Fixes: `957780eb27` ("Btrfs: introduce ticketed enospc infrastructure") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-11-24 21:38:49 +01:00
Christian Brauner	a5e3d0be9e	btrfs: use super write guard in btrfs_reclaim_bgs_work() Link: https://patch.msgid.link/20251104-work-guards-v1-2-5108ac78a171@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-05 22:52:15 +01:00
Linus Torvalds	f3827213ab	Merge tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "There are no new features, the changes are in the core code, notably tree-log error handling and reporting improvements, and initial support for block size > page size. Performance improvements: - search data checksums in the commit root (previous transaction) to avoid locking contention, this improves parallelism of read heavy/low write workloads, and also reduces transaction commit time; on real and reproducer workload the sync time went from minutes to tens of seconds (workload and numbers are in the changelog) Core: - tree-log updates: - error handling improvements, transaction aborts - add new error state 'O' (printed in status messages) when log replay fails and is aborted - reduced number of btrfs_path allocations when traversing the tree - 'block size > page size' support - basic implementation with limitations, under experimental build - limitations: no direct io, raid56, encoded read (standalone and in send ioctl), encoded write - preparatory work for compression, removing implicit assumptions of page and block sizes - compression workspaces are now per-filesystem, we cannot assume common block size for work memory among different filesystems - tree-checker now verifies INODE_EXTREF item (which is implementing hardlinks) - tree leaf pretty printer updates, there were missing data from items, keys/items - move config option CONFIG_BTRFS_REF_VERIFY to CONFIG_BTRFS_DEBUG, it's a debugging feature and not needed to be enabled separately - more struct btrfs_path auto free updates - use ref_tracker API for tracking delayed inodes, enabled by mount option 'ref_verify', allowing to better pinpoint leaking references - in zoned mode, avoid selecting data relocation zoned for ordinary data block groups - updated and enhanced error messages - lots of cleanups and refactoring" * tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (113 commits) btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot() btrfs: add unlikely annotations to branches leading to transaction abort btrfs: add unlikely annotations to branches leading to EIO btrfs: add unlikely annotations to branches leading to EUCLEAN btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions btrfs: zoned: don't fail mount needlessly due to too many active zones btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc() btrfs: enable experimental bs > ps support btrfs: add extra ASSERT()s to catch unaligned bios btrfs: fix symbolic link reading when bs > ps btrfs: prepare scrub to support bs > ps cases btrfs: prepare zlib to support bs > ps cases btrfs: prepare lzo to support bs > ps cases btrfs: prepare zstd to support bs > ps cases btrfs: prepare compression folio alloc/free for bs > ps cases btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range() btrfs: remove pointless key offset setup in create_pending_snapshot() btrfs: annotate btrfs_is_testing() as unlikely and make it return bool btrfs: make the rule checking more readable for should_cow_block() btrfs: simplify inline extent end calculation at replay_one_extent() ...	2025-09-30 08:14:49 -07:00
Linus Torvalds	b786405685	Merge tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs workqueue updates from Christian Brauner: "This contains various workqueue changes affecting the filesystem layer. Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This replaces the use of system_wq and system_unbound_wq. system_wq is a per-CPU workqueue which isn't very obvious from the name and system_unbound_wq is to be used when locality is not required. So this renames system_wq to system_percpu_wq, and system_unbound_wq to system_dfl_wq. This also adds a new WQ_PERCPU flag to allow the fs subsystem users to explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and WQ_PERCPU flags coexist for one release cycle to allow callers to transition their calls. WQ_UNBOUND will be removed in a next release cycle" * tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: WQ_PERCPU added to alloc_workqueue users fs: replace use of system_wq with system_percpu_wq fs: replace use of system_unbound_wq with system_dfl_wq	2025-09-29 10:27:17 -07:00
David Sterba	a929904cf7	btrfs: add unlikely annotations to branches leading to transaction abort The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen. Transaction abort is one such error, the btrfs_abort_transaction() inlines code to check the state and print a warning, this ought to be out of the hot path. The most common pattern is when transaction abort is called after checking a return value and the control flow leads to a quick return. In other cases it may not be necessary to add unlikely() e.g. when the function returns anyway or the control flow is not changed noticeably. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:26 +02:00
David Sterba	9264d004a6	btrfs: add unlikely annotations to branches leading to EUCLEAN The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EUCLEAN (a corruption) is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:26 +02:00
David Sterba	17dc82dc1e	btrfs: fix typos in comments and strings Annual typo fixing pass. Strangely codespell found only about 30% of what is in this patch, the rest was done manually using text spellchecker with a custom dictionary of acceptable terms. Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-23 08:49:16 +02:00
David Sterba	67e78f983e	btrfs: convert several int parameters to bool We're almost done cleaning misused int/bool parameters. Convert a bunch of them, found by manual grepping. Note that btrfs_sync_fs() needs an int as it's mandated by the struct super_operations prototype. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-22 10:54:32 +02:00
Marco Crivellari	7a4f92d39f	fs: replace use of system_unbound_wq with system_dfl_wq Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://lore.kernel.org/20250916082906.77439-2-marco.crivellari@suse.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-19 16:15:07 +02:00
Filipe Manana	80eb65ccf6	btrfs: annotate block group access with data_race() when sorting for reclaim When sorting the block group list for reclaim we are using a block group's used bytes counter without taking the block group's spinlock, so we can race with a concurrent task updating it (at btrfs_update_block_group()), which makes tools like KCSAN unhappy and report a race. Since the sorting is not strictly needed from a functional perspective and such races should rarely cause any ordering changes (only load/store tearing could cause them), not to mention that after the sorting the ordering may no longer be accurate due to concurrent allocations and deallocations of extents in a block group, annotate the accesses to the used counter with data_race() to silence KCSAN and similar tools. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-09-15 05:25:43 +02:00
Filipe Manana	55fae08a06	btrfs: unfold transaction aborts when writing dirty block groups We have a single transaction abort call that can be due to an error from one of two calls to update_block_group_item(). Unfold the transaction abort calls so that if they happen we know which update_block_group_item() call failed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 01:14:08 +02:00
Naohiro Aota	62be7afcc1	btrfs: zoned: requeue to unused block group list if zone finish failed btrfs_zone_finish() can fail for several reason. If it is -EAGAIN, we need to try it again later. So, put the block group to the retry list properly. Failing to do so will keep the removable block group intact until remount and can causes unnecessary ENOSPC. Fixes: `74e91b12b1` ("btrfs: zoned: zone finish unused block group") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:23 +02:00
Naohiro Aota	3061801420	btrfs: zoned: do not remove unwritten non-data block group There are some reports of "unable to find chunk map for logical 2147483648 length 16384" error message appears in dmesg. This means some IOs are occurring after a block group is removed. When a metadata tree node is cleaned on a zoned setup, we keep that node still dirty and write it out not to create a write hole. However, this can make a block group's used bytes == 0 while there is a dirty region left. Such an unused block group is moved into the unused_bg list and processed for removal. When the removal succeeds, the block group is removed from the transaction->dirty_bgs list, so the unused dirty nodes in the block group are not sent at the transaction commit time. It will be written at some later time e.g, sync or umount, and causes "unable to find chunk map" errors. This can happen relatively easy on SMR whose zone size is 256MB. However, calling do_zone_finish() on such block group returns -EAGAIN and keep that block group intact, which is why the issue is hidden until now. Fixes: `afba2bc036` ("btrfs: zoned: implement active zone tracking") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:23 +02:00
Filipe Manana	d6be378de0	btrfs: remove btrfs_clear_extent_bits() It's just a simple wrapper around btrfs_clear_extent_bit() that passes a NULL for its last argument (a cached extent state record), plus there is not counter part - we have a btrfs_set_extent_bit() but we do not have a btrfs_set_extent_bits() (plural version). So just remove it and make all callers use btrfs_clear_extent_bit() directly. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Johannes Thumshirn	5ae011bcbb	btrfs: don't print relocation messages from auto reclaim When BTRFS is doing automatic block-group reclaim, it is spamming the kernel log messages a lot. Add a 'verbose' parameter to btrfs_relocate_chunk() and btrfs_relocate_block_group() to control the verbosity of these log message. This way the old behaviour of printing log messages on a user-space initiated balance operation can be kept while excessive log spamming due to auto reclaim is mitigated. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Johannes Thumshirn	a507904090	btrfs: remove redundant auto reclaim log message Remove the log message before reclaiming a chunk in btrfs_reclaim_bgs_work(). Especially with automatic block-group reclaiming these messages spam the kernel log. Note there is also a tracepoint for the same condition to ease debugging. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-22 00:09:22 +02:00
Johannes Thumshirn	9669fcb77e	btrfs: change dump_block_groups() in btrfs_dump_space_info() from int to bool btrfs_dump_space_info()'s parameter dump_block_groups is used as a boolean although it is defined as an integer. Change it from int to bool. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:05 +02:00
Filipe Manana	6fc5ef7829	btrfs: add btrfs prefix to free space tree exported functions A few of the free space tree exported functions have a 'btrfs_' prefix in their name, but most don't. Not only is this inconsistent, the preferred style is to have such a prefix, to avoid potential collisions in the future with other kernel code and offer a somewhat better readibility by making it obvious in calls sites that we are calling btrfs specific code. So add the 'btrfs_' prefix to all free space tree functions that are missing it. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-07-21 23:58:02 +02:00
Filipe Manana	ba4ec9a5a0	btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes() We are using an integer for the 'delalloc' argument but all we need is a boolean, so switch the type to 'bool' and rename the parameter to 'is_delalloc' to better match the fact that it's a boolean. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:56 +02:00
Filipe Manana	a20f732822	btrfs: simplify getting and extracting previous transaction at clean_pinned_extents() Instead of detecting if there is a previous transaction by comparing the current transaction's list prev member to the head of the transaction list (fs_info->trans_list), use the list_is_first() helper which contains that logic and the naming makes sense since a new transaction is always added to the end of the list fs_info->trans_list with list_add_tail(). We are also extracting the previous transaction with list_last_entry() against the transaction, which is correct but confusing because that function is usually meant to be used against a pointer to the start of a list and not a member of a list. It is easier to reason by either calling list_first_entry() against the list fs_info->trans_list, since we can never have more than two transactions in the list, or by calling list_prev_entry() against the transaction. So change that to use the later method. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:54 +02:00
Naohiro Aota	cc0517fe77	btrfs: tweak extent/chunk allocation for space_info sub-space Make the extent allocator and the chunk allocator aware of the sub-space. It now uses BTRFS_SUB_GROUP_DATA_RELOC sub-space for data relocation block group, and uses BTRFS_SUB_GROUP_TREELOG for metadata tree-log block group. And, it needs to check the space_info is the right one when a block group candidate is given. Also, new block group should now belong to the specified one. Now that, block_group->space_info is always set before btrfs_add_bg_to_space_info(), we no longer need to "find" the space_info. So, rename the variable name to address that as well. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:53 +02:00
Naohiro Aota	f92ee31e03	btrfs: introduce btrfs_space_info sub-group Current code assumes we have only one space_info for each block group type (DATA, METADATA, and SYSTEM). We sometime need multiple space infos to manage special block groups. One example is handling the data relocation block group for the zoned mode. That block group is dedicated for writing relocated data and we cannot allocate any regular extent from that block group, which is implemented in the zoned extent allocator. This block group still belongs to the normal data space_info. So, when all the normal data block groups are full and there is some free space in the dedicated block group, the space_info looks to have some free space, while it cannot allocate normal extent anymore. That results in a strange ENOSPC error. We need to have a space_info for the relocation data block group to represent the situation properly. Adds a basic infrastructure for having a "sub-group" of a space_info: creation and removing. A sub-group space_info belongs to one of the primary space_infos and has the same flags as its parent. This commit first introduces the relocation data sub-space_info, and the next commit will introduce tree-log sub-space_info. In the future, it could be useful to implement tiered storage for btrfs e.g. by implementing a sub-group space_info for block groups resides on a fast storage. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:53 +02:00
Naohiro Aota	4d5a047e07	btrfs: add space_info parameter for block group creation Add struct btrfs_space_info parameter to btrfs_make_block_group(), its related functions and related struct. Passed space_info will have a new block group. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:53 +02:00
Naohiro Aota	098a442d5b	btrfs: add space_info argument to btrfs_chunk_alloc() Take a btrfs_space_info argument in btrfs_chunk_alloc(). New block group will belong to that space_info. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:53 +02:00
Naohiro Aota	1cfdbe0d53	btrfs: factor out check_removing_space_info() from btrfs_free_block_groups() Factor out check_removing_space_info() from btrfs_free_block_groups(). It sanity checks a to-be-removed space_info. There is no functional change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:53 +02:00
David Sterba	f963e0128b	btrfs: trivial conversion to return bool instead of int Old code has a lot of int for bool return values, bool is recommended and done in new code. Convert the trivial cases that do simple 0/false and 1/true. Functions comment are updated if needed. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2025-05-15 14:30:49 +02:00

1 2 3 4 5 ...

319 Commits