Commit Graph

319 Commits

Author SHA1 Message Date
Sun YangKai
19eff93dc7 btrfs: fix periodic reclaim condition
Problems with current implementation:

1. reclaimable_bytes is signed while chunk_sz is unsigned, causing
   negative reclaimable_bytes to trigger reclaim unexpectedly

2. The "space must be freed between scans" assumption breaks the
   two-scan requirement: first scan marks block groups, second scan
   reclaims them. Without the second scan, no reclamation occurs.

Instead, track actual reclaim progress: pause reclaim when block groups
will be reclaimed, and resume only when progress is made. This ensures
reclaim continues until no further progress can be made. And resume
periodic reclaim when there's enough free space.

And we take care if reclaim is making any progress now, so it's
unnecessary to set periodic_reclaim_ready to false when failed to reclaim
a block group.

Fixes: 813d4c6422 ("btrfs: prevent pathological periodic reclaim loops")
CC: stable@vger.kernel.org # 6.12+
Suggested-by: Boris Burkov <boris@bur.io>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:37 +01:00
Filipe Manana
4ac81c3811 btrfs: use the btrfs_block_group_end() helper everywhere
We have a helper to calculate a block group's exclusive end offset, but
we only use it in some places. Update every site that open codes the
calculation to use the helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:36 +01:00
Mark Harmstone
2aef934b56 btrfs: populate fully_remapped_bgs_list on mount
Add a function btrfs_populate_fully_remapped_bgs_list() which gets
called on mount, which looks for fully remapped block groups
(i.e. identity_remap_count == 0) which haven't yet had their chunk
stripes and device extents removed.

This happens when a filesystem is unmounted while async discard has not
yet finished, as otherwise the data range occupied by the chunk stripes
would be permanently unusable.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:36 +01:00
Mark Harmstone
7cddbb4339 btrfs: handle discarding fully-remapped block groups
Discard normally works by iterating over the free-space entries of a
block group. This doesn't work for fully-remapped block groups, as we
removed their free-space entries when we started relocation.

For sync discard, call btrfs_discard_extent() when we commit the
transaction in which the last identity remap was removed.

For async discard, add a new function btrfs_trim_fully_remapped_block_group()
to be called by the discard worker, which iterates over the block
group's range using the normal async discard rules. Once we reach the
end, remove the chunk's stripes and device extents to get back its free
space.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:36 +01:00
Mark Harmstone
b56f35560b btrfs: handle setting up relocation of block group with remap-tree
Handle the preliminary work for relocating a block group in a filesystem
with the remap-tree flag set.

If the block group is SYSTEM btrfs_relocate_block_group() proceeds as it
does already, as bootstrapping issues mean that these block groups have
to be processed the existing way. Similarly with METADATA_REMAP blocks, which
are dealt with in a later patch.

Otherwise we walk the free-space tree for the block group in question,
recording any holes. These get converted into identity remaps and placed
in the remap tree, and the block group's REMAPPED flag is set. From now
on no new allocations are possible within this block group, and any I/O
to it will be funnelled through btrfs_translate_remap(). We store the
number of identity remaps in `identity_remap_count`, so that we know
when we've removed the last one and the block group is fully remapped.

The change in btrfs_read_roots() is because data relocations no longer
rely on the data reloc tree as a hidden subvolume in which to do
snapshots.

(Thanks to Sun YangKai for his suggestions.)

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:35 +01:00
Mark Harmstone
979e1dc3d6 btrfs: handle deletions from remapped block group
Handle the case where we free an extent from a block group that has the
REMAPPED flag set. Because the remap tree is orthogonal to the extent
tree, for data this may be within any number of identity remaps or
actual remaps. If we're freeing a metadata node, this will be wholly
inside one or the other.

btrfs_remove_extent_from_remap_tree() searches the remap tree for the
remaps that cover the range in question, then calls
remove_range_from_remap_tree() for each one, to punch a hole in the
remap and adjust the free-space tree.

For an identity remap, remove_range_from_remap_tree() will adjust the
block group's `identity_remap_count` if this changes. If it reaches
zero we mark the block group as fully remapped.

For an identity remap, remove_range_from_remap_tree() will adjust the
block group's `identity_remap_count` if this changes. If it reaches
zero we mark the block group as fully remapped.

Fully remapped block groups have their chunk stripes removed and their
device extents freed, which makes the disk space available again to the
chunk allocator. This happens asynchronously: in the cleaner thread for
sync discard and nodiscard, and (in a later patch) in the discard worker
for async discard.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:35 +01:00
Mark Harmstone
7977011460 btrfs: add extended version of struct block_group_item
Add a struct btrfs_block_group_item_v2, which is used in the block group
tree if the remap-tree incompat flag is set.

This adds two new fields to the block group item: `remap_bytes` and
`identity_remap_count`.

`remap_bytes` records the amount of data that's physically within this
block group, but nominally in another, remapped block group. This is
necessary because this data will need to be moved first if this block
group is itself relocated. If `remap_bytes` > 0, this is an indicator to
the relocation thread that it will need to search the remap-tree for
backrefs. A block group must also have `remap_bytes` == 0 before it can
be dropped.

`identity_remap_count` records how many identity remap items are located
in the remap tree for this block group. When relocation is begun for
this block group, this is set to the number of holes in the free-space
tree for this range. As identity remaps are converted into actual remaps
by the relocation process, this number is decreased. Once it reaches 0,
either because of relocation or because extents have been deleted, the
block group has been fully remapped and its chunk's device extents are
removed.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:34 +01:00
Mark Harmstone
bf8ff4b9f0 btrfs: rename struct btrfs_block_group field commit_used to last_used
Rename the field commit_used in struct btrfs_block_group to last_used,
for clarity and consistency with the similar fields we're about to add.
It's not obvious that commit_flags means "flags as of the last commit"
rather than "flags related to a commit".

Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:34 +01:00
Mark Harmstone
76377db18a btrfs: remove remapped block groups from the free-space-tree
No new allocations can be done from block groups that have the REMAPPED
flag set, so there's no value in their having entries in the free-space
tree.

Prevent a search through the free-space tree being scheduled for such a
block group, and prevent any additions to the in-memory free-space tree.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:34 +01:00
Filipe Manana
c208aa0ef6 btrfs: add and use helper to compute the available space for a block group
We have currently three places that compute how much available space a
block group has. Add a helper function for this and use it in those
places.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:51:44 +01:00
David Sterba
e582f22030 btrfs: split btrfs_fs_closing() and change return type to bool
There are two tests in btrfs_fs_closing() but checking the
BTRFS_FS_CLOSING_DONE bit is done only in one place
load_extent_tree_free(). As this is an inline we can reduce size of the
generated code. The types can be also changed to bool as this becomes a
simple condition.

   text    data     bss     dec     hex filename
1674006  146704   15560 1836270  1c04ee pre/btrfs.ko
1673772  146704   15560 1836036  1c0404 post/btrfs.ko

DELTA: -234

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:51:42 +01:00
Johannes Thumshirn
a464ed9834 btrfs: rename btrfs_create_block_group_cache to btrfs_create_block_group
struct btrfs_block_group used to be called struct
btrfs_block_group_cache but got renamed to btrfs_block_group with
commit 32da5386d9 ("btrfs: rename btrfs_block_group_cache").

Rename btrfs_create_block_group_cache() to btrfs_create_block_group() to
reflect that change.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:48:44 +01:00
David Sterba
4b117be65f btrfs: merge setting ret and return ret
In many places we have pattern:

	ret = ...;
	return ret;

This can be simplified to a direct return, removing 'ret' if not
otherwise needed. The places in self tests are not converted so we can
add more test cases without changing surrounding code
(extent-map-tests.c:test_case_4()).

Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 06:38:33 +01:00
Linus Torvalds
7696286034 Merge tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "Features:

   - shutdown ioctl support (needs CONFIG_BTRFS_EXPERIMENTAL for now):
      - set filesystem state as being shut down (also named going down
        in other filesystems), where all active operations return EIO
        and this cannot be changed until unmount
      - pending operations are attempted to be finished but error
        messages may still show up depending on where exactly the
        shutdown happened

   - scrub (and device replace) vs suspend/hibernate:
      - a running scrub will prevent suspend, which can be annoying as
        suspend is an immediate request and scrub is not critical
      - filesystem freezing before suspend was not sufficient as the
        problem was in process freezing
      - behaviour change: on suspend scrub and device replace are
        cancelled, where scrub can record the last state and continue
        from there; the device replace has to be restarted from the
        beginning

   - zone stats exported in sysfs, from the perspective of the
     filesystem this includes active, reclaimable, relocation etc zones

  Performance:

   - improvements when processing space reservation tickets by
     optimizing locking and shrinking critical sections, cumulative
     improvements in lockstat numbers show +15%

  Notable fixes:

   - use vmalloc fallback when allocating bios as high order allocations
     can happen with wide checksums (like sha256)

   - scrub will always track the last position of progress so it's not
     starting from zero after an error

  Core:

   - under experimental config, checksum calculations are offloaded to
     process context, simplifies locking and allows to remove
     compression write worker kthread(s):
      - speed improvement in direct IO throughput with buffered IO
        fallback is +15% when not offloaded but this is more related to
        internal crypto subsystem improvements
      - this will be probably default in the future removing the sysfs
        tunable

   - (experimental) block size > page size updates:
      - support more operations when not using large folios (encoded
        read/write and send)
      - raid56

   - more preparations for fscrypt support

  Other:

   - more conversions to auto-cleaned variables

   - parameter cleanups and removals

   - extended warning fixes

   - improved printing of structured values like keys

   - lots of other cleanups and refactoring"

* tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits)
  btrfs: remove unnecessary inode key in btrfs_log_all_parents()
  btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root()
  btrfs: remaining BTRFS_PATH_AUTO_FREE conversions
  btrfs: send: do not allocate memory for xattr data when checking it exists
  btrfs: send: add unlikely to all unexpected overflow checks
  btrfs: reduce arguments to btrfs_del_inode_ref_in_log()
  btrfs: remove root argument from btrfs_del_dir_entries_in_log()
  btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref()
  btrfs: don't search back for dir inode item in INO_LOOKUP_USER
  btrfs: don't rewrite ret from inode_permission
  btrfs: add orig_logical to btrfs_bio for encryption
  btrfs: disable verity on encrypted inodes
  btrfs: disable various operations on encrypted inodes
  btrfs: remove redundant level reset in btrfs_del_items()
  btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf()
  btrfs: optimize balance_level() path reference handling
  btrfs: factor out root promotion logic into promote_child_to_root()
  btrfs: raid56: remove the "_step" infix
  btrfs: raid56: enable bs > ps support
  btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases
  ...
2025-12-03 20:03:46 -08:00
David Sterba
10934c131f btrfs: remaining BTRFS_PATH_AUTO_FREE conversions
Do the remaining btrfs_path conversion to the auto cleaning, this seems
to be the last one. Most of the conversions are trivial, only adding the
declaration and removing the freeing, or changing the goto patterns to
return.

There are some functions with many changes, like __btrfs_free_extent(),
btrfs_remove_from_free_space_tree() or btrfs_add_to_free_space_tree()
but it still follows the same pattern.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25 01:53:33 +01:00
Filipe Manana
e21756fc4a btrfs: use booleans for delalloc arguments and struct find_free_extent_ctl
The struct find_free_extent_ctl uses an int for the 'delalloc' field but
it's always used as a boolean, and its value is used to be passed to
several functions to signal if we are dealing with delalloc. The same goes
for the 'is_data' argument from btrfs_reserve_extent(). So change the type
from int to bool and move the field definition in the find_free_extent_ctl
structure so that it's close to other bool fields and reduces the size of
the structure from 144 down to 136 bytes (at the moment it's only declared
in the stack of btrfs_reserve_extent(), never allocated otherwise).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:26 +01:00
Filipe Manana
d7fe41044b btrfs: use bool type for btrfs_path members used as booleans
Many fields of struct btrfs_path are used as booleans but their type is
an unsigned int (of one 1 bit width to save space). Change the type to
bool keeping the :1 suffix so that they combine with the previous u8
fields in order to save space. This makes the code more clear by using
explicit true/false and more in line with the preferred style, preserving
the size of the structure.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:25 +01:00
Filipe Manana
a270cb420c btrfs: reduce block group critical section in btrfs_add_reserved_bytes()
We are doing some things inside the block group's critical section that
are relevant only to the space_info: updating the space_info counters
bytes_reserved and bytes_may_use as well as trying to grant tickets
(calling btrfs_try_granting_tickets()), and this later can take some
time. So move all those updates to outside the block group's critical
section and still inside the space_info's critical section. Like this
we keep the block group's critical section only for block group updates
and can help reduce contention on a block group's lock.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:14:40 +01:00
Filipe Manana
8b6fa164ab btrfs: reduce block group critical section in btrfs_free_reserved_bytes()
There's no need to update the space_info fields (bytes_reserved,
max_extent_size, bytes_readonly, bytes_zone_unusable) while holding the
block group's spinlock. So move those updates to happen after we unlock
the block group (and while holding the space_info locked of course), so
that all we do under the block group's critical section is to update the
block group itself.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:14:06 +01:00
Filipe Manana
f7a32dd2a6 btrfs: reduce space_info critical section in btrfs_chunk_alloc()
There's no need to update local variables while holding the space_info's
spinlock, since the update isn't using anything from the space_info. So
move these updates outside the critical section to shorten it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:13:25 +01:00
Filipe Manana
a232ff90d1 btrfs: remove fs_info argument from btrfs_zoned_activate_one_bg()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:02:57 +01:00
Filipe Manana
e96059c9d7 btrfs: remove fs_info argument from btrfs_dump_space_info()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:10 +01:00
Filipe Manana
78a77f4da4 btrfs: remove fs_info argument from btrfs_can_overcommit()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:10 +01:00
Filipe Manana
e3df6408b1 btrfs: remove fs_info argument from btrfs_try_granting_tickets()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:09 +01:00
Boris Burkov
38e818718c btrfs: fix racy bitfield write in btrfs_clear_space_info_full()
From the memory-barriers.txt document regarding memory barrier ordering
guarantees:

 (*) These guarantees do not apply to bitfields, because compilers often
     generate code to modify these using non-atomic read-modify-write
     sequences.  Do not attempt to use bitfields to synchronize parallel
     algorithms.

 (*) Even in cases where bitfields are protected by locks, all fields
     in a given bitfield must be protected by one lock.  If two fields
     in a given bitfield are protected by different locks, the compiler's
     non-atomic read-modify-write sequences can cause an update to one
     field to corrupt the value of an adjacent field.

btrfs_space_info has a bitfield sharing an underlying word consisting of
the fields full, chunk_alloc, and flush:

struct btrfs_space_info {
        struct btrfs_fs_info *     fs_info;              /*     0     8 */
        struct btrfs_space_info *  parent;               /*     8     8 */
        ...
        int                        clamp;                /*   172     4 */
        unsigned int               full:1;               /*   176: 0  4 */
        unsigned int               chunk_alloc:1;        /*   176: 1  4 */
        unsigned int               flush:1;              /*   176: 2  4 */
        ...

Therefore, to be safe from parallel read-modify-writes losing a write to
one of the bitfield members protected by a lock, all writes to all the
bitfields must use the lock. They almost universally do, except for
btrfs_clear_space_info_full() which iterates over the space_infos and
writes out found->full = 0 without a lock.

Imagine that we have one thread completing a transaction in which we
finished deleting a block_group and are thus calling
btrfs_clear_space_info_full() while simultaneously the data reclaim
ticket infrastructure is running do_async_reclaim_data_space():

          T1                                             T2
btrfs_commit_transaction
  btrfs_clear_space_info_full
  data_sinfo->full = 0
  READ: full:0, chunk_alloc:0, flush:1
                                              do_async_reclaim_data_space(data_sinfo)
                                              spin_lock(&space_info->lock);
                                              if(list_empty(tickets))
                                                space_info->flush = 0;
                                                READ: full: 0, chunk_alloc:0, flush:1
                                                MOD/WRITE: full: 0, chunk_alloc:0, flush:0
                                                spin_unlock(&space_info->lock);
                                                return;
  MOD/WRITE: full:0, chunk_alloc:0, flush:1

and now data_sinfo->flush is 1 but the reclaim worker has exited. This
breaks the invariant that flush is 0 iff there is no work queued or
running. Once this invariant is violated, future allocations that go
into __reserve_bytes() will add tickets to space_info->tickets but will
see space_info->flush is set to 1 and not queue the work. After this,
they will block forever on the resulting ticket, as it is now impossible
to kick the worker again.

I also confirmed by looking at the assembly of the affected kernel that
it is doing RMW operations. For example, to set the flush (3rd) bit to 0,
the assembly is:
  andb    $0xfb,0x60(%rbx)
and similarly for setting the full (1st) bit to 0:
  andb    $0xfe,-0x20(%rax)

So I think this is really a bug on practical systems.  I have observed
a number of systems in this exact state, but am currently unable to
reproduce it.

Rather than leaving this footgun lying around for the future, take
advantage of the fact that there is room in the struct anyway, and that
it is already quite large and simply change the three bitfield members to
bools. This avoids writes to space_info->full having any effect on
writes to space_info->flush, regardless of locking.

Fixes: 957780eb27 ("Btrfs: introduce ticketed enospc infrastructure")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:38:49 +01:00
Christian Brauner
a5e3d0be9e btrfs: use super write guard in btrfs_reclaim_bgs_work()
Link: https://patch.msgid.link/20251104-work-guards-v1-2-5108ac78a171@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05 22:52:15 +01:00
Linus Torvalds
f3827213ab Merge tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "There are no new features, the changes are in the core code, notably
  tree-log error handling and reporting improvements, and initial
  support for block size > page size.

  Performance improvements:

   - search data checksums in the commit root (previous transaction) to
     avoid locking contention, this improves parallelism of read
     heavy/low write workloads, and also reduces transaction commit
     time; on real and reproducer workload the sync time went from
     minutes to tens of seconds (workload and numbers are in the
     changelog)

  Core:

   - tree-log updates:
      - error handling improvements, transaction aborts
      - add new error state 'O' (printed in status messages) when log
        replay fails and is aborted
      - reduced number of btrfs_path allocations when traversing the
        tree

   - 'block size > page size' support
      - basic implementation with limitations, under experimental build
      - limitations: no direct io, raid56, encoded read (standalone and
        in send ioctl), encoded write
      - preparatory work for compression, removing implicit assumptions
        of page and block sizes
      - compression workspaces are now per-filesystem, we cannot assume
        common block size for work memory among different filesystems

   - tree-checker now verifies INODE_EXTREF item (which is implementing
     hardlinks)

   - tree leaf pretty printer updates, there were missing data from
     items, keys/items

   - move config option CONFIG_BTRFS_REF_VERIFY to CONFIG_BTRFS_DEBUG,
     it's a debugging feature and not needed to be enabled separately

   - more struct btrfs_path auto free updates

   - use ref_tracker API for tracking delayed inodes, enabled by mount
     option 'ref_verify', allowing to better pinpoint leaking references

   - in zoned mode, avoid selecting data relocation zoned for ordinary
     data block groups

   - updated and enhanced error messages

   - lots of cleanups and refactoring"

* tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (113 commits)
  btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot()
  btrfs: add unlikely annotations to branches leading to transaction abort
  btrfs: add unlikely annotations to branches leading to EIO
  btrfs: add unlikely annotations to branches leading to EUCLEAN
  btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions
  btrfs: zoned: don't fail mount needlessly due to too many active zones
  btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc()
  btrfs: enable experimental bs > ps support
  btrfs: add extra ASSERT()s to catch unaligned bios
  btrfs: fix symbolic link reading when bs > ps
  btrfs: prepare scrub to support bs > ps cases
  btrfs: prepare zlib to support bs > ps cases
  btrfs: prepare lzo to support bs > ps cases
  btrfs: prepare zstd to support bs > ps cases
  btrfs: prepare compression folio alloc/free for bs > ps cases
  btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range()
  btrfs: remove pointless key offset setup in create_pending_snapshot()
  btrfs: annotate btrfs_is_testing() as unlikely and make it return bool
  btrfs: make the rule checking more readable for should_cow_block()
  btrfs: simplify inline extent end calculation at replay_one_extent()
  ...
2025-09-30 08:14:49 -07:00
Linus Torvalds
b786405685 Merge tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs workqueue updates from Christian Brauner:
 "This contains various workqueue changes affecting the filesystem
  layer.

  Currently if a user enqueue a work item using schedule_delayed_work()
  the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
  WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies
  to schedule_work() that is using system_wq and queue_work(), that
  makes use again of WORK_CPU_UNBOUND.

  This replaces the use of system_wq and system_unbound_wq. system_wq is
  a per-CPU workqueue which isn't very obvious from the name and
  system_unbound_wq is to be used when locality is not required.

  So this renames system_wq to system_percpu_wq, and system_unbound_wq
  to system_dfl_wq.

  This also adds a new WQ_PERCPU flag to allow the fs subsystem users to
  explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and
  WQ_PERCPU flags coexist for one release cycle to allow callers to
  transition their calls. WQ_UNBOUND will be removed in a next release
  cycle"

* tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: WQ_PERCPU added to alloc_workqueue users
  fs: replace use of system_wq with system_percpu_wq
  fs: replace use of system_unbound_wq with system_dfl_wq
2025-09-29 10:27:17 -07:00
David Sterba
a929904cf7 btrfs: add unlikely annotations to branches leading to transaction abort
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen.

Transaction abort is one such error, the btrfs_abort_transaction()
inlines code to check the state and print a warning, this ought to be
out of the hot path.

The most common pattern is when transaction abort is called after
checking a return value and the control flow leads to a quick return.
In other cases it may not be necessary to add unlikely() e.g. when the
function returns anyway or the control flow is not changed noticeably.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:26 +02:00
David Sterba
9264d004a6 btrfs: add unlikely annotations to branches leading to EUCLEAN
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EUCLEAN (a corruption) is one of them.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:26 +02:00
David Sterba
17dc82dc1e btrfs: fix typos in comments and strings
Annual typo fixing pass. Strangely codespell found only about 30% of
what is in this patch, the rest was done manually using text
spellchecker with a custom dictionary of acceptable terms.

Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:16 +02:00
David Sterba
67e78f983e btrfs: convert several int parameters to bool
We're almost done cleaning misused int/bool parameters. Convert a bunch
of them, found by manual grepping.  Note that btrfs_sync_fs() needs an
int as it's mandated by the struct super_operations prototype.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:32 +02:00
Marco Crivellari
7a4f92d39f fs: replace use of system_unbound_wq with system_dfl_wq
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.

This lack of consistentcy cannot be addressed without refactoring the API.

system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.

Adding system_dfl_wq to encourage its use when unbound work should be used.

The old system_unbound_wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/20250916082906.77439-2-marco.crivellari@suse.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 16:15:07 +02:00
Filipe Manana
80eb65ccf6 btrfs: annotate block group access with data_race() when sorting for reclaim
When sorting the block group list for reclaim we are using a block group's
used bytes counter without taking the block group's spinlock, so we can
race with a concurrent task updating it (at btrfs_update_block_group()),
which makes tools like KCSAN unhappy and report a race.

Since the sorting is not strictly needed from a functional perspective
and such races should rarely cause any ordering changes (only load/store
tearing could cause them), not to mention that after the sorting the
ordering may no longer be accurate due to concurrent allocations and
deallocations of extents in a block group, annotate the accesses to the
used counter with data_race() to silence KCSAN and similar tools.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-15 05:25:43 +02:00
Filipe Manana
55fae08a06 btrfs: unfold transaction aborts when writing dirty block groups
We have a single transaction abort call that can be due to an error from
one of two calls to update_block_group_item(). Unfold the transaction
abort calls so that if they happen we know which update_block_group_item()
call failed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 01:14:08 +02:00
Naohiro Aota
62be7afcc1 btrfs: zoned: requeue to unused block group list if zone finish failed
btrfs_zone_finish() can fail for several reason. If it is -EAGAIN, we need
to try it again later. So, put the block group to the retry list properly.

Failing to do so will keep the removable block group intact until remount
and can causes unnecessary ENOSPC.

Fixes: 74e91b12b1 ("btrfs: zoned: zone finish unused block group")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:23 +02:00
Naohiro Aota
3061801420 btrfs: zoned: do not remove unwritten non-data block group
There are some reports of "unable to find chunk map for logical 2147483648
length 16384" error message appears in dmesg. This means some IOs are
occurring after a block group is removed.

When a metadata tree node is cleaned on a zoned setup, we keep that node
still dirty and write it out not to create a write hole. However, this can
make a block group's used bytes == 0 while there is a dirty region left.

Such an unused block group is moved into the unused_bg list and processed
for removal. When the removal succeeds, the block group is removed from the
transaction->dirty_bgs list, so the unused dirty nodes in the block group
are not sent at the transaction commit time. It will be written at some
later time e.g, sync or umount, and causes "unable to find chunk map"
errors.

This can happen relatively easy on SMR whose zone size is 256MB. However,
calling do_zone_finish() on such block group returns -EAGAIN and keep that
block group intact, which is why the issue is hidden until now.

Fixes: afba2bc036 ("btrfs: zoned: implement active zone tracking")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:23 +02:00
Filipe Manana
d6be378de0 btrfs: remove btrfs_clear_extent_bits()
It's just a simple wrapper around btrfs_clear_extent_bit() that passes a
NULL for its last argument (a cached extent state record), plus there is
not counter part - we have a btrfs_set_extent_bit() but we do not have a
btrfs_set_extent_bits() (plural version). So just remove it and make all
callers use btrfs_clear_extent_bit() directly.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:22 +02:00
Johannes Thumshirn
5ae011bcbb btrfs: don't print relocation messages from auto reclaim
When BTRFS is doing automatic block-group reclaim, it is spamming the
kernel log messages a lot.

Add a 'verbose' parameter to btrfs_relocate_chunk() and
btrfs_relocate_block_group() to control the verbosity of these log
message. This way the old behaviour of printing log messages on a
user-space initiated balance operation can be kept while excessive log
spamming due to auto reclaim is mitigated.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:22 +02:00
Johannes Thumshirn
a507904090 btrfs: remove redundant auto reclaim log message
Remove the log message before reclaiming a chunk in
btrfs_reclaim_bgs_work(). Especially with automatic block-group
reclaiming these messages spam the kernel log.

Note there is also a tracepoint for the same condition to ease debugging.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:22 +02:00
Johannes Thumshirn
9669fcb77e btrfs: change dump_block_groups() in btrfs_dump_space_info() from int to bool
btrfs_dump_space_info()'s parameter dump_block_groups is used as a boolean
although it is defined as an integer.

Change it from int to bool.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:05 +02:00
Filipe Manana
6fc5ef7829 btrfs: add btrfs prefix to free space tree exported functions
A few of the free space tree exported functions have a 'btrfs_' prefix in
their name, but most don't. Not only is this inconsistent, the preferred
style is to have such a prefix, to avoid potential collisions in the
future with other kernel code and offer a somewhat better readibility by
making it obvious in calls sites that we are calling btrfs specific code.

So add the 'btrfs_' prefix to all free space tree functions that are
missing it.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:02 +02:00
Filipe Manana
ba4ec9a5a0 btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes()
We are using an integer for the 'delalloc' argument but all we need is a
boolean, so switch the type to 'bool' and rename the parameter to
'is_delalloc' to better match the fact that it's a boolean.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:56 +02:00
Filipe Manana
a20f732822 btrfs: simplify getting and extracting previous transaction at clean_pinned_extents()
Instead of detecting if there is a previous transaction by comparing the
current transaction's list prev member to the head of the transaction
list (fs_info->trans_list), use the list_is_first() helper which contains
that logic and the naming makes sense since a new transaction is always
added to the end of the list fs_info->trans_list with list_add_tail().

We are also extracting the previous transaction with list_last_entry()
against the transaction, which is correct but confusing because that
function is usually meant to be used against a pointer to the start of a
list and not a member of a list. It is easier to reason by either calling
list_first_entry() against the list fs_info->trans_list, since we can
never have more than two transactions in the list, or by calling
list_prev_entry() against the transaction. So change that to use the later
method.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:54 +02:00
Naohiro Aota
cc0517fe77 btrfs: tweak extent/chunk allocation for space_info sub-space
Make the extent allocator and the chunk allocator aware of the sub-space.
It now uses BTRFS_SUB_GROUP_DATA_RELOC sub-space for data relocation block
group, and uses BTRFS_SUB_GROUP_TREELOG for metadata tree-log block group.

And, it needs to check the space_info is the right one when a block group
candidate is given. Also, new block group should now belong to the
specified one.

Now that, block_group->space_info is always set before
btrfs_add_bg_to_space_info(), we no longer need to "find" the space_info.
So, rename the variable name to address that as well.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:53 +02:00
Naohiro Aota
f92ee31e03 btrfs: introduce btrfs_space_info sub-group
Current code assumes we have only one space_info for each block group type
(DATA, METADATA, and SYSTEM). We sometime need multiple space infos to
manage special block groups.

One example is handling the data relocation block group for the zoned mode.
That block group is dedicated for writing relocated data and we cannot
allocate any regular extent from that block group, which is implemented in
the zoned extent allocator. This block group still belongs to the normal
data space_info. So, when all the normal data block groups are full and
there is some free space in the dedicated block group, the space_info
looks to have some free space, while it cannot allocate normal extent
anymore. That results in a strange ENOSPC error. We need to have a
space_info for the relocation data block group to represent the situation
properly.

Adds a basic infrastructure for having a "sub-group" of a space_info:
creation and removing. A sub-group space_info belongs to one of the
primary space_infos and has the same flags as its parent.

This commit first introduces the relocation data sub-space_info, and the
next commit will introduce tree-log sub-space_info. In the future, it could
be useful to implement tiered storage for btrfs e.g. by implementing a
sub-group space_info for block groups resides on a fast storage.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:53 +02:00
Naohiro Aota
4d5a047e07 btrfs: add space_info parameter for block group creation
Add struct btrfs_space_info parameter to btrfs_make_block_group(), its
related functions and related struct. Passed space_info will have a new
block group.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:53 +02:00
Naohiro Aota
098a442d5b btrfs: add space_info argument to btrfs_chunk_alloc()
Take a btrfs_space_info argument in btrfs_chunk_alloc(). New block group
will belong to that space_info.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:53 +02:00
Naohiro Aota
1cfdbe0d53 btrfs: factor out check_removing_space_info() from btrfs_free_block_groups()
Factor out check_removing_space_info() from btrfs_free_block_groups(). It
sanity checks a to-be-removed space_info. There is no functional change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:53 +02:00
David Sterba
f963e0128b btrfs: trivial conversion to return bool instead of int
Old code has a lot of int for bool return values, bool is recommended
and done in new code. Convert the trivial cases that do simple 0/false
and 1/true. Functions comment are updated if needed.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00