Commit 522544fc authored May 26, 2025 by Linus Torvalds

bcachefs

Pull bcachefs updates from Kent Overstreet:

 - Poisoned extents can now be moved: this lets us handle bitrotted data
   without deleting it. For now, reading from poisoned extents only
   returns -EIO: in the future we'll have an API for specifying "read
   this data even if there were bitflips".

 - Incompatible features may now be enabled at runtime, via
   "opts/version_upgrade" in sysfs. Toggle it to incompatible, and then
   toggle it back - option changes via the sysfs interface are
   persistent.

 - Various changes to support deployable disk images:

     - RO mounts now use less memory

     - Images may be stripped of alloc info, particularly useful for
       slimming them down if they will primarily be mounted RO. Alloc
       info will be automatically regenerated on first RW mount, and
       this is quite fast

     - Filesystem images generated with 'bcachefs image' will be
       automatically resized the first time they're mounted on a larger
       device

   The images 'bcachefs image' generates with compression enabled have
   been comparable in size to those generated by squashfs and erofs -
   but you get a full RW capable filesystem

 - Major error message improvements for btree node reads, data reads,
   and elsewhere. We now build up a single error message that lists all
   the errors encountered, actions taken to repair, and success/failure
   of the IO. This extends to other error paths that may kick off other
   actions, e.g. scheduling recovery passes: actions we took because of
   an error are included in that error message, with
   grouping/indentation so we can see what caused what.

 - New option, 'rebalance_on_ac_only'. Does exactly what the name
   suggests, quite handy with background compression.

 - Repair/self healing:

     - We can now kick off recovery passes and run them in the
       background if we detect errors. Currently, this is just used by
       code that walks backpointers. We now also check for missing
       backpointers at runtime and run check_extents_to_backpointers if
       required. The messy 6.14 upgrade left missing backpointers for
       some users, and this will correct that automatically instead of
       requiring a manual fsck - some users noticed this as copygc
       spinning and not making progress.

       In the future, as more recovery passes come online, we'll be able
       to repair and recover from nearly anything - except for
       unreadable btree nodes, and that's why you're using replication,
       of course - without shutting down the filesystem.

     - There's a new recovery pass, for checking the rebalance_work
       btree, which tracks extents that rebalance will process later.

 - Hardening:

     - Close the last known hole in btree iterator/btree locking
       assertions: path->should_be_locked paths must stay locked until
       the end of the transaction. This shook out a few bugs, including
       a performance issue that was causing unnecessary path_upgrade
       transaction restarts.

 - Performance:

     - Faster snapshot deletion: this is an incompatible feature, as it
       requires new sentinel values, for safety. Snapshot deletion no
       longer has to do a full metadata scan, it now just scans the
       inodes btree: if an extent/dirent/xattr is present for a given
       snapshot ID, we already require that an inode be present with
       that same snapshot ID.

       If/when users hit scalability limits again (ridiculously huge
       filesystems with lots of inodes, and many sparse snapshots), let
       me know - the next step will be to add an index from snapshot ID
       -> inode number, which won't be too hard.

     - Faster device removal: the "scan for pointers to this device" no
       longer does a full metadata scan, instead it walks backpointers.
       Like fast snapshot deletion this is another incompat feature: it
       also requires a new sentinel value, because we don't want to
       reuse these device IDs until after a fsck.

     - We're now coalescing redundant accounting updates prior to
       transaction commit, taking some pressure off the journal. Shortly
       we'll also be doing multiple extent updates in a transaction in
       the main write path, which combined with the previous should
       drastically cut down on the amount of metadata updates we have to
       journal.

 - Stack usage improvements: All allocator state has been moved off the
   stack

 - Debug improvements:

     - enumerated refcounts: The debug code previously used for
       filesystem write refs is now a small library, and used for other
       heavily used refcounts. Different users of a refcount are
       enumerated, making it much easier to debug refcount issues.

     - Async object debugging: There's a new kconfig option that makes
       various async objects (different types of bios, data updates,
       write ops, etc.) visible in debugfs, and it should be fast enough
       to leave on in production.

     - Various sets of assertions no longer require
       CONFIG_BCACHEFS_DEBUG, instead they're controlled by module
       parameters and static keys, meaning users won't need to compile
       custom kernels as often to help debug issues.

     - bch2_trans_kmalloc() calls can be tracked (there's a new kconfig
       option). With it on you can check the btree_transaction_stats in
       debugfs to see the bch2_trans_kmalloc() calls a transaction did
       when it used the most memory.

* tag 'bcachefs-2025-05-24' of git://evilpiepirate.org/bcachefs: (218 commits)
  bcachefs: Don't mount bs > ps without TRANSPARENT_HUGEPAGE
  bcachefs: Fix btree_iter_next_node() for new locking asserts
  bcachefs: Ensure we don't use a blacklisted journal seq
  bcachefs: Small check_fix_ptr fixes
  bcachefs: Fix opts.recovery_pass_last
  bcachefs: Fix allocate -> self healing path
  bcachefs: Fix endianness in casefold check/repair
  bcachefs: Path must be locked if trans->locked && should_be_locked
  bcachefs: Simplify bch2_path_put()
  bcachefs: Plumb btree_trans for more locking asserts
  bcachefs: Clear trans->locked before unlock
  bcachefs: Clear should_be_locked before unlock in key_cache_drop()
  bcachefs: bch2_path_get() reuses paths if upgrade_fails & !should_be_locked
  bcachefs: Give out new path if upgrade fails
  bcachefs: Fix btree_path_get_locks when not doing trans restart
  bcachefs: btree_node_locked_type_nowrite()
  bcachefs: Kill bch2_path_put_nokeep()
  bcachefs: bch2_journal_write_checksum()
  bcachefs: Reduce stack usage in data_update_index_update()
  bcachefs: bch2_trans_log_str()
  ...

parents 8fdabcd9 9caea920

Documentation/filesystems/bcachefs/casefolding.rst

+18 −0

Original line number	Diff line number	Diff line
		@@ -88,3 +88,21 @@ This would fail if negative dentry's were cached.

		This is slightly suboptimal, but could be fixed in future with some vfs work.


		References
		----------

		(from Peter Anvin, on the list)

		It is worth noting that Microsoft has basically declared their
		"recommended" case folding (upcase) table to be permanently frozen (for
		new filesystem instances in the case where they use an on-disk
		translation table created at format time.) As far as I know they have
		never supported anything other than 1:1 conversion of BMP code points,
		nor normalization.

		The exFAT specification enumerates the full recommended upcase table,
		although in a somewhat annoying format (basically a hex dump of
		compressed data):

		https://learn.microsoft.com/en-us/windows/win32/fileio/exfat-specification

Documentation/filesystems/bcachefs/future/idle_work.rst

0 → 100644

+78 −0

		Idle/background work classes design doc:

		Right now, our behaviour at idle isn't ideal: it was designed for servers that
		would be under sustained load, to keep pending work at a "medium" level, to
		let work build up so we can process it in more efficient batches, while also
		giving headroom for bursts in load.

		But for desktops or mobile - scenarios where work is less sustained and power
		usage is more important - we want to operate differently, with a "rush to
		idle" so the system can go to sleep. We don't want to be dribbling out
		background work while the system should be idle.

		The complicating factor is that there are a number of background tasks, which
		form a hierarchy (or a digraph, depending on how you divide it up) - one
		background task may generate work for another.

		Thus proper idle detection needs to model this hierarchy.

		- Foreground writes
		- Page cache writeback
		- Copygc, rebalance
		- Journal reclaim

		When we implement idle detection and rush to idle, we need to be careful not
		to disturb too much the existing behaviour that works reasonably well when the
		system is under sustained load (or perhaps improve it in the case of
		rebalance, which currently does not actively attempt to let work batch up).

		SUSTAINED LOAD REGIME
		---------------------

		When the system is under continuous load, we want these jobs to run
		continuously - this is perhaps best modelled with a P/D controller, where
		they'll be trying to keep a target value (e.g. fragmented disk space,
		available journal space) roughly in the middle of some range.

		The goal under sustained load is to balance our ability to handle load spikes
		without running out of some resource (free disk space, free space in the
		journal), while also letting some work accumulate to be batched (or become
		unnecessary).

		For example, we don't want to run copygc too aggressively, because then it
		will be evacuating buckets that would have become empty (been overwritten or
		deleted) anyways, and we don't want to wait until we're almost out of free
		space because then the system will behave unpredictably - suddenly we're doing
		a lot more work to service each write and the system becomes much slower.
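As a rough illustration of the P/D idea above, here is a minimal sketch in Python. It is not the bcachefs implementation: the class name, the gains `kp` and `kd`, and the convention that the measured value is a fraction between 0 and 1 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PDController:
    """Proportional/derivative controller steering a measured value toward a target."""
    target: float            # desired level, e.g. the middle of the allowed range
    kp: float                # proportional gain (hypothetical tuning value)
    kd: float                # derivative gain (hypothetical tuning value)
    prev_error: float = 0.0  # error seen on the previous update

    def update(self, measured: float, dt: float) -> float:
        """Return a background work rate clamped to [0, 1] for this interval."""
        error = measured - self.target               # > 0: more pending work than we want
        derivative = (error - self.prev_error) / dt  # how fast the backlog is changing
        self.prev_error = error
        rate = self.kp * error + self.kd * derivative
        return max(0.0, min(1.0, rate))


# Hypothetical use: keep fragmented space near 50% of some range.
copygc_ctl = PDController(target=0.5, kp=2.0, kd=0.5)
at_target = copygc_ctl.update(0.5, dt=1.0)     # backlog at target: no extra work
above_target = copygc_ctl.update(0.7, dt=1.0)  # backlog growing: work harder
below_target = copygc_ctl.update(0.3, dt=1.0)  # backlog shrinking: let work batch up
```

The derivative term is what lets the controller react to a load spike before the backlog has actually reached the edge of the range, while the clamp at zero is what allows work to accumulate when the backlog is below target.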

		IDLE REGIME
		-----------

		When the system becomes idle, we should start flushing our pending work
		more quickly so the system can go to sleep.

		Note that the definition of "idle" depends on where in the hierarchy a task
		is - a task should start flushing work more quickly when the task above it has
		stopped generating new work.

		e.g. rebalance should start flushing more quickly when page cache writeback is
		idle, and journal reclaim should only start flushing more quickly when both
		copygc and rebalance are idle.

		It's important to let work accumulate when more work is still incoming and we
		still have room, because flushing is always more efficient if we let it batch
		up. New writes may overwrite data before rebalance moves it, and tasks may be
		generating more updates for the btree nodes that journal reclaim needs to flush.

		On idle, how much work we do at each interval should be proportional to the
		length of time we have been idle for. If we're idle only for a short duration,
		we shouldn't flush everything right away; the system might wake up and start
		generating new work soon, and flushing immediately might end up doing a lot of
		work that would have been unnecessary if we'd allowed things to batch more.
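One possible reading of "proportional to the length of time we have been idle" is a linear ramp, sketched below in Python. The function name, the `ramp_seconds` tunable, and the choice of a linear ramp are assumptions for illustration only; the design doc does not specify them.

```python
def idle_flush_budget(pending: int, idle_seconds: float,
                      ramp_seconds: float = 30.0) -> int:
    """How many pending work items to flush in this interval.

    The fraction flushed grows linearly with the time spent idle and
    reaches 100% after ramp_seconds (a hypothetical tunable).
    """
    if pending <= 0 or idle_seconds <= 0:
        return 0
    fraction = min(1.0, idle_seconds / ramp_seconds)
    # Always make at least a little progress once we are idle at all.
    return min(pending, max(1, int(pending * fraction)))
```

With this shape, a brief pause flushes only a small slice of the backlog, so work that would have been overwritten or batched is mostly left alone, while a long idle period drains everything.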

		To summarize, we will need:

		- A list of classes for background tasks that generate work, which will
		include one "foreground" class.
		- Tracking for each class - "Am I doing work, or have I gone to sleep?"
		- And each class should check the class above it when deciding how much work to issue.
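The three requirements above could be modelled roughly as follows. This is a sketch in Python under stated assumptions, not the planned kernel data structure: `WorkClass`, `upstream_idle()` and `flush_mode()` are hypothetical names, the edges between classes follow the ordering of the task list earlier in this document, and treating "the class above" transitively is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class WorkClass:
    name: str
    # Classes that generate work for this one (the classes "above" it).
    parents: list["WorkClass"] = field(default_factory=list)
    # Tracking: "am I doing work, or have I gone to sleep?"
    active: bool = False

    def upstream_idle(self) -> bool:
        """True when every class above this one, directly or transitively, is asleep."""
        return all(not p.active and p.upstream_idle() for p in self.parents)

    def flush_mode(self) -> str:
        """Rush to idle only once nothing above us can generate new work."""
        return "rush-to-idle" if self.upstream_idle() else "batch"


# Assumed hierarchy, taken from the task list earlier in the document.
foreground = WorkClass("foreground writes")
writeback = WorkClass("page cache writeback", [foreground])
copygc = WorkClass("copygc", [writeback])
rebalance = WorkClass("rebalance", [writeback])
reclaim = WorkClass("journal reclaim", [copygc, rebalance])
```

Because journal reclaim lists both copygc and rebalance as parents, it stays in batching mode until both have gone to sleep, which matches the example given in the idle regime section.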

Documentation/filesystems/bcachefs/index.rst

+7 −0

		@@ -29,3 +29,10 @@ At this moment, only a few of these are described here.

		casefolding
		errorcodes

		Future design
		-------------
		.. toctree::
		:maxdepth: 1

		future/idle_work

fs/bcachefs/Kconfig

+8 −0

		@@ -103,6 +103,14 @@ config BCACHEFS_PATH_TRACEPOINTS
		Enable extra tracepoints for debugging btree_path operations; we don't
		normally want these enabled because they happen at very high rates.

		config BCACHEFS_TRANS_KMALLOC_TRACE
		bool "Trace bch2_trans_kmalloc() calls"
		depends on BCACHEFS_FS

		config BCACHEFS_ASYNC_OBJECT_LISTS
		bool "Keep async objects on fast_lists for debugfs visibility"
		depends on BCACHEFS_FS && DEBUG_FS

		config MEAN_AND_VARIANCE_UNIT_TEST
		tristate "mean_and_variance unit tests" if !KUNIT_ALL_TESTS
		depends on KUNIT

fs/bcachefs/Makefile

+4 −0

		@@ -35,11 +35,13 @@ bcachefs-y := \
		disk_accounting.o \
		disk_groups.o \
		ec.o \
		enumerated_ref.o \
		errcode.o \
		error.o \
		extents.o \
		extent_update.o \
		eytzinger.o \
		fast_list.o \
		fs.o \
		fs-ioctl.o \
		fs-io.o \
		@@ -97,6 +99,8 @@ bcachefs-y := \
		varint.o \
		xattr.o

		bcachefs-$(CONFIG_BCACHEFS_ASYNC_OBJECT_LISTS) += async_objs.o

		obj-$(CONFIG_MEAN_AND_VARIANCE_UNIT_TEST) += mean_and_variance_test.o

		# Silence "note: xyz changed in GCC X.X" messages