Commit c562ba61, authored by Robbie Ko, committed by David Sterba

btrfs: fix incorrect i_size after remount caused by KEEP_SIZE prealloc gap



When fallocate() with FALLOC_FL_KEEP_SIZE preallocates an extent past the
current i_size, the file_extent_tree of the inode is updated to cover
that range. However, on the next mount, btrfs_read_locked_inode() only
re-populates file_extent_tree with [0, round_up(i_size, sectorsize)),
losing the marks that belonged to the KEEP_SIZE prealloc extent beyond
i_size.

Later, when a non-KEEP_SIZE fallocate() extends i_size into or past that
old prealloc extent, the reservation loop in btrfs_fallocate() skips
already-prealloc segments and does not call into the path that marks the
file_extent_tree, so a gap remains inside the file_extent_tree across
[old_aligned_i_size, start_of_new_alloc). Then __btrfs_prealloc_file_range()
calls btrfs_inode_safe_disk_i_size_write(), which uses
find_contiguous_extent_bit() starting at offset 0 to derive disk_i_size.
The walk stops at the gap, so disk_i_size ends up smaller than i_size and
gets persisted. After the next mount, the file shows the wrong (smaller)
size.

The following reproducer triggers the problem:

  $ cat test.sh
  MNT=/mnt/sdi
  DEV=/dev/sdi

  mkdir -p $MNT
  mkfs.btrfs -f -O ^no-holes $DEV
  mount $DEV $MNT

  touch $MNT/file1
  # KEEP_SIZE prealloc beyond i_size (i_size stays 0)
  fallocate -n -o 4M -l 4M $MNT/file1
  umount $MNT
  mount $DEV $MNT

  # non-KEEP_SIZE fallocate that overlaps the previous prealloc tail
  # and extends past it
  fallocate -o 7M -l 2M $MNT/file1
  ls -lh $MNT/file1
  umount $MNT
  mount $DEV $MNT
  ls -lh $MNT/file1
  umount $MNT

Running the reproducer gives the following result:

  $ ./test.sh
  (...)
  -rw-rw-r-- 1 root root 9.0M May  4 16:35 /mnt/sdi/file1
  -rw-rw-r-- 1 root root 7.0M May  4 16:35 /mnt/sdi/file1

The size before the second mount is correct (9M), but after the
remount it drops to 7M, i.e. the start of the gap inside file_extent_tree.

Fix this in __btrfs_prealloc_file_range() by marking the entire range
[round_down(old_i_size, sectorsize), round_up(new_i_size, sectorsize))
in file_extent_tree before updating i_size and calling
btrfs_inode_safe_disk_i_size_write(). This ensures the contiguous bit
search starting from 0 is not truncated by a stale gap left behind by a
previous KEEP_SIZE prealloc that was not restored on inode load.

The fix has no effect when the NO_HOLES feature is enabled because
btrfs_inode_safe_disk_i_size_write() and btrfs_inode_set_file_extent_range()
both take the fast path that directly tracks disk_i_size without
consulting file_extent_tree.

Fixes: 9ddc959e ("btrfs: use the file extent tree infrastructure")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
[ Minor updates to the change log ]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
parent 4066c55e
+28 −0
@@ -9299,10 +9299,38 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
		if (!(mode & FALLOC_FL_KEEP_SIZE) &&
		    (actual_len > inode->i_size) &&
		    (cur_offset > inode->i_size)) {
			u64 range_start;
			u64 range_end;

			if (cur_offset > actual_len)
				i_size = actual_len;
			else
				i_size = cur_offset;

			/*
			 * Make sure the file_extent_tree covers the entire
			 * range [old_i_size, new_i_size) before we update
			 * disk_i_size. Without this, a previous KEEP_SIZE
			 * prealloc that extended past i_size (and was lost
			 * across umount/mount because file_extent_tree is
			 * only populated up to round_up(i_size) on inode
			 * load) can leave a gap inside this range. That gap
			 * would cause btrfs_inode_safe_disk_i_size_write()
			 * (via find_contiguous_extent_bit() starting at 0)
			 * to truncate disk_i_size to the start of the gap,
			 * making the persisted size smaller than i_size.
			 */
			range_start = round_down(inode->i_size, fs_info->sectorsize);
			range_end = round_up(i_size, fs_info->sectorsize);
			ret = btrfs_inode_set_file_extent_range(BTRFS_I(inode),
					range_start, range_end - range_start);
			if (ret) {
				btrfs_abort_transaction(trans, ret);
				if (own_trans)
					btrfs_end_transaction(trans);
				break;
			}

			i_size_write(inode, i_size);
			btrfs_inode_safe_disk_i_size_write(BTRFS_I(inode), 0);
		}