This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
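For illustration only, here is one way an allocation macro can take an
optional flags argument, so that the common case needs no explicit flag.
This is a userspace sketch and not the kernel's implementation: every
demo_* / DEMO_* name is hypothetical, and calloc() stands in for the real
allocator.

```c
#include <stdlib.h>

/* Hypothetical flag values; stand-ins for GFP_KERNEL and friends. */
#define DEMO_FLAG_DEFAULT 0x1u
#define DEMO_FLAG_ATOMIC  0x2u

/* Records the flags of the most recent call so the effect can be observed. */
static unsigned int demo_last_flags;

static void *demo_alloc(size_t size, unsigned int flags)
{
	demo_last_flags = flags;
	return calloc(1, size);
}

/* Select the third argument; used to dispatch on the argument count. */
#define DEMO_PICK3(_1, _2, _3, ...) _3

#define DEMO_ALLOC_OBJ1(TYPE)        demo_alloc(sizeof(TYPE), DEMO_FLAG_DEFAULT)
#define DEMO_ALLOC_OBJ2(TYPE, FLAGS) demo_alloc(sizeof(TYPE), (FLAGS))

/* One argument: default flags. Two arguments: explicit flags. */
#define demo_alloc_obj(...) \
	DEMO_PICK3(__VA_ARGS__, DEMO_ALLOC_OBJ2, DEMO_ALLOC_OBJ1, 0)(__VA_ARGS__)

struct demo_item {
	int id;
	long payload;
};
```

With this, demo_alloc_obj(struct demo_item) and
demo_alloc_obj(struct demo_item, DEMO_FLAG_DEFAULT) behave the same.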
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
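A rough userspace sketch of what such typed helpers could look like,
assuming GNU C (__typeof__) and using calloc() as the allocator. The
demo_* names are hypothetical and these are not the kernel definitions.
In particular the flex variant does no overflow checking, which a real
implementation would need.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical stand-ins. __typeof__ (GCC and Clang) lets TYPE be either
 * a type name or an expression such as *ptr, and makes each macro yield
 * a typed pointer instead of void *.
 */
#define demo_obj(TYPE) \
	((__typeof__(TYPE) *)calloc(1, sizeof(TYPE)))

#define demo_objs(TYPE, COUNT) \
	((__typeof__(TYPE) *)calloc((COUNT), sizeof(TYPE)))

/* Header plus COUNT trailing elements of the flexible array member FAM. */
#define demo_flex(TYPE, FAM, COUNT) \
	((__typeof__(TYPE) *)calloc(1, offsetof(__typeof__(TYPE), FAM) + \
		(size_t)(COUNT) * sizeof(((__typeof__(TYPE) *)0)->FAM[0])))

struct demo_pair {
	int a;
	int b;
};

struct demo_vec {
	size_t nr;
	int vals[];	/* flexible array member */
};
```

Assigning the result to a pointer of the wrong type now produces a
compiler diagnostic, which a plain void * return would not.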
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
Pull btrfs updates from David Sterba:
"User visible changes, feature updates:
- when using block size > page size, enable direct IO
- fallback to buffered IO if the data profile has duplication,
workaround to avoid checksum mismatches on block group profiles
with redundancy, real direct IO is possible on single or RAID0
- redo export of zoned statistics, moved from sysfs to
/proc/pid/mountstats due to size limitations of the former
Experimental features:
- remove offload checksum tunable, intended to find best way to do it
but since we've switched to offload to thread for everything we
don't need it anymore
- initial support for remap-tree feature, a translation layer of
logical block addresses that allows changes without moving/rewriting
blocks to do e.g. relocation, or other changes that require COW
Notable fixes:
- automatic removal of accidentally leftover chunks when
free-space-tree is enabled since mkfs.btrfs v6.16.1
- zoned mode:
- do not try to append to conventional zones when RAID is mixing
zoned and conventional drives
- fixup write pointers when mixing zoned and conventional on
DUP/RAID* profiles
- when using squota, relax deletion rules for qgroups with 0 members
to allow easier recovery from accounting bugs, also add more checks
to detect bad accounting
- fix periodic reclaim scanning, properly check boundary conditions
not to trigger it unexpectedly or miss the time to run it
- trim:
- continue after first error
- change reporting to the first detected error
- add more cancellation points
- reduce contention of big device lock that can block other
operations when there's lots of trimmed space
- when chunk allocation is forced (needs experimental build) fix
transaction abort when unexpected space layout is detected
Core:
- switch to crypto library API for checksumming, removed module
dependencies, pointer indirections, etc.
- error handling improvements
- adjust how and where transaction commit or abort are done and are
maybe not necessary
- minor compression optimization to skip single block ranges
- improve how compression folios are handled
- new and updated selftests
- cleanups, refactoring:
- auto-freeing and other automatic variable cleanup conversion
- structure size optimizations
- condition annotations"
* tag 'for-6.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (137 commits)
btrfs: get rid of compressed_bio::compressed_folios[]
btrfs: get rid of compressed_folios[] usage for encoded writes
btrfs: get rid of compressed_folios[] usage for compressed read
btrfs: remove the old btrfs_compress_folios() infrastructure
btrfs: switch to btrfs_compress_bio() interface for compressed writes
btrfs: introduce btrfs_compress_bio() helper
btrfs: zlib: introduce zlib_compress_bio() helper
btrfs: zstd: introduce zstd_compress_bio() helper
btrfs: lzo: introduce lzo_compress_bio() helper
btrfs: zoned: factor out the zone loading part into a testable function
btrfs: add cleanup function for btrfs_free_chunk_map
btrfs: tests: add cleanup functions for test specific functions
btrfs: raid56: fix memory leak of btrfs_raid_bio::stripe_uptodate_bitmap
btrfs: tests: add unit tests for pending extent walking functions
btrfs: fix EEXIST abort due to non-consecutive gaps in chunk allocation
btrfs: fix transaction commit blocking during trim of unallocated space
btrfs: handle user interrupt properly in btrfs_trim_fs()
btrfs: preserve first error in btrfs_trim_fs()
btrfs: continue trimming remaining devices on failure
btrfs: do not BUG_ON() in btrfs_remove_block_group()
...
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a helper to calculate an extent map's exclusive end offset, but
we only use it in some places. Update every site that open codes the
calculation to use the helper.
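As an illustration of the kind of helper meant here, a minimal sketch
with a hypothetical, trimmed-down structure (the real extent map has
more fields, and the real helper may differ in name and details):

```c
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for an extent map. */
struct demo_extent_map {
	uint64_t start;	/* file offset of the first byte */
	uint64_t len;	/* length in bytes */
};

/*
 * Exclusive end offset: the first byte past the extent.
 * Clamps to UINT64_MAX if start + len would wrap around.
 */
static inline uint64_t demo_extent_map_end(const struct demo_extent_map *em)
{
	if (em->start + em->len < em->start)
		return UINT64_MAX;
	return em->start + em->len;
}
```

Open coding "start + len" at each call site skips the wrap-around check,
which is one reason to go through a single helper.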
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of surrounding every caller of btrfs_is_shutdown() with unlikely,
move the unlikely into the helper itself, like we do in other places in
btrfs and is common in the kernel outside btrfs too. Also make the fs_info
argument of btrfs_is_shutdown() const.
On an x86_64 box using gcc 14.2.0-19 from Debian, this resulted in a slight
reduction of the module's text size.
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1939044 172568 15592 2127204 207564 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1938876 172568 15592 2127036 2074bc fs/btrfs/btrfs.ko
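The pattern, sketched in userspace with hypothetical demo_* names and a
made-up state bit (__builtin_expect is a GCC/Clang builtin):

```c
#include <errno.h>
#include <stdbool.h>

/* Same shape as the kernel macro. */
#define demo_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for the filesystem state. */
struct demo_fs_info {
	unsigned long flags;
};

#define DEMO_FS_STATE_SHUTDOWN (1UL << 0)

/* The hint lives here, so callers write a plain if (demo_is_shutdown(fs)). */
static inline bool demo_is_shutdown(const struct demo_fs_info *fs_info)
{
	return demo_unlikely(fs_info->flags & DEMO_FS_STATE_SHUTDOWN);
}

/* Example caller: no annotation needed at the call site. */
static inline int demo_do_io(const struct demo_fs_info *fs_info)
{
	if (demo_is_shutdown(fs_info))
		return -EIO;
	return 0;
}
```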
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In many places we have the pattern:
ret = ...;
return ret;
This can be simplified to a direct return, removing 'ret' if not
otherwise needed. The places in self tests are not converted so we can
add more test cases without changing surrounding code
(extent-map-tests.c:test_case_4()).
Signed-off-by: David Sterba <dsterba@suse.com>
In prepare_one_folio(), ret is initialized to 0 at declaration,
and in an error path we assign ret = 0 before jumping to the
again label to retry the operation. However, ret is immediately
overwritten by ret = set_folio_extent_mapped(folio) after the
again label.
Both assignments are never observed by any code path,
therefore they can be safely removed.
Signed-off-by: Massimiliano Pellizzer <mpellizzer.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs fixes from David Sterba:
- fix missing btrfs_path release after printing a relocation error
message
- fix extent changeset leak on mmap write after failure to reserve
metadata
- fix fs devices list structure freeing, it could be potentially leaked
under some circumstances
- tree log fixes:
- fix incremental directory logging where inodes for new dentries
were incorrectly skipped
- don't log conflicting inode if it's a directory moved in the
current transaction
- regression fixes:
- fix incorrect btrfs_path freeing when it's auto-cleaned
- revert commit simplifying preallocation of temporary structures
in qgroup functions, some cases were not handled properly
* tag 'for-6.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix changeset leak on mmap write after failure to reserve metadata
btrfs: fix memory leak of fs_devices in degraded seed device path
btrfs: fix a potential path leak in print_data_reloc_error()
Revert "btrfs: add ASSERTs on prealloc in qgroup functions"
btrfs: do not skip logging new dentries when logging a new name
btrfs: don't log conflicting inode if it's a dir moved in the current transaction
btrfs: tests: fix double btrfs_path free in remove_extent_ref()
Pull btrfs updates from David Sterba:
"Features:
- shutdown ioctl support (needs CONFIG_BTRFS_EXPERIMENTAL for now):
- set filesystem state as being shut down (also named going down
in other filesystems), where all active operations return EIO
and this cannot be changed until unmount
- pending operations are attempted to be finished but error
messages may still show up depending on where exactly the
shutdown happened
- scrub (and device replace) vs suspend/hibernate:
- a running scrub will prevent suspend, which can be annoying as
suspend is an immediate request and scrub is not critical
- filesystem freezing before suspend was not sufficient as the
problem was in process freezing
- behaviour change: on suspend scrub and device replace are
cancelled, where scrub can record the last state and continue
from there; the device replace has to be restarted from the
beginning
- zone stats exported in sysfs, from the perspective of the
filesystem this includes active, reclaimable, relocation etc zones
Performance:
- improvements when processing space reservation tickets by
optimizing locking and shrinking critical sections, cumulative
improvements in lockstat numbers show +15%
Notable fixes:
- use vmalloc fallback when allocating bios as high order allocations
can happen with wide checksums (like sha256)
- scrub will always track the last position of progress so it's not
starting from zero after an error
Core:
- under experimental config, checksum calculations are offloaded to
process context, simplifies locking and allows to remove
compression write worker kthread(s):
- speed improvement in direct IO throughput with buffered IO
fallback is +15% when not offloaded but this is more related to
internal crypto subsystem improvements
- this will be probably default in the future removing the sysfs
tunable
- (experimental) block size > page size updates:
- support more operations when not using large folios (encoded
read/write and send)
- raid56
- more preparations for fscrypt support
Other:
- more conversions to auto-cleaned variables
- parameter cleanups and removals
- extended warning fixes
- improved printing of structured values like keys
- lots of other cleanups and refactoring"
* tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits)
btrfs: remove unnecessary inode key in btrfs_log_all_parents()
btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root()
btrfs: remaining BTRFS_PATH_AUTO_FREE conversions
btrfs: send: do not allocate memory for xattr data when checking it exists
btrfs: send: add unlikely to all unexpected overflow checks
btrfs: reduce arguments to btrfs_del_inode_ref_in_log()
btrfs: remove root argument from btrfs_del_dir_entries_in_log()
btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref()
btrfs: don't search back for dir inode item in INO_LOOKUP_USER
btrfs: don't rewrite ret from inode_permission
btrfs: add orig_logical to btrfs_bio for encryption
btrfs: disable verity on encrypted inodes
btrfs: disable various operations on encrypted inodes
btrfs: remove redundant level reset in btrfs_del_items()
btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf()
btrfs: optimize balance_level() path reference handling
btrfs: factor out root promotion logic into promote_child_to_root()
btrfs: raid56: remove the "_step" infix
btrfs: raid56: enable bs > ps support
btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases
...
Pull folio updates from Christian Brauner:
"Add a new folio_next_pos() helper function that returns the file
position of the first byte after the current folio. This is a common
operation in filesystems when needing to know the end of the current
folio.
The helper is lifted from btrfs which already had its own version, and
is now used across multiple filesystems and subsystems:
- btrfs
- buffer
- ext4
- f2fs
- gfs2
- iomap
- netfs
- xfs
- mm
This fixes a long-standing bug in ocfs2 on 32-bit systems with files
larger than 2GiB. Presumably this is not a common configuration, but
the fix is backported anyway. The other filesystems did not have bugs,
they were just mildly inefficient.
This also introduces uoff_t as the unsigned version of loff_t. A recent
commit inadvertently changed a comparison from being unsigned (on
64-bit systems) to being signed (which it had always been on 32-bit
systems), leading to sporadic fstests failures.
Generally file sizes are restricted to being a signed integer, but in
places where -1 is passed to indicate "up to the end of the file", it
is convenient to have an unsigned type to ensure comparisons are
always unsigned regardless of architecture"
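To illustrate the signed versus unsigned comparison issue, a small
userspace sketch. The demo_* typedefs are stand-ins for loff_t and
uoff_t, and the range check is a made-up example rather than the code
that regressed:

```c
#include <stdbool.h>

/* Userspace stand-ins for the kernel typedefs. */
typedef long long demo_loff_t;
typedef unsigned long long demo_uoff_t;

/* Signed comparison: an end of -1 ("to end of file") is below every pos. */
static bool demo_in_range_signed(demo_loff_t pos, demo_loff_t end)
{
	return pos <= end;
}

/* Unsigned comparison: -1 converts to the largest value, so pos is in range. */
static bool demo_in_range_unsigned(demo_loff_t pos, demo_loff_t end)
{
	return (demo_uoff_t)pos <= (demo_uoff_t)end;
}
```

Both variants agree for ordinary end offsets and only differ when the
end is -1.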
* tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Add uoff_t
mm: Use folio_next_pos()
xfs: Use folio_next_pos()
netfs: Use folio_next_pos()
iomap: Use folio_next_pos()
gfs2: Use folio_next_pos()
f2fs: Use folio_next_pos()
ext4: Use folio_next_pos()
buffer: Use folio_next_pos()
btrfs: Use folio_next_pos()
filemap: Add folio_next_pos()
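A userspace sketch of what a "position after the folio" helper computes,
assuming 4 KiB pages and a hypothetical demo_folio structure. The narrow
variant shows one way such a computation can go wrong in 32-bit
arithmetic. It is not claimed to be the exact ocfs2 bug.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Hypothetical stand-in for a folio: first page index and page count. */
struct demo_folio {
	uint32_t index;		/* 32-bit on purpose, like a 32-bit unsigned long */
	uint32_t nr_pages;
};

/* Widen to 64 bits before shifting, so large files do not overflow. */
static inline int64_t demo_folio_next_pos(const struct demo_folio *folio)
{
	return ((int64_t)folio->index + folio->nr_pages) << DEMO_PAGE_SHIFT;
}

/* Buggy variant: the shift is done in 32 bits and wraps past 4 GiB. */
static inline int64_t demo_folio_next_pos_narrow(const struct demo_folio *folio)
{
	return (int64_t)((folio->index + folio->nr_pages) << DEMO_PAGE_SHIFT);
}
```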
A new fs state EMERGENCY_SHUTDOWN is introduced, which is btrfs'
equivalent of XFS_IOC_GOINGDOWN or EXT4_IOC_SHUTDOWN. After entering the
emergency shutdown state, all operations will return errors (-EIO), and
the filesystem cannot be brought back to a normal state until unmount.
The new state will reject the following file operations:
- read_iter()
- write_iter()
- mmap()
- open()
- remap_file_range()
- uring_cmd()
- splice_read()
This requires a small wrapper to do the extra shutdown check, then call
the regular filemap_splice_read() function.
This should reject most of the file operations on a shutdown btrfs.
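The wrapper idea, sketched in userspace. All demo_* names are
hypothetical, the state is a plain field (a real implementation would
use atomic bit operations), and a counter stands in for the one-time
kernel message:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_STATE_SHUTDOWN 1UL

struct demo_fs {
	unsigned long state;
	unsigned int shutdown_messages;	/* counts "critical" log lines */
};

static bool demo_fs_is_shutdown(const struct demo_fs *fs)
{
	return fs->state & DEMO_STATE_SHUTDOWN;
}

/* Enter shutdown; only the first call emits the message. */
static void demo_fs_shutdown(struct demo_fs *fs)
{
	if (!(fs->state & DEMO_STATE_SHUTDOWN))
		fs->shutdown_messages++;
	fs->state |= DEMO_STATE_SHUTDOWN;
}

/* Stand-in for the regular, shutdown-unaware implementation. */
static long demo_generic_read(size_t len)
{
	return (long)len;
}

/* Small wrapper: do the extra shutdown check, then call the regular path. */
static long demo_read(const struct demo_fs *fs, size_t len)
{
	if (demo_fs_is_shutdown(fs))
		return -EIO;
	return demo_generic_read(len);
}
```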
And for the existing dirty folios, extra shutdown checks are introduced
to the following functions:
- run_delalloc_nocow()
- run_delalloc_compressed()
- cow_file_range()
So that dirty ranges will still be properly cleaned without being
submitted.
Finally the shutdown state will also set the fs error, so that no new
transaction will be committed, protecting the metadata from any possible
further corruption.
And when the fs enters shutdown mode for the first time, a critical
level kernel message will show up to indicate the incident.
That message will be important for end users, as rejected delalloc ranges
will output error messages. Hopefully the shutdown message and the fact
that all fs operations are returning errors will prevent end users from
getting too confused about the delalloc error messages.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Tested-by: Anand Jain <asj@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a couple of places doing the computation "pos + write_bytes" when
we already have it in the local variable "end_pos". Change them to use the
variable instead, making the source code smaller. Also make the variable
const since it's not supposed to change.
This also has a very slight reduction in the module size.
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1915990 161647 15592 2093229 1ff0ad fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1915974 161647 15592 2093213 1ff09d fs/btrfs/btrfs.ko
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_fallocate(), when the allocated range overlaps with a prealloc
extent and the extent starts after i_size, the range doesn't get marked
dirty in file_extent_tree. This results in persisting an incorrect
disk_i_size for the inode when not using the no-holes feature.
This is reproducible since commit 41a2ee75aa ("btrfs: introduce
per-inode file extent tree"), then became hidden since commit 3d7db6e8bd
("btrfs: don't allocate file extent tree for non regular files") and then
visible again after commit 8679d2687c ("btrfs: initialize
inode::file_extent_tree after i_mode has been set"), which fixes the
previous commit.
The following reproducer triggers the problem:
$ cat test.sh
MNT=/mnt/test
DEV=/dev/vdb
mkdir -p $MNT
mkfs.btrfs -f -O ^no-holes $DEV
mount $DEV $MNT
touch $MNT/file1
fallocate -n -o 1M -l 2M $MNT/file1
umount $MNT
mount $DEV $MNT
len=$((1 * 1024 * 1024))
fallocate -o 1M -l $len $MNT/file1
du --bytes $MNT/file1
umount $MNT
mount $DEV $MNT
du --bytes $MNT/file1
umount $MNT
Running the reproducer gives the following result:
$ ./test.sh
(...)
2097152 /mnt/test/file1
1048576 /mnt/test/file1
The difference is exactly 1048576 as we assigned.
Fix by adding a call to btrfs_inode_set_file_extent_range() in
btrfs_fallocate_update_isize().
Fixes: 41a2ee75aa ("btrfs: introduce per-inode file extent tree")
Signed-off-by: austinchang <austinchang@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The unlikely() annotation is a static prediction hint that the compiler
may use to move code out of the hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen.
Transaction abort is one such error: btrfs_abort_transaction() inlines
code to check the state and print a warning, and this ought to be out of
the hot path.
The most common pattern is when transaction abort is called after
checking a return value and the control flow leads to a quick return.
In other cases it may not be necessary to add unlikely() e.g. when the
function returns anyway or the control flow is not changed noticeably.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The unlikely() annotation is a static prediction hint that the compiler
may use to move code out of the hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, and EIO is
one of them.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Annual typo fixing pass. Strangely codespell found only about 30% of
what is in this patch, the rest was done manually using a text
spellchecker with a custom dictionary of acceptable terms.
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull mmap_prepare updates from Christian Brauner:
"Last cycle we introduced f_op->mmap_prepare() in c84bf6dd2b ("mm:
introduce new .mmap_prepare() file callback").
This is preferred to the existing f_op->mmap() hook as it does not require
a VMA to be established yet, thus allowing the mmap logic to invoke
this hook far, far earlier, prior to inserting a VMA into the virtual
address space, or performing any other heavy handed operations.
This allows for much simpler unwinding on error, and for there to be a
single attempt at merging a VMA rather than having to possibly
reattempt a merge based on potentially altered VMA state.
Far more importantly, it prevents inappropriate manipulation of
incompletely initialised VMA state, which is something that has been
the cause of bugs and complexity in the past.
The intent is to gradually deprecate f_op->mmap, and in that vein this
series converts the majority of file systems to using f_op->mmap_prepare.
Prerequisite steps are taken - firstly ensuring all checks for mmap
capabilities use the file_has_valid_mmap_hooks() helper rather than
directly checking for f_op->mmap (which is now not a valid check) and
secondly updating daxdev_mapping_supported() to not require a VMA
parameter to allow ext4 and xfs to be converted.
Commit bb666b7c27 ("mm: add mmap_prepare() compatibility layer for
nested file systems") handles the nasty edge-case of nested file
systems like overlayfs, which introduces a compatibility shim to allow
f_op->mmap_prepare() to be invoked from an f_op->mmap() callback.
This allows for nested filesystems to continue to function correctly
with all file systems regardless of which callback is used. Once we
finally convert all file systems, this shim can be removed.
As a result, ecryptfs, fuse, and overlayfs remain unaltered so they
can nest all other file systems.
We additionally do not update resctrl - as this requires an update to
remap_pfn_range() (or an alternative to it) which we defer to a later
series. Equally we do not update cramfs, which needs a mixed mapping
insertion with the same issue, nor do we update procfs, hugetlbfs,
sysfs or kernfs, all of which require VMAs for internal state and hooks.
We shall return to all of these later"
* tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
doc: update porting, vfs documentation to describe mmap_prepare()
fs: replace mmap hook with .mmap_prepare for simple mappings
fs: convert most other generic_file_*mmap() users to .mmap_prepare()
fs: convert simple use of generic_file_*_mmap() to .mmap_prepare()
mm/filemap: introduce generic_file_*_mmap_prepare() helpers
fs/xfs: transition from deprecated .mmap hook to .mmap_prepare
fs/ext4: transition from deprecated .mmap hook to .mmap_prepare
fs/dax: make it possible to check dev dax support without a VMA
fs: consistently use can_mmap_file() helper
mm/nommu: use file_has_valid_mmap_hooks() helper
mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
Currently btrfs_check_nocow_lock() stops at the first extent it finds and
that extent may be smaller than the target range we want to NOCOW into.
But we can have multiple consecutive extents which we can NOCOW into, so
by stopping at the first one we find we just make the caller do more work
by splitting the write into multiple ones, or in the case of mmap writes
with large folios we fail with -ENOSPC in case the folio's range is
covered by more than one extent (the fallback to NOCOW for mmap writes in
case there's no available data space to reserve/allocate was recently
added by the patch "btrfs: fix -ENOSPC mmap write failure on NOCOW
files/extents").
Improve on this by checking for multiple consecutive extents.
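The idea can be sketched as follows, with a hypothetical flat array of
extents sorted by start offset (the real code walks the extent tree and
also has to deal with locking and ordered extents):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent description, sorted by start offset. */
struct demo_extent {
	uint64_t start;
	uint64_t len;
	bool can_nocow;	/* e.g. not shared with a snapshot or reflink */
};

/*
 * Return how many bytes starting at pos, up to want bytes, can be written
 * in place, walking consecutive NOCOW-able extents instead of stopping at
 * the first one.
 */
static uint64_t demo_nocow_len(const struct demo_extent *ext, size_t nr,
			       uint64_t pos, uint64_t want)
{
	uint64_t cur = pos;
	const uint64_t end = pos + want;

	for (size_t i = 0; i < nr && cur < end; i++) {
		uint64_t ext_end = ext[i].start + ext[i].len;

		if (ext_end <= cur)
			continue;	/* entirely before the current position */
		if (ext[i].start > cur || !ext[i].can_nocow)
			break;		/* hole or extent that needs COW */
		cur = ext_end < end ? ext_end : end;
	}
	return cur - pos;
}
```

A caller asking for a range covered by two adjacent NOCOW-able extents
gets the full length back in one call, rather than only the part that
falls into the first extent.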
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The documentation for the @nowait parameter is missing, so add it.
The @nowait parameter was added in commit 80f9d24130 ("btrfs: make
btrfs_check_nocow_lock nowait compatible"), which forgot to update the
function comment.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Most of the time we want to use the btrfs_inode, so change the local inode
variable to be a btrfs_inode instead of a VFS inode, reducing verbosity
by eliminating a lot of BTRFS_I() calls.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have the inode's io_tree already stored in a local variable, so use it
instead of grabbing it again in the call to btrfs_clear_extent_bit().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we attempt a mmap write into a NOCOW file or a prealloc extent when
there is no more available data space (or unallocated space to allocate a
new data block group) and we can do a NOCOW write (there are no reflinks
for the target extent or snapshots), we always fail due to -ENOSPC, unlike
for the regular buffered write and direct IO paths where we check that we
can do a NOCOW write in case we can't reserve data space.
Simple reproducer:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
umount $DEV &> /dev/null
mkfs.btrfs -f -b $((512 * 1024 * 1024)) $DEV
mount $DEV $MNT
touch $MNT/foobar
# Make it a NOCOW file.
chattr +C $MNT/foobar
# Add initial data to file.
xfs_io -c "pwrite -S 0xab 0 1M" $MNT/foobar
# Fill all the remaining data space and unallocated space with data.
dd if=/dev/zero of=$MNT/filler bs=4K &> /dev/null
# Overwrite the file with a mmap write. Should succeed.
xfs_io -c "mmap -w 0 1M" \
-c "mwrite -S 0xcd 0 1M" \
-c "munmap" \
$MNT/foobar
# Unmount, mount again and verify the new data was persisted.
umount $MNT
mount $DEV $MNT
od -A d -t x1 $MNT/foobar
umount $MNT
Running this:
$ ./test.sh
(...)
wrote 1048576/1048576 bytes at offset 0
1 MiB, 256 ops; 0.0008 sec (1.188 GiB/sec and 311435.5231 ops/sec)
./test.sh: line 24: 234865 Bus error xfs_io -c "mmap -w 0 1M" -c "mwrite -S 0xcd 0 1M" -c "munmap" $MNT/foobar
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
1048576
Fix this by not failing in case we can't allocate data space and we can
NOCOW into the target extent - reserving only metadata space in this case.
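The decision can be summarized by the following simplified sketch. The
function and parameter names are hypothetical and the real code deals
with actual reservations rather than booleans:

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Decide how to reserve space for a mmap write.
 * On success *only_metadata tells whether the data reservation was
 * skipped because the write can go in place (NOCOW).
 */
static int demo_reserve_for_mmap_write(bool data_space_available,
				       bool can_nocow, bool *only_metadata)
{
	*only_metadata = false;

	if (data_space_available)
		return 0;		/* regular path: data and metadata */
	if (!can_nocow)
		return -ENOSPC;		/* a COW write needs new data space */

	*only_metadata = true;		/* overwrite the extent in place */
	return 0;
}
```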
After this change the test passes:
$ ./test.sh
(...)
wrote 1048576/1048576 bytes at offset 0
1 MiB, 256 ops; 0.0007 sec (1.262 GiB/sec and 330749.3540 ops/sec)
0000000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
*
1048576
A test case for fstests will be added soon.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Any conversion of offsets in the logical or the physical mapping space
of the pages is done by a shift and the target type should be pgoff_t
(type of struct page::index). Fix the locations where it's still
unsigned long.
Signed-off-by: David Sterba <dsterba@suse.com>
Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using
the existing helper folio_next_index().
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can avoid potential memory allocation failure in
btrfs_replace_file_extents() as the block reserve lifetime is limited to
the scope of the function. This requires +48 bytes on stack.
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit c84bf6dd2b ("mm: introduce new .mmap_prepare() file
callback"), the f_op->mmap() hook has been deprecated in favour of
f_op->mmap_prepare().
This callback is invoked in the mmap() logic far earlier, so error handling
can be performed more safely without complicated and bug-prone state
unwinding required should an error arise.
This hook also avoids passing a pointer to a not-yet-correctly-established
VMA, avoiding any issues with referencing this data structure.
It rather provides a pointer to the new struct vm_area_desc descriptor type
which contains all required state and allows easy setting of required
parameters without any consideration needing to be paid to locking or
reference counts.
Note that nested filesystems like overlayfs are compatible with an
.mmap_prepare() callback since commit bb666b7c27 ("mm: add mmap_prepare()
compatibility layer for nested file systems").
In this patch we apply this change to file systems with relatively simple
mmap() hook logic - exfat, ceph, f2fs, bcachefs, zonefs, btrfs, ocfs2,
orangefs, nilfs2, romfs, ramfs and aio.
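For a simple file system the converted hook has roughly the following
shape. This is a userspace mock-up: the demo_* types are hypothetical
stand-ins for the kernel structures and only model the fields the
sketch needs.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace mock-ups; the real kernel types differ. */
struct demo_vm_ops {
	const char *name;
};

struct demo_file {
	int has_read_folio;	/* stand-in for "mapping can read folios" */
	int accessed;
};

/* Descriptor handed to the hook instead of a live VMA. */
struct demo_area_desc {
	struct demo_file *file;
	const struct demo_vm_ops *vm_ops;
};

static const struct demo_vm_ops demo_file_vm_ops = { .name = "demo" };

/*
 * Runs before any mapping is inserted: on failure there is nothing to
 * unwind, on success the hook only fills in fields of the descriptor.
 */
static int demo_file_mmap_prepare(struct demo_area_desc *desc)
{
	if (!desc->file->has_read_folio)
		return -ENOEXEC;

	desc->file->accessed = 1;
	desc->vm_ops = &demo_file_vm_ops;
	return 0;
}
```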
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/f528ac4f35b9378931bd800920fee53fc0c5c74d.1750099179.git.lorenzo.stoakes@oracle.com
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
We have two variables to track return values, ret and ret2, with types
vm_fault_t (an unsigned int type) and int, which makes it a bit confusing
and harder to keep track. So use a single variable, of type int, and under
the 'out' label return vmf_error(ret) in case ret contains an error,
otherwise return VM_FAULT_NOPAGE. This is equivalent to what we had before
and it's simpler.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the call to btrfs_set_extent_delalloc() fails we are always returning
VM_FAULT_SIGBUS, which is odd since that error means "bad access" and the
most likely cause for btrfs_set_extent_delalloc() to fail is -ENOMEM,
which should be translated to VM_FAULT_OOM.
Instead of returning VM_FAULT_SIGBUS return vmf_error(ret2), which gives
us a more appropriate return value, and we use that everywhere else too.
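A simplified userspace model of the translation, with hypothetical
DEMO_FAULT_* codes in place of the kernel's VM_FAULT_* values (the
kernel helper may handle further error codes not modelled here):

```c
#include <errno.h>

/* Hypothetical fault codes; the kernel's VM_FAULT_* values differ. */
enum demo_fault {
	DEMO_FAULT_NOPAGE,
	DEMO_FAULT_OOM,
	DEMO_FAULT_SIGBUS,
};

/* Map a negative errno to a fault code: out-of-memory is reported as such. */
static enum demo_fault demo_vmf_error(int err)
{
	if (err == -ENOMEM)
		return DEMO_FAULT_OOM;
	return DEMO_FAULT_SIGBUS;
}

/* Translate an int return value once, at the exit point of the handler. */
static enum demo_fault demo_fault_result(int ret)
{
	if (ret < 0)
		return demo_vmf_error(ret);
	return DEMO_FAULT_NOPAGE;
}
```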
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have these entangled error checks early in btrfs_page_mkwrite():
1) Try to reserve delalloc space by calling btrfs_delalloc_reserve_space()
and storing the return value in the ret2 variable;
2) If the reservation succeeds, call file_update_time() and store the
return value in ret2 and also set the local variable 'reserved' to
true (1);
3) Then do an error check on ret2 to see if any of the previous calls
failed and if so, jump either to the 'out' label or to the
'out_noreserve' label, depending on whether 'reserved' is true or
not.
This is unnecessarily complex. Instead change this to a simpler and
more straightforward approach:
1) Call btrfs_delalloc_reserve_space(), if that returns an error jump to
the 'out_noreserve' label;
2) Then call file_update_time() and if that returns an error jump to the
'out' label.
Like this there are fewer nested if statements, no need to use a local
variable to track if space was reserved and if statements are used only
to check errors.
Also move the call to extent_changeset_free() out of the 'out_noreserve'
label and under the 'out' label since the changeset is allocated only if
the call to reserve delalloc space succeeded.
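The resulting control flow, as a userspace sketch. The demo_ctx fields
are hypothetical knobs that simulate the results of the two calls and
record which cleanup steps ran:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical knobs so the sketch can simulate failures. */
struct demo_ctx {
	int reserve_err;	/* result of the space reservation */
	int update_time_err;	/* result of the timestamp update */
	bool space_released;	/* set when the reservation is undone */
	bool changeset_freed;	/* set when the changeset is freed */
};

static int demo_page_mkwrite(struct demo_ctx *ctx)
{
	int ret;

	ret = ctx->reserve_err;		/* step 1: reserve delalloc space */
	if (ret < 0)
		goto out_noreserve;

	ret = ctx->update_time_err;	/* step 2: update file times */
	if (ret < 0)
		goto out;

	/* ... the actual work would happen here ... */
	ret = 0;
out:
	/* Reached only after a successful reservation. */
	if (ret < 0)
		ctx->space_released = true;
	ctx->changeset_freed = true;
out_noreserve:
	return ret;
}
```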
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the last call to btrfs_delalloc_release_space() where the value of the
variable 'ret' is never zero, we pass the expression 'ret != 0' as the
value for the argument 'qgroup_free', which always evaluates to true.
Make this less confusing and more clear by explicitly passing true
instead.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're doing a mmap write against a folio that has i_size somewhere in
the middle and we have multiple sectors in the folio, we may have to
release excess space previously reserved, for the range going from the
rounded up (to sector size) i_size to the folio's end offset. We are
calculating the right amount to release and passing it to
btrfs_delalloc_release_space(), but we are passing the wrong start offset
of that range - we're passing the folio's start offset instead of the
end offset, plus 1, of the range for which we keep the reservation. This
may result in releasing more space than we should and eventually trigger
an underflow of the data space_info's bytes_may_use counter.
So fix this by passing the start offset as 'end + 1' instead of
'page_start' to btrfs_delalloc_release_space().
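A worked sketch of the range computation, using hypothetical names and
plain integers instead of the kernel structures:

```c
#include <stdint.h>

/* Round x up to a multiple of align; align must be a power of two. */
static inline uint64_t demo_round_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

struct demo_range {
	uint64_t start;
	uint64_t len;
};

/*
 * A folio [folio_start, folio_start + folio_size) has space reserved for
 * its whole size, but only the part up to i_size, rounded up to the
 * sector size, is kept. Return the excess range to release, or a
 * zero-length range when nothing is in excess.
 */
static struct demo_range demo_excess_reservation(uint64_t folio_start,
						 uint64_t folio_size,
						 uint64_t i_size,
						 uint64_t sectorsize)
{
	struct demo_range r = { 0, 0 };
	uint64_t kept;

	if (i_size <= folio_start || i_size >= folio_start + folio_size)
		return r;

	kept = demo_round_up(i_size - folio_start, sectorsize);
	if (kept < folio_size) {
		uint64_t end = folio_start + kept - 1;	/* last kept byte */

		r.start = end + 1;	/* not folio_start */
		r.len = folio_size - kept;
	}
	return r;
}
```

For a 64K folio at offset 65536 with i_size 70536 and 4K sectors, 8192
bytes are kept and the range to release starts at 73728 with a length
of 57344, ending exactly at the folio's end.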
Fixes: d0b7da88f6 ("Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to have special logic to return -EAGAIN in case the call
to __filemap_get_folio() fails, because when FGP_NOWAIT is passed to
__filemap_get_folio() it returns ERR_PTR(-EAGAIN) if it needs to do
something that would imply blocking.
The reason we have this logic is from the days before we migrated to the
folio interface, when we called pagecache_get_page() which would return
NULL instead of an error pointer.
So remove this special casing and always return the error that the call
to __filemap_get_folio() returned.
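The error pointer convention mentioned above can be sketched in userspace as follows. The ERR_PTR(), PTR_ERR() and IS_ERR() helpers here are simplified re-implementations for illustration, and get_folio_nowait() and prepare_one_folio() are hypothetical stand-ins, not the btrfs or page cache functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified userspace versions of the kernel's error pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct folio { int dummy; };
static struct folio cached_folio;

/*
 * Hypothetical non-blocking lookup: it reports the reason for failure in
 * the returned pointer instead of returning NULL.
 */
static struct folio *get_folio_nowait(bool would_block)
{
	if (would_block)
		return ERR_PTR(-EAGAIN);
	return &cached_folio;
}

/* The caller just propagates whatever error the lookup returned. */
static int prepare_one_folio(bool would_block, struct folio **folio_ret)
{
	struct folio *folio = get_folio_nowait(would_block);

	if (IS_ERR(folio))
		return (int)PTR_ERR(folio);
	*folio_ret = folio;
	return 0;
}
```

Because the lookup already encodes -EAGAIN in the pointer, the caller needs no separate branch that translates a failed non-blocking lookup into -EAGAIN.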
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following fsx sequence will fail on btrfs with 64K page size and 4K
fs block size:
#fsx -d -e 1 -N 4 $mnt/junk -S 36386
READ BAD DATA: offset = 0xe9ba, size = 0x6dd5, fname = /mnt/btrfs/junk
OFFSET GOOD BAD RANGE
0xe9ba 0x0000 0x03ac 0x0
operation# (mod 256) for the bad data may be 3
...
LOG DUMP (4 total operations):
1( 1 mod 256): WRITE 0x6c62 thru 0x1147d (0xa81c bytes) HOLE ***WWWW
2( 2 mod 256): TRUNCATE DOWN from 0x1147e to 0x5448 ******WWWW
3( 3 mod 256): ZERO 0x1c7aa thru 0x28fe2 (0xc839 bytes)
4( 4 mod 256): MAPREAD 0xe9ba thru 0x1578e (0x6dd5 bytes) ***RRRR***
[CAUSE]
Only 2 operations are really involved in this case:
3 pollute_eof 0x5448 thru 0xffff (0xabb8 bytes)
3 zero from 0x1c7aa to 0x28fe3, (0xc839 bytes)
4 mapread 0xe9ba thru 0x1578e (0x6dd5 bytes)
At operation 3, fsx pollutes beyond EOF, which is done by mmap()
and a write into that mmap() range beyond EOF.
Such a write will fill the range beyond EOF, but it will never reach disk
as ranges beyond EOF will not be marked dirty nor uptodate.
Then we zero_range for [0x1c7aa, 0x28fe3], and since the range is beyond
our isize (which was 0x5448), we should zero out any range beyond
EOF (0x5448).
During btrfs_zero_range(), we call btrfs_truncate_block() to dirty the
unaligned head block.
But that function only really zeroes out the block at [0x5000, 0x5fff], it
doesn't touch any range other than that block, since those ranges will
not be marked dirty nor written back.
So the range [0x6000, 0xffff] is still polluted, and later mapread()
will return the poisoned value.
[FIX]
Enhance btrfs_truncate_block() by:
- Pass a @start/@end pair to indicate the full truncation range
This is to handle the following truncation case:
Page size is 64K, fs block size is 4K, truncate range is
[6K, 60K]
0                    32K                    64K
|   |///////////////////////////////////|   |
    6K                                  60K
The range is not aligned for its head block, so we need to call
btrfs_truncate_block() with @from = 6K, @front = 0, @len = 0.
But with that information we only know to zero the range [6K, 8K),
if we zero out the range [6K, 64K), the last block will also be
zeroed, causing data loss.
So here we need the full range we're truncating, so that we can avoid
over-truncation.
- Rename @from to @offset
As now the parameter is only utilized to locate a block, it's not
really carrying the old @from meaning well.
- Remove @front parameter
With the full truncate range passed in, we can determine if the
@offset is at the head or tail block.
- Skip truncation if @offset is not in the head nor tail blocks
The call site in hole punch unconditionally calls
btrfs_truncate_block() without even checking whether the range is
aligned.
If the @offset is neither in the head nor in tail block, it means we can
safely ignore it.
- Skip truncate if the range inside the target block is already aligned
- Make btrfs_truncate_block() zero all blocks beyond EOF
Since we have the original range, we know exactly if we're doing
truncation beyond EOF (the @end will be (u64)-1).
If we're doing truncation beyond EOF, then enlarge the truncation
range to the folio end, to address the possibly polluted ranges.
Otherwise still keep the zero range inside the block. As we can have
large data folios soon, always truncating every block inside the same
folio can be costly for large folios.
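The range selection described in the list above can be sketched in userspace as follows. This is only an illustration under stated assumptions: struct zero_range and pick_zero_range() are hypothetical names, the folio containing @offset is described by two plain integers, @end is inclusive, and locking, reservation and the actual zeroing are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: the inclusive byte range to zero, if any. */
struct zero_range {
	bool needed;
	uint64_t start;
	uint64_t end;
};

static uint64_t block_floor(uint64_t x, uint32_t blocksize)
{
	return x - (x % blocksize);
}

static bool in_block(uint64_t off, uint64_t block_start, uint32_t blocksize)
{
	return off >= block_start && off - block_start < blocksize;
}

/*
 * @offset locates the block, [@start, @end] is the full truncation range.
 * The folio described by @folio_start/@folio_size is assumed to be the one
 * that contains @offset.
 */
static struct zero_range pick_zero_range(uint64_t offset, uint64_t start,
					 uint64_t end, uint32_t blocksize,
					 uint64_t folio_start,
					 uint64_t folio_size)
{
	struct zero_range zr = { false, 0, 0 };
	const bool head = in_block(offset, block_floor(start, blocksize), blocksize);
	const bool tail = in_block(offset, block_floor(end, blocksize), blocksize);
	/* end + 1 wraps to 0 for (u64)-1, which counts as aligned. */
	const bool head_unaligned = head && (start % blocksize) != 0;
	const bool tail_unaligned = tail && ((end + 1) % blocksize) != 0;
	uint64_t lo, hi;

	assert(start <= offset && offset <= end);

	/* Not in the head or tail block, or that block is already aligned. */
	if (!head_unaligned && !tail_unaligned)
		return zr;

	if (end == UINT64_MAX) {
		/* Truncating beyond EOF: widen to the end of the folio. */
		lo = folio_start;
		hi = folio_start + folio_size - 1;
	} else {
		/* Otherwise stay inside the block that holds @offset. */
		lo = block_floor(offset, blocksize);
		hi = lo + blocksize - 1;
	}
	zr.needed = true;
	zr.start = start > lo ? start : lo;
	zr.end = end < hi ? end : hi;
	return zr;
}
```

For the [6K, 60K] example with 4K blocks in a 64K folio, calling the helper with @offset = 6K selects only [6K, 8K - 1], so the tail block is not touched. For the fsx case, with @start = 0x5448 and @end = (u64)-1, it selects [0x5448, 0xffff], which covers the polluted blocks up to the folio end.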
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Old code uses int for a lot of bool return values; bool is recommended
and is used in new code. Convert the trivial cases that do simple 0/false
and 1/true. Function comments are updated if needed.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using 'i' for a parameter is confusing and does not conform to current
preferences, so rename it to 'iter'.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is an exported function so it should have a 'btrfs_' prefix by
convention, to make it clear it's btrfs specific and to avoid collisions
with functions from elsewhere in the kernel.
Rename the function to add 'btrfs_' prefix to it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.
So add a 'btrfs_' prefix to their names to make it clear they are from
btrfs.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. One of them has a
double underscore prefix which is also discouraged.
So remove double underscore prefix where applicable and add a 'btrfs_'
prefix to their name to make it clear they are from btrfs.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. So add a prefix to
their name.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only blocker is the ASSERT() rejecting large folios, so just remove
it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_page_mkwrite() has an explicit ASSERT() checking the
folio order.
To make it support large data folios, we need to:
- Remove the ASSERT(folio_order(folio) == 0)
- Use folio_contains() to check if the folio covers the last page
Otherwise the code is already supporting large folios well.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the most trivial pattern for auto freeing, where the variable is
declared with the macro and the final btrfs_free_path() is removed.
There are almost no goto -> return conversions and there's no other
function cleanup.
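The auto freeing pattern can be sketched in userspace with the compiler's cleanup attribute (a GCC/Clang extension), which is the mechanism the kernel's scope-based cleanup helpers build on. Every demo_ and DEMO_ name below is hypothetical and only mimics the shape of the conversion; it is not the btrfs macro or path API.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct demo_path { int slots[8]; };

/* Number of paths allocated and not yet freed, to observe the cleanup. */
static int demo_paths_live;

static struct demo_path *demo_alloc_path(void)
{
	struct demo_path *p = calloc(1, sizeof(*p));

	if (p)
		demo_paths_live++;
	return p;
}

static void demo_free_path(struct demo_path *p)
{
	if (!p)
		return;
	demo_paths_live--;
	free(p);
}

/* Cleanup handlers receive a pointer to the variable leaving scope. */
static void demo_free_pathp(struct demo_path **pp)
{
	demo_free_path(*pp);
}

/* Declare a path pointer that is freed when it goes out of scope. */
#define DEMO_PATH_AUTO_FREE(name) \
	struct demo_path *name __attribute__((cleanup(demo_free_pathp))) = NULL

static int demo_lookup(int key)
{
	DEMO_PATH_AUTO_FREE(path);

	path = demo_alloc_path();
	if (!path)
		return -ENOMEM;
	if (key < 0)
		return -EINVAL;	/* early return, no explicit free needed */

	path->slots[0] = key;
	return 0;		/* the final demo_free_path() call is gone */
}
```

After the conversion, every return path frees the path implicitly, which is why the trailing free call can be dropped without touching the rest of the function.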
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_punch_hole_lock_range() needs to make sure there is
no other folio in the range, thus it goes with filemap_range_has_page(),
which works pretty fine.
But if we have large folios, under the following case
filemap_range_has_page() will always return true, forcing
btrfs_punch_hole_lock_range() to do a very time consuming busy loop:
start                            end
|                                |
|//|//|//|//|  |  |  |  |  |  |  |//|//|
 \         /                      \   /
   Folio A                       Folio B
In the above case, folio A and B contain our start/end indexes, and there
are no other folios in the range. Thus we do not need to retry inside
btrfs_punch_hole_lock_range().
To prepare for large data folios, introduce a helper,
check_range_has_page(), which will:
- Shrink the search range towards page boundaries
If the rounded down end (exclusive, otherwise it can underflow when @end
is inside the folio at file offset 0) is no larger than the rounded up
start, it means the range contains no pages other than the ones
covering @start and @end.
We can return false directly in that case.
- Grab all the folios inside the range
- Skip any large folios that cover the start and end indexes
- If any other folios are found return true
- Otherwise return false
This new helper is going to handle both large folios and regular ones.
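The steps listed above can be sketched in userspace as follows. This is an assumption-laden illustration, not the helper from the patch: the page cache is modelled as a plain array of hypothetical struct demo_folio entries, the page size is fixed at 4K, and "covers the start or end index" is interpreted as a folio that extends outside the shrunk page range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT	12
#define DEMO_PAGE_SIZE	(UINT64_C(1) << DEMO_PAGE_SHIFT)

/* Hypothetical stand-in for a cached folio. */
struct demo_folio {
	uint64_t index;		/* index of the first page */
	uint64_t nr_pages;	/* 1 for a regular folio, more for a large one */
};

/* Return true if a folio lies fully inside the range [start, end]. */
static bool demo_range_has_page(const struct demo_folio *folios, size_t nr,
				uint64_t start, uint64_t end)
{
	/* Shrink the range towards page boundaries, the end is exclusive. */
	const uint64_t lock_start = (start + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1);
	const uint64_t lock_end = (end + 1) & ~(DEMO_PAGE_SIZE - 1);
	uint64_t first, last;

	/* Only the pages covering @start and @end are involved. */
	if (lock_end <= lock_start)
		return false;

	first = lock_start >> DEMO_PAGE_SHIFT;
	last = (lock_end - 1) >> DEMO_PAGE_SHIFT;

	for (size_t i = 0; i < nr; i++) {
		const uint64_t f_first = folios[i].index;
		const uint64_t f_last = f_first + folios[i].nr_pages - 1;

		/* Entirely outside the shrunk range. */
		if (f_last < first || f_first > last)
			continue;
		/* Large folio that also covers the page of @start or @end. */
		if (f_first < first || f_last > last)
			continue;
		return true;
	}
	return false;
}
```

With folio A covering pages 0 to 3, folio B covering pages 11 and 12, @start inside page 3 and @end inside page 11, the sketch returns false, so the caller would not need to retry. Adding a folio at page 6 makes it return true.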
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>