Commit Graph

1412051 Commits

Author SHA1 Message Date
Chao Yu
ab59919c8a f2fs: check skipped write in f2fs_enable_checkpoint()
This patch introduces sbi->nr_pages[F2FS_SKIPPED_WRITE] to record any
skipped write during data flush in f2fs_enable_checkpoint().

So in the loop of data flush, if there is any skipped write in previous
flush, let's retry sync_inode_sb(), otherwise, all dirty data written
before f2fs_enable_checkpoint() should have been persisted, then break
the retry loop.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27 02:45:44 +00:00
Jaegeuk Kim
993663874b Revert "f2fs: add timeout in f2fs_enable_checkpoint()"
This reverts commit 4bc3477796.

Let's apply a better approach to flush the only dirty pages committed by user
to avoid the delay caused by unncessary incoming ones.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-20 20:54:14 +00:00
Chao Yu
a5d8b9d94e f2fs: fix to unlock folio in f2fs_read_data_large_folio()
We missed to unlock folio in error path of f2fs_read_data_large_folio(),
fix it.

With below testcase, it can reproduce the bug.

touch /mnt/f2fs/file
truncate -s $((1024*1024*1024)) /mnt/f2fs/file
f2fs_io setflags immutable /mnt/f2fs/file
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024
f2fs_io clearflags immutable /mnt/f2fs/file
echo 1 > /proc/sys/vm/drop_caches
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-19 17:07:47 +00:00
Chao Yu
fe15bc3d44 f2fs: fix error path handling in f2fs_read_data_large_folio()
In error path of f2fs_read_data_large_folio(), if bio is valid, it
may submit bio twice, fix it.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-19 17:07:47 +00:00
Jaegeuk Kim
ec8bb999dc f2fs: use folio_end_read
No logic change.

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:35 +00:00
Chao Yu
5c145c0318 f2fs: fix to avoid mapping wrong physical block for swapfile
Xiaolong Guo reported a f2fs bug in bugzilla [1]

[1] https://bugzilla.kernel.org/show_bug.cgi?id=220951

Quoted:

"When using stress-ng's swap stress test on F2FS filesystem with kernel 6.6+,
the system experiences data corruption leading to either:
1 dm-verity corruption errors and device reboot
2 F2FS node corruption errors and boot hangs

The issue occurs specifically when:
1 Using F2FS filesystem (ext4 is unaffected)
2 Swapfile size is less than F2FS section size (2MB)
3 Swapfile has fragmented physical layout (multiple non-contiguous extents)
4 Kernel version is 6.6+ (6.1 is unaffected)

The root cause is in check_swap_activate() function in fs/f2fs/data.c. When the
first extent of a small swapfile (< 2MB) is not aligned to section boundaries,
the function incorrectly treats it as the last extent, failing to map
subsequent extents. This results in incorrect swap_extent creation where only
the first extent is mapped, causing subsequent swap writes to overwrite wrong
physical locations (other files' data).

Steps to Reproduce
1 Setup a device with F2FS-formatted userdata partition
2 Compile stress-ng from https://github.com/ColinIanKing/stress-ng
3 Run swap stress test: (Android devices)
adb shell "cd /data/stressng; ./stress-ng-64 --metrics-brief --timeout 60
--swap 0"

Log:
1 Ftrace shows in kernel 6.6, only first extent is mapped during second
f2fs_map_blocks call in check_swap_activate():
stress-ng-swap-8990: f2fs_map_blocks: ino=11002, file offset=0, start
blkaddr=0x43143, len=0x1
(Only 4KB mapped, not the full swapfile)
2 in kernel 6.1, both extents are correctly mapped:
stress-ng-swap-5966: f2fs_map_blocks: ino=28011, file offset=0, start
blkaddr=0x13cd4, len=0x1
stress-ng-swap-5966: f2fs_map_blocks: ino=28011, file offset=1, start
blkaddr=0x60c84b, len=0xff

The problematic code is in check_swap_activate():
if ((pblock - SM_I(sbi)->main_blkaddr) % blks_per_sec ||
    nr_pblocks % blks_per_sec ||
    !f2fs_valid_pinned_area(sbi, pblock)) {
    bool last_extent = false;

    not_aligned++;

    nr_pblocks = roundup(nr_pblocks, blks_per_sec);
    if (cur_lblock + nr_pblocks > sis->max)
        nr_pblocks -= blks_per_sec;

    /* this extent is last one */
    if (!nr_pblocks) {
        nr_pblocks = last_lblock - cur_lblock;
        last_extent = true;
    }

    ret = f2fs_migrate_blocks(inode, cur_lblock, nr_pblocks);
    if (ret) {
        if (ret == -ENOENT)
            ret = -EINVAL;
        goto out;
    }

    if (!last_extent)
        goto retry;
}

When the first extent is unaligned and roundup(nr_pblocks, blks_per_sec)
exceeds sis->max, we subtract blks_per_sec resulting in nr_pblocks = 0. The
code then incorrectly assumes this is the last extent, sets nr_pblocks =
last_lblock - cur_lblock (entire swapfile), and performs migration. After
migration, it doesn't retry mapping, so subsequent extents are never processed.
"

In order to fix this issue, we need to lookup block mapping info after
we migrate all blocks in the tail of swapfile.

Cc: stable@kernel.org
Fixes: 9703d69d9d ("f2fs: support file pinning for zoned devices")
Cc: Daeho Jeong <daehojeong@google.com>
Reported-and-tested-by: Xiaolong Guo <guoxiaolong2008@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220951
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:35 +00:00
Chao Yu
fe2961fb77 f2fs: avoid f2fs_map_blocks() for consecutive holes in readpages
For consecutive large hole mapping across {d,id,did}nodes , we don't
need to call f2fs_map_blocks() to check one hole block per one time,
instead, we can use map.m_next_pgofs as a hint of next potential valid
block, so that we can skip calling f2fs_map_blocks the range of
[cur_pgofs + 1, .m_next_pgofs).

1) regular case

touch /mnt/f2fs/file
truncate -s $((1024*1024*1024)) /mnt/f2fs/file
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024

Before:
real    0m0.706s
user    0m0.000s
sys     0m0.706s

After:
real    0m0.620s
user    0m0.008s
sys     0m0.611s

2) large folio case

touch /mnt/f2fs/file
truncate -s $((1024*1024*1024)) /mnt/f2fs/file
f2fs_io setflags immutable /mnt/f2fs/file
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024

Before:
real    0m0.438s
user    0m0.004s
sys     0m0.433s

After:
real    0m0.368s
user    0m0.004s
sys     0m0.364s

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:35 +00:00
Nanzhe Zhao
d194f112a9 f2fs: advance index and offset after zeroing in large folio read
In f2fs_read_data_large_folio(), the block zeroing path calls
folio_zero_range() and then continues the loop. However, it fails to
advance index and offset before continuing.

This can cause the loop to repeatedly process the same subpage of the
folio, leading to stalls/hangs and incorrect progress when reading large
folios with holes/zeroed blocks.

Fix it by advancing index and offset unconditionally in the loop
iteration, so they are updated even when the zeroing path continues.

Signed-off-by: Nanzhe Zhao <nzzhao@126.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:35 +00:00
Nanzhe Zhao
6afd05ca6d f2fs: add 'folio_in_bio' to handle readahead folios with no BIO submission
f2fs_read_data_large_folio() can build a single read BIO across multiple
folios during readahead. If a folio ends up having none of its subpages
added to the BIO (e.g. all subpages are zeroed / treated as holes), it
will never be seen by f2fs_finish_read_bio(), so folio_end_read() is
never called. This leaves the folio locked and not marked uptodate.

Track whether the current folio has been added to a BIO via a local
'folio_in_bio' bool flag, and when iterating readahead folios, explicitly
mark the folio uptodate (on success) and unlock it when nothing was added.

Signed-off-by: Nanzhe Zhao <nzzhao@126.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:35 +00:00
Yongpeng Yang
540d34c182 f2fs: avoid unnecessary block mapping lookups in f2fs_read_data_large_folio
In the second call to f2fs_map_blocks within f2fs_read_data_large_folio,
map.m_len exceeds the logical address space to be read. This patch
ensures map.m_len does not exceed the required address space.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:35 +00:00
Chao Yu
93ffb6c28f f2fs: detect more inconsistent cases in sanity_check_node_footer()
Let's enhance sanity_check_node_footer() to detect more inconsistent
cases as below:

Node Type			Node Footer Info
===================		=============================
NODE_TYPE_REGULAR		inode = true and xnode = true
NODE_TYPE_INODE			inode = false or xnode = true
NODE_TYPE_XATTR			inode = true or xnode = false
NODE_TYPE_NON_INODE		inode = false

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:35 +00:00
Chao Yu
50ac3ecd8e f2fs: fix to do sanity check on node footer in {read,write}_end_io
-----------[ cut here ]------------
kernel BUG at fs/f2fs/data.c:358!
Call Trace:
 <IRQ>
 blk_update_request+0x5eb/0xe70 block/blk-mq.c:987
 blk_mq_end_request+0x3e/0x70 block/blk-mq.c:1149
 blk_complete_reqs block/blk-mq.c:1224 [inline]
 blk_done_softirq+0x107/0x160 block/blk-mq.c:1229
 handle_softirqs+0x283/0x870 kernel/softirq.c:579
 __do_softirq kernel/softirq.c:613 [inline]
 invoke_softirq kernel/softirq.c:453 [inline]
 __irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:680
 irq_exit_rcu+0x9/0x30 kernel/softirq.c:696
 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1050 [inline]
 sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1050
 </IRQ>

In f2fs_write_end_io(), it detects there is inconsistency in between
node page index (nid) and footer.nid of node page.

If footer of node page is corrupted in fuzzed image, then we load corrupted
node page w/ async method, e.g. f2fs_ra_node_pages() or f2fs_ra_node_page(),
in where we won't do sanity check on node footer, once node page becomes
dirty, we will encounter this bug after node page writeback.

Cc: stable@kernel.org
Reported-by: syzbot+803dd716c4310d16ff3a@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=803dd716c4310d16ff3a
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:34 +00:00
Chao Yu
0a736109c9 f2fs: fix to do sanity check on node footer in __write_node_folio()
Add node footer sanity check during node folio's writeback, if sanity
check fails, let's shutdown filesystem to avoid looping to redirty
and writeback in .writepages.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:34 +00:00
Yangyang Zang
f7b929eda1 f2fs: clean up the type parameter in f2fs_sync_meta_pages()
Clean up code to improve readability, no logic changes.

Signed-off-by: Yangyang Zang <zangyangyang1@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:34 +00:00
Daeho Jeong
e48e16f3e3 f2fs: support non-4KB block size without packed_ssa feature
Currently, F2FS requires the packed_ssa feature to be enabled when
utilizing non-4KB block sizes (e.g., 16KB). This restriction limits
the flexibility of filesystem formatting options.

This patch allows F2FS to support non-4KB block sizes even when the
packed_ssa feature is disabled. It adjusts the SSA calculation logic to
correctly handle summary entries in larger blocks without the packed
layout.

Cc: stable@kernel.org
Fixes: 7ee8bc3942 ("f2fs: revert summary entry count from 2048 to 512 in 16kb block support")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:34 +00:00
Chao Yu
1dd3b437d4 f2fs: make FAULT_DISCARD obsolete
__blkdev_issue_discard() in __submit_discard_cmd() will never fail, so
let's make FAULT_DISCARD fault injection obsolete.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:34 +00:00
Chao Yu
ce2739e482 f2fs: fix to avoid UAF in f2fs_write_end_io()
As syzbot reported an use-after-free issue in f2fs_write_end_io().

It is caused by below race condition:

loop device				umount
- worker_thread
 - loop_process_work
  - do_req_filebacked
   - lo_rw_aio
    - lo_rw_aio_complete
     - blk_mq_end_request
      - blk_update_request
       - f2fs_write_end_io
        - dec_page_count
        - folio_end_writeback
					- kill_f2fs_super
					 - kill_block_super
					  - f2fs_put_super
					 : free(sbi)
       : get_pages(, F2FS_WB_CP_DATA)
         accessed sbi which is freed

In kill_f2fs_super(), we will drop all page caches of f2fs inodes before
call free(sbi), it guarantee that all folios should end its writeback, so
it should be safe to access sbi before last folio_end_writeback().

Let's relocate ckpt thread wakeup flow before folio_end_writeback() to
resolve this issue.

Cc: stable@kernel.org
Fixes: e234088758 ("f2fs: avoid wait if IO end up when do_checkpoint for better performance")
Reported-by: syzbot+b4444e3c972a7a124187@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b4444e3c972a7a124187
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-17 00:00:34 +00:00
Chao Yu
3996b70209 Revert "f2fs: block cache/dio write during f2fs_enable_checkpoint()"
This reverts commit 196c81fdd4.

Original patch may cause below deadlock, revert it.

write				remount
- write_begin
 - lock_page  --- lock A
 - prepare_write_begin
  - f2fs_map_lock
				- f2fs_enable_checkpoint
				 - down_write(cp_enable_rwsem)  --- lock B
				 - sync_inode_sb
				  - writepages
				   - lock_page			--- lock A
   - down_read(cp_enable_rwsem)  --- lock A

Cc: stable@kernel.org
Fixes: 196c81fdd4 ("f2fs: block cache/dio write during f2fs_enable_checkpoint()")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-16 03:49:31 +00:00
Chao Yu
0eda086de8 f2fs: fix to check sysfs filename w/ gc_pin_file_thresh correctly
Sysfs entry name is gc_pin_file_thresh instead of gc_pin_file_threshold,
fix it.

Cc: stable@kernel.org
Fixes: c521a6ab4a ("f2fs: fix to limit gc_pin_file_threshold")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:09 +00:00
Yongpeng Yang
7633a7387e f2fs: fix IS_CHECKPOINTED flag inconsistency issue caused by concurrent atomic commit and checkpoint writes
During SPO tests, when mounting F2FS, an -EINVAL error was returned from
f2fs_recover_inode_page. The issue occurred under the following scenario

Thread A                                     Thread B
f2fs_ioc_commit_atomic_write
 - f2fs_do_sync_file // atomic = true
  - f2fs_fsync_node_pages
    : last_folio = inode folio
    : schedule before folio_lock(last_folio) f2fs_write_checkpoint
                                              - block_operations// writeback last_folio
                                              - schedule before f2fs_flush_nat_entries
    : set_fsync_mark(last_folio, 1)
    : set_dentry_mark(last_folio, 1)
    : folio_mark_dirty(last_folio)
    - __write_node_folio(last_folio)
      : f2fs_down_read(&sbi->node_write)//block
                                              - f2fs_flush_nat_entries
                                                : {struct nat_entry}->flag |= BIT(IS_CHECKPOINTED)
                                              - unblock_operations
                                                : f2fs_up_write(&sbi->node_write)
                                             f2fs_write_checkpoint//return
      : f2fs_do_write_node_page()
f2fs_ioc_commit_atomic_write//return
                                             SPO

Thread A calls f2fs_need_dentry_mark(sbi, ino), and the last_folio has
already been written once. However, the {struct nat_entry}->flag did not
have the IS_CHECKPOINTED set, causing set_dentry_mark(last_folio, 1) and
write last_folio again after Thread B finishes f2fs_write_checkpoint.

After SPO and reboot, it was detected that {struct node_info}->blk_addr
was not NULL_ADDR because Thread B successfully write the checkpoint.

This issue only occurs in atomic write scenarios. For regular file
fsync operations, the folio must be dirty. If
block_operations->f2fs_sync_node_pages successfully submit the folio
write, this path will not be executed. Otherwise, the
f2fs_write_checkpoint will need to wait for the folio write submission
to complete, as sbi->nr_pages[F2FS_DIRTY_NODES] > 0. Therefore, the
situation where f2fs_need_dentry_mark checks that the {struct
nat_entry}->flag /wo the IS_CHECKPOINTED flag, but the folio write has
already been submitted, will not occur.

Therefore, for atomic file fsync, sbi->node_write should be acquired
through __write_node_folio to ensure that the IS_CHECKPOINTED flag
correctly indicates that the checkpoint write has been completed.

Fixes: 608514deba ("f2fs: set fsync mark only for the last dnode")
Cc: stable@kernel.org
Signed-off-by: Sheng Yong <shengyong1@xiaomi.com>
Signed-off-by: Jinbao Liu <liujinbao1@xiaomi.com>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:09 +00:00
Yongpeng Yang
071e50d61c f2fs: change seq_file_ra_mul and max_io_bytes to unsigned int
{struct file_ra_state}->ra_pages and {struct bio}->bi_iter.bi_size is
defined as unsigned int, so values of seq_file_ra_mul and max_io_bytes
exceeding UINT_MAX are meaningless.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:09 +00:00
Yongpeng Yang
98ea0039db f2fs: fix out-of-bounds access in sysfs attribute read/write
Some f2fs sysfs attributes suffer from out-of-bounds memory access and
incorrect handling of integer values whose size is not 4 bytes.

For example:
vm:~# echo 65537 > /sys/fs/f2fs/vde/carve_out
vm:~# cat /sys/fs/f2fs/vde/carve_out
65537
vm:~# echo 4294967297 > /sys/fs/f2fs/vde/atgc_age_threshold
vm:~# cat /sys/fs/f2fs/vde/atgc_age_threshold
1

carve_out maps to {struct f2fs_sb_info}->carve_out, which is a 8-bit
integer. However, the sysfs interface allows setting it to a value
larger than 255, resulting in an out-of-range update.

atgc_age_threshold maps to {struct atgc_management}->age_threshold,
which is a 64-bit integer, but its sysfs interface cannot correctly set
values larger than UINT_MAX.

The root causes are:
1. __sbi_store() treats all default values as unsigned int, which
prevents updating integers larger than 4 bytes and causes out-of-bounds
writes for integers smaller than 4 bytes.

2. f2fs_sbi_show() also assumes all default values are unsigned int,
leading to out-of-bounds reads and incorrect access to integers larger
than 4 bytes.

This patch introduces {struct f2fs_attr}->size to record the actual size
of the integer associated with each sysfs attribute. With this
information, sysfs read and write operations can correctly access and
update values according to their real data size, avoiding memory
corruption and truncation.

Fixes: b59d0bae6c ("f2fs: add sysfs support for controlling the gc_thread")
Cc: stable@kernel.org
Signed-off-by: Jinbao Liu <liujinbao1@xiaomi.com>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:08 +00:00
Nanzhe Zhao
c0c589fa1d f2fs: Accounting large folio subpages before bio submission
In f2fs_read_data_large_folio(), read_pages_pending is incremented only
after the subpage has been added to the BIO.  With a heavily fragmented
file, each new subpage can force submission of the previous BIO.

If the BIO completes quickly, f2fs_finish_read_bio() may decrement
read_pages_pending to zero and call folio_end_read() while the read loop
is still processing other subpages of the same large folio.

Fix the ordering by incrementing read_pages_pending before any possible
BIO submission for the current subpage, matching the iomap ordering and
preventing premature folio_end_read().

Signed-off-by: Nanzhe Zhao <nzzhao@126.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:08 +00:00
Nanzhe Zhao
00feea1dfc f2fs: Zero f2fs_folio_state on allocation
f2fs_folio_state is attached to folio->private and is expected to start
with read_pages_pending == 0.  However, the structure was allocated from
ffs_entry_slab without being fully initialized, which can leave
read_pages_pending with stale values.

Allocate the object with __GFP_ZERO so all fields are reliably zeroed at
creation time.

Signed-off-by: Nanzhe Zhao <nzzhao@126.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:08 +00:00
Chao Yu
d36de29f4b f2fs: sysfs: introduce inject_lock_timeout
This patch adds a new sysfs node in /sys/fs/f2fs/<disk>/inject_lock_timeout,
it relies on CONFIG_F2FS_FAULT_INJECTION kernel config.

It can be used to simulate different type of timeout in lock duration.

==========     ===============================
Flag_Value     Flag_Description
==========     ===============================
0x00000000     No timeout (default)
0x00000001     Simulate running time
0x00000002     Simulate IO type sleep time
0x00000003     Simulate Non-IO type sleep time
0x00000004     Simulate runnable time
==========     ===============================

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:08 +00:00
Chao Yu
c56254e2e0 f2fs: introduce FAULT_LOCK_TIMEOUT
This patch introduce a new fault type FAULT_LOCK_TIMEOUT, it can
be used to inject timeout into lock duration.

Timeout type can be set via /sys/fs/f2fs/<disk>/inject_timeout_type

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:08 +00:00
Chao Yu
7a127c80b0 f2fs: rename FAULT_TIMEOUT to FAULT_ATOMIC_TIMEOUT
No logic changes.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:08 +00:00
Chao Yu
6fa1160539 f2fs: fix timeout precision of f2fs_io_schedule_timeout_killable()
Sometimes, f2fs_io_schedule_timeout_killable(HZ) may delay for about 2
seconds, this is because __f2fs_schedule_timeout(DEFAULT_SCHEDULE_TIMEOUT)
may delay for about 2 * DEFAULT_SCHEDULE_TIMEOUT due to its precision, but
we only account the delay as DEFAULT_SCHEDULE_TIMEOUT as below, fix it.

f2fs_io_schedule_timeout_killable()
..
	timeout -= DEFAULT_SCHEDULE_TIMEOUT;

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:07 +00:00
Chao Yu
da90b67155 f2fs: fix to use jiffies based precision for DEFAULT_SCHEDULE_TIMEOUT
Due to timeout parameter in {io,}_schedule_timeout() is based on jiffies
unit precision. It will lose precision when using msecs_to_jiffies(x)
for conversion.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:07 +00:00
Chao Yu
b5da276ae6 f2fs: clean up w/ __f2fs_schedule_timeout()
No logic changes.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:07 +00:00
Chao Yu
67972c2b89 f2fs: trace elapsed time for io_rwsem lock
Use f2fs_{down,up}_{read,write}_trace for io_rwsem to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:07 +00:00
Chao Yu
ce9fe67c9c f2fs: trace elapsed time for cp_global_sem lock
Use f2fs_{down,up}_write_trace for cp_global_sem to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:07 +00:00
Chao Yu
e605302c14 f2fs: trace elapsed time for gc_lock lock
Use f2fs_{down,up}_write_trace for gc_lock to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:07 +00:00
Chao Yu
bb28b66875 f2fs: trace elapsed time for node_write lock
Use f2fs_{down,up}_read_trace for node_write to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:06 +00:00
Chao Yu
f9f9360251 f2fs: trace elapsed time for node_change lock
Use f2fs_{down,up}_read_trace for node_change to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:06 +00:00
Chao Yu
66e9e0d55d f2fs: trace elapsed time for cp_rwsem lock
Use f2fs_{down,up}_read_trace for cp_rwsem to trace lock elapsed time.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:06 +00:00
Chao Yu
e4b75621fc f2fs: sysfs: introduce max_lock_elapsed_time
This patch add a new sysfs node in /sys/fs/f2fs/<device>/max_lock_elapsed_time.

This is a threshold, once a thread enters critical region that lock covers,
total elapsed time exceeds this threshold, f2fs will print tracepoint to dump
information of related context. This sysfs entry can be used to control the
value of threshold, by default, the value is 500 ms.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:06 +00:00
Chao Yu
79b3cebc70 f2fs: add lock elapsed time trace facility for f2fs rwsemphore
This patch adds lock elapsed time trace facility for f2fs rwsemphore.

If total elapsed time of critical region covered by lock exceeds a
threshold, it will print tracepoint to dump information of lock related
context, including:
- thread information
- CPU/IO priority
- lock information
- elapsed time
 - total time
 - running time (depend on CONFIG_64BIT)
 - runnable time (depend on CONFIG_SCHED_INFO and CONFIG_SCHEDSTATS)
 - io sleep time (depend on CONFIG_TASK_DELAY_ACCT and
		  /proc/sys/kernel/task_delayacct)
 - other time    (by default other time will account nonio sleep time,
                  but, if above kconfig is not defined, other time will
                  include runnable time and/or io sleep time as wll)

output:
    f2fs_lock_elapsed_time: dev = (254,52), comm: sh, pid: 13855, prio: 120, ioprio_class: 2, ioprio_data: 4, lock_name: cp_rwsem, lock_type: rlock, total: 1000, running: 993, runnable: 2, io_sleep: 0, other: 5

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:06 +00:00
Daeho Jeong
7ec199117c f2fs: flush plug periodically during GC to maximize readahead effect
During the garbage collection process, F2FS submits readahead I/Os for
valid blocks. However, since the GC loop runs within a single plug scope
without intermediate flushing, these readahead I/Os often accumulate in
the block layer's plug list instead of being dispatched to the device
immediately.

Consequently, when the GC thread attempts to lock the page later, the
I/O might not have completed (or even started), leading to a "read try
and wait" scenario. This negates the benefit of readahead and causes
unnecessary delays in GC latency.

This patch addresses this issue by introducing an intermediate
blk_finish_plug() and blk_start_plug() pair within the GC loop. This
forces the dispatch of pending I/Os, ensuring that readahead pages are
fetched in time, thereby reducing GC latency.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-07 03:17:06 +00:00
ZhaoYueNan
572b1c6f2a f2fs: Update the default value of the documentation ckpt_thread_ioprio
The commit 8a2d9f00d has been updated to set its default value to "rt,3",
fixing the outdated default value in the F2FS documentation.

Signed-off-by: ZhaoYueNan <amktiao030215@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-01 03:30:54 +00:00
Yongpeng Yang
9609dd7047 f2fs: remove non-uptodate folio from the page cache in move_data_block
During data movement, move_data_block acquires file folio without
triggering a file read. Such folio are typically not uptodate, they need
to be removed from the page cache after gc complete. This patch marks
folio with the PG_dropbehind flag and uses folio_end_dropbehind to
remove folio from the page cache.

Signed-off-by: Yunlei He <heyunlei@xiaomi.com>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-01 03:30:25 +00:00
Yongpeng Yang
db1a8a7813 f2fs: return immediately after submitting the specified folio in __submit_merged_write_cond
f2fs_folio_wait_writeback ensures the folio write is submitted to the
block layer via __submit_merged_write_cond, then waits for the folio to
complete. Other I/O submissions are irrelevant to
f2fs_folio_wait_writeback. Thus, if the folio write bio is already
submitted, the function can return immediately. This patch adds a
writeback parameter to __submit_merged_write_cond(), which signals an
immediate return after submitting the target folio, and waitting
writeback can use this parameter.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-01 03:29:40 +00:00
Yongpeng Yang
86c1cf0578 f2fs: clean up the force parameter in __submit_merged_write_cond()
The force parameter in __submit_merged_write_cond is redundant, where
`force == true` implies `inode == NULL && folio == NULL && ino == 0` is
true, and `force == false` implies `inode != NULL || folio != NULL ||
ino != 0` is true. Thus, this patch replaces the force parameter with
a stack variable force.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-01 03:29:35 +00:00
Zhiguo Niu
761dac9073 f2fs: fix to add gc count stat in f2fs_gc_range
It missed the stat count in f2fs_gc_range.

Cc: stable@kernel.org
Fixes: 9bf1dcbdfd ("f2fs: fix to account gc stats correctly")
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-01 03:27:26 +00:00
Chao Yu
3cb396a2c7 f2fs: fix to do sanity check on nat entry of quota inode
As Zhiguo reported, nat entry of quota inode could be corrupted:

"ino/block_addr=NULL_ADDR in nid=4 entry"

We'd better to do sanity check on quota inode to detect and record
nat.blk_addr inconsistency, so that we can have a chance to repair
it w/ later fsck.

Reported-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-01 03:26:56 +00:00
Zhiguo Niu
3250bd41d9 f2fs: remove some redundant codes in f2fs_quota_enable
1. qf_inum has been got and checked in its caller f2fs_enable_quotas
2. f2fs_sb_has_quota_ino has bee checked in its all callers
3. use sbi cleanup F2FS_SB(sb)

Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-01 03:26:08 +00:00
Joanne Chang
4a210a5be2 f2fs: improve check for enough free sections
The check for enough sections in segment.h has the following issues:

1. has_not_enough_free_secs() should return "enough secs" when "free_secs
>= upper_secs", not just structly greater. Conversely, it should only
return "not enough secs" when "free_secs < lower_secs", not when they are
equal. This accounts for the possibility that blocks can fit within
curseg without requiring an additional free section.

2. __get_secs_required() currently separates the needed space to section
and block parts, checking them against free sections and curseg,
respectively. This does not consider the case where curseg cannot hold
the whole block part, but excess free sections beyond the section part
can accommodate some of the block part.

3. has_curseg_enough_space() only checks CURSEG_HOT_DATA for dentry
blocks, but when active_logs=6, they may be placed in WARM and COLD
sections. Also, the current logic does not consider that dentry and data
blocks can be put in the same section when active_logs=2 or 6.

This patch modifies the three functions to address the above issues:

1. Rename has_curseg_enough_space() to get_additional_blocks_required().
Calculate the minimum node, dentry, and data blocks curseg can
accommodate. Then subtract these from the total required blocks of
respective type to determine the worst-case number of blocks that must
be placed in free sections.

2. In __get_secs_required(), get the number of blocks needing new
sections from the new get_additional_blocks_required(). Return the upper
bound of necessary free sections for these blocks. For active_logs=2 or
6, dentry blocks are combined with data blocks.

3. In has_not_enough_free_secs(), get the required sections from
__get_secs_required(), and return “not enough secs” if “free_secs <
required_secs”.

Signed-off-by: Joanne Chang <joannechien@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-16 00:49:25 +00:00
Jaegeuk Kim
903c6e95bc f2fs: add a tracepoint to see large folio read submission
For example,

1327.539878: f2fs_preload_pages_start: dev = (252,16), ino = 14, i_size = 4294967296 start: 0, end: 8191
1327.539878: page_cache_sync_ra: dev=252:16 ino=e index=0 req_count=8192 order=9 size=0 async_size=0 ra_pages=4096 mmap_miss=0 prev_pos=-1
1327.539879: page_cache_ra_order: dev=252:16 ino=e index=0 order=9 size=4096 async_size=2048 ra_pages=4096
1327.541895: f2fs_readpages: dev = (252,16), ino = 14, start = 0 nrpage = 4096
1327.541930: f2fs_lookup_extent_tree_start: dev = (252,16), ino = 14, pgofs = 0, type = Read
1327.541931: f2fs_lookup_read_extent_tree_end: dev = (252,16), ino = 14, pgofs = 0, read_ext_info(fofs: 0, len: 1048576, blk: 4221440)
1327.541931: f2fs_map_blocks: dev = (252,16), ino = 14, file offset = 0, start blkaddr = 0x406a00, len = 0x1000, flags = 2, seg_type = 8, may_create = 0, multidevice = 0, flag = 0, err = 0
1327.541989: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 0, nr_pages = 512, dirty = 0, uptodate = 0
1327.542012: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 512, nr_pages = 512, dirty = 0, uptodate = 0
1327.542036: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 1024, nr_pages = 512, dirty = 0, uptodate = 0
1327.542080: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 1536, nr_pages = 512, dirty = 0, uptodate = 0
1327.542127: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 2048, nr_pages = 512, dirty = 0, uptodate = 0
1327.542151: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 2560, nr_pages = 512, dirty = 0, uptodate = 0
1327.542196: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 3072, nr_pages = 512, dirty = 0, uptodate = 0
1327.542219: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 3584, nr_pages = 512, dirty = 0, uptodate = 0
1327.542239: f2fs_submit_read_bio: dev = (252,16)/(252,16), rw = READ(R), DATA, sector = 33771520, size = 16777216
1327.542269: page_cache_sync_ra: dev=252:16 ino=e index=4096 req_count=8192 order=9 size=4096 async_size=2048 ra_pages=4096 mmap_miss=0 prev_pos=-1
1327.542289: page_cache_ra_order: dev=252:16 ino=e index=4096 order=9 size=4096 async_size=2048 ra_pages=4096
1327.544485: f2fs_readpages: dev = (252,16), ino = 14, start = 4096 nrpage = 4096
1327.544521: f2fs_lookup_extent_tree_start: dev = (252,16), ino = 14, pgofs = 4096, type = Read
1327.544521: f2fs_lookup_read_extent_tree_end: dev = (252,16), ino = 14, pgofs = 4096, read_ext_info(fofs: 0, len: 1048576, blk: 4221440)
1327.544522: f2fs_map_blocks: dev = (252,16), ino = 14, file offset = 4096, start blkaddr = 0x407a00, len = 0x1000, flags = 2, seg_type = 8, may_create = 0, multidevice = 0, flag = 0, err = 0
1327.544550: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 4096, nr_pages = 512, dirty = 0, uptodate = 0
1327.544575: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 4608, nr_pages = 512, dirty = 0, uptodate = 0
1327.544601: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 5120, nr_pages = 512, dirty = 0, uptodate = 0
1327.544647: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 5632, nr_pages = 512, dirty = 0, uptodate = 0
1327.544692: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 6144, nr_pages = 512, dirty = 0, uptodate = 0
1327.544734: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 6656, nr_pages = 512, dirty = 0, uptodate = 0
1327.544777: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 7168, nr_pages = 512, dirty = 0, uptodate = 0
1327.544805: f2fs_read_folio: dev = (252,16), ino = 14, DATA, FILE, index = 7680, nr_pages = 512, dirty = 0, uptodate = 0
1327.544826: f2fs_submit_read_bio: dev = (252,16)/(252,16), rw = READ(R), DATA, sector = 33804288, size = 16777216
1327.544852: f2fs_preload_pages_end: dev = (252,16), ino = 14, i_size = 4294967296 start: 8192, end: 8191

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-16 00:46:49 +00:00
Jaegeuk Kim
05e65c14ea f2fs: support large folio for immutable non-compressed case
This patch enables large folio for limited case where we can get the high-order
memory allocation. It supports the encrypted and fsverity files, which are
essential for Android environment.

How to test:
- dd if=/dev/zero of=/mnt/test/test bs=1G count=4
- f2fs_io setflags immutable /mnt/test/test
- echo 3 > /proc/sys/vm/drop_caches
 : to reload inode with large folio
- f2fs_io read 32 0 1024 mmap 0 0 /mnt/test/test

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-12-16 00:46:49 +00:00
Linus Torvalds
8f0b4cce44 Linux 6.19-rc1 2025-12-14 16:05:07 +12:00