This patch introduces sbi->nr_pages[F2FS_SKIPPED_WRITE] to record any
skipped write during data flush in f2fs_enable_checkpoint().
So in the loop of data flush, if there is any skipped write in previous
flush, let's retry sync_inode_sb(), otherwise, all dirty data written
before f2fs_enable_checkpoint() should have been persisted, then break
the retry loop.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts commit 4bc3477796.
Let's apply a better approach to flush the only dirty pages committed by user
to avoid the delay caused by unncessary incoming ones.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In error path of f2fs_read_data_large_folio(), if bio is valid, it
may submit bio twice, fix it.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Xiaolong Guo reported a f2fs bug in bugzilla [1]
[1] https://bugzilla.kernel.org/show_bug.cgi?id=220951
Quoted:
"When using stress-ng's swap stress test on F2FS filesystem with kernel 6.6+,
the system experiences data corruption leading to either:
1 dm-verity corruption errors and device reboot
2 F2FS node corruption errors and boot hangs
The issue occurs specifically when:
1 Using F2FS filesystem (ext4 is unaffected)
2 Swapfile size is less than F2FS section size (2MB)
3 Swapfile has fragmented physical layout (multiple non-contiguous extents)
4 Kernel version is 6.6+ (6.1 is unaffected)
The root cause is in check_swap_activate() function in fs/f2fs/data.c. When the
first extent of a small swapfile (< 2MB) is not aligned to section boundaries,
the function incorrectly treats it as the last extent, failing to map
subsequent extents. This results in incorrect swap_extent creation where only
the first extent is mapped, causing subsequent swap writes to overwrite wrong
physical locations (other files' data).
Steps to Reproduce
1 Setup a device with F2FS-formatted userdata partition
2 Compile stress-ng from https://github.com/ColinIanKing/stress-ng
3 Run swap stress test: (Android devices)
adb shell "cd /data/stressng; ./stress-ng-64 --metrics-brief --timeout 60
--swap 0"
Log:
1 Ftrace shows in kernel 6.6, only first extent is mapped during second
f2fs_map_blocks call in check_swap_activate():
stress-ng-swap-8990: f2fs_map_blocks: ino=11002, file offset=0, start
blkaddr=0x43143, len=0x1
(Only 4KB mapped, not the full swapfile)
2 in kernel 6.1, both extents are correctly mapped:
stress-ng-swap-5966: f2fs_map_blocks: ino=28011, file offset=0, start
blkaddr=0x13cd4, len=0x1
stress-ng-swap-5966: f2fs_map_blocks: ino=28011, file offset=1, start
blkaddr=0x60c84b, len=0xff
The problematic code is in check_swap_activate():
if ((pblock - SM_I(sbi)->main_blkaddr) % blks_per_sec ||
nr_pblocks % blks_per_sec ||
!f2fs_valid_pinned_area(sbi, pblock)) {
bool last_extent = false;
not_aligned++;
nr_pblocks = roundup(nr_pblocks, blks_per_sec);
if (cur_lblock + nr_pblocks > sis->max)
nr_pblocks -= blks_per_sec;
/* this extent is last one */
if (!nr_pblocks) {
nr_pblocks = last_lblock - cur_lblock;
last_extent = true;
}
ret = f2fs_migrate_blocks(inode, cur_lblock, nr_pblocks);
if (ret) {
if (ret == -ENOENT)
ret = -EINVAL;
goto out;
}
if (!last_extent)
goto retry;
}
When the first extent is unaligned and roundup(nr_pblocks, blks_per_sec)
exceeds sis->max, we subtract blks_per_sec resulting in nr_pblocks = 0. The
code then incorrectly assumes this is the last extent, sets nr_pblocks =
last_lblock - cur_lblock (entire swapfile), and performs migration. After
migration, it doesn't retry mapping, so subsequent extents are never processed.
"
In order to fix this issue, we need to lookup block mapping info after
we migrate all blocks in the tail of swapfile.
Cc: stable@kernel.org
Fixes: 9703d69d9d ("f2fs: support file pinning for zoned devices")
Cc: Daeho Jeong <daehojeong@google.com>
Reported-and-tested-by: Xiaolong Guo <guoxiaolong2008@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220951
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For consecutive large hole mapping across {d,id,did}nodes , we don't
need to call f2fs_map_blocks() to check one hole block per one time,
instead, we can use map.m_next_pgofs as a hint of next potential valid
block, so that we can skip calling f2fs_map_blocks the range of
[cur_pgofs + 1, .m_next_pgofs).
1) regular case
touch /mnt/f2fs/file
truncate -s $((1024*1024*1024)) /mnt/f2fs/file
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024
Before:
real 0m0.706s
user 0m0.000s
sys 0m0.706s
After:
real 0m0.620s
user 0m0.008s
sys 0m0.611s
2) large folio case
touch /mnt/f2fs/file
truncate -s $((1024*1024*1024)) /mnt/f2fs/file
f2fs_io setflags immutable /mnt/f2fs/file
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/mnt/f2fs/file of=/dev/null bs=1M count=1024
Before:
real 0m0.438s
user 0m0.004s
sys 0m0.433s
After:
real 0m0.368s
user 0m0.004s
sys 0m0.364s
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_read_data_large_folio(), the block zeroing path calls
folio_zero_range() and then continues the loop. However, it fails to
advance index and offset before continuing.
This can cause the loop to repeatedly process the same subpage of the
folio, leading to stalls/hangs and incorrect progress when reading large
folios with holes/zeroed blocks.
Fix it by advancing index and offset unconditionally in the loop
iteration, so they are updated even when the zeroing path continues.
Signed-off-by: Nanzhe Zhao <nzzhao@126.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_read_data_large_folio() can build a single read BIO across multiple
folios during readahead. If a folio ends up having none of its subpages
added to the BIO (e.g. all subpages are zeroed / treated as holes), it
will never be seen by f2fs_finish_read_bio(), so folio_end_read() is
never called. This leaves the folio locked and not marked uptodate.
Track whether the current folio has been added to a BIO via a local
'folio_in_bio' bool flag, and when iterating readahead folios, explicitly
mark the folio uptodate (on success) and unlock it when nothing was added.
Signed-off-by: Nanzhe Zhao <nzzhao@126.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the second call to f2fs_map_blocks within f2fs_read_data_large_folio,
map.m_len exceeds the logical address space to be read. This patch
ensures map.m_len does not exceed the required address space.
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-----------[ cut here ]------------
kernel BUG at fs/f2fs/data.c:358!
Call Trace:
<IRQ>
blk_update_request+0x5eb/0xe70 block/blk-mq.c:987
blk_mq_end_request+0x3e/0x70 block/blk-mq.c:1149
blk_complete_reqs block/blk-mq.c:1224 [inline]
blk_done_softirq+0x107/0x160 block/blk-mq.c:1229
handle_softirqs+0x283/0x870 kernel/softirq.c:579
__do_softirq kernel/softirq.c:613 [inline]
invoke_softirq kernel/softirq.c:453 [inline]
__irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:680
irq_exit_rcu+0x9/0x30 kernel/softirq.c:696
instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1050 [inline]
sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1050
</IRQ>
In f2fs_write_end_io(), it detects there is inconsistency in between
node page index (nid) and footer.nid of node page.
If footer of node page is corrupted in fuzzed image, then we load corrupted
node page w/ async method, e.g. f2fs_ra_node_pages() or f2fs_ra_node_page(),
in where we won't do sanity check on node footer, once node page becomes
dirty, we will encounter this bug after node page writeback.
Cc: stable@kernel.org
Reported-by: syzbot+803dd716c4310d16ff3a@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=803dd716c4310d16ff3a
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add node footer sanity check during node folio's writeback, if sanity
check fails, let's shutdown filesystem to avoid looping to redirty
and writeback in .writepages.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently, F2FS requires the packed_ssa feature to be enabled when
utilizing non-4KB block sizes (e.g., 16KB). This restriction limits
the flexibility of filesystem formatting options.
This patch allows F2FS to support non-4KB block sizes even when the
packed_ssa feature is disabled. It adjusts the SSA calculation logic to
correctly handle summary entries in larger blocks without the packed
layout.
Cc: stable@kernel.org
Fixes: 7ee8bc3942 ("f2fs: revert summary entry count from 2048 to 512 in 16kb block support")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
__blkdev_issue_discard() in __submit_discard_cmd() will never fail, so
let's make FAULT_DISCARD fault injection obsolete.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As syzbot reported an use-after-free issue in f2fs_write_end_io().
It is caused by below race condition:
loop device umount
- worker_thread
- loop_process_work
- do_req_filebacked
- lo_rw_aio
- lo_rw_aio_complete
- blk_mq_end_request
- blk_update_request
- f2fs_write_end_io
- dec_page_count
- folio_end_writeback
- kill_f2fs_super
- kill_block_super
- f2fs_put_super
: free(sbi)
: get_pages(, F2FS_WB_CP_DATA)
accessed sbi which is freed
In kill_f2fs_super(), we will drop all page caches of f2fs inodes before
call free(sbi), it guarantee that all folios should end its writeback, so
it should be safe to access sbi before last folio_end_writeback().
Let's relocate ckpt thread wakeup flow before folio_end_writeback() to
resolve this issue.
Cc: stable@kernel.org
Fixes: e234088758 ("f2fs: avoid wait if IO end up when do_checkpoint for better performance")
Reported-by: syzbot+b4444e3c972a7a124187@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b4444e3c972a7a124187
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Sysfs entry name is gc_pin_file_thresh instead of gc_pin_file_threshold,
fix it.
Cc: stable@kernel.org
Fixes: c521a6ab4a ("f2fs: fix to limit gc_pin_file_threshold")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
During SPO tests, when mounting F2FS, an -EINVAL error was returned from
f2fs_recover_inode_page. The issue occurred under the following scenario
Thread A Thread B
f2fs_ioc_commit_atomic_write
- f2fs_do_sync_file // atomic = true
- f2fs_fsync_node_pages
: last_folio = inode folio
: schedule before folio_lock(last_folio) f2fs_write_checkpoint
- block_operations// writeback last_folio
- schedule before f2fs_flush_nat_entries
: set_fsync_mark(last_folio, 1)
: set_dentry_mark(last_folio, 1)
: folio_mark_dirty(last_folio)
- __write_node_folio(last_folio)
: f2fs_down_read(&sbi->node_write)//block
- f2fs_flush_nat_entries
: {struct nat_entry}->flag |= BIT(IS_CHECKPOINTED)
- unblock_operations
: f2fs_up_write(&sbi->node_write)
f2fs_write_checkpoint//return
: f2fs_do_write_node_page()
f2fs_ioc_commit_atomic_write//return
SPO
Thread A calls f2fs_need_dentry_mark(sbi, ino), and the last_folio has
already been written once. However, the {struct nat_entry}->flag did not
have the IS_CHECKPOINTED set, causing set_dentry_mark(last_folio, 1) and
write last_folio again after Thread B finishes f2fs_write_checkpoint.
After SPO and reboot, it was detected that {struct node_info}->blk_addr
was not NULL_ADDR because Thread B successfully write the checkpoint.
This issue only occurs in atomic write scenarios. For regular file
fsync operations, the folio must be dirty. If
block_operations->f2fs_sync_node_pages successfully submit the folio
write, this path will not be executed. Otherwise, the
f2fs_write_checkpoint will need to wait for the folio write submission
to complete, as sbi->nr_pages[F2FS_DIRTY_NODES] > 0. Therefore, the
situation where f2fs_need_dentry_mark checks that the {struct
nat_entry}->flag /wo the IS_CHECKPOINTED flag, but the folio write has
already been submitted, will not occur.
Therefore, for atomic file fsync, sbi->node_write should be acquired
through __write_node_folio to ensure that the IS_CHECKPOINTED flag
correctly indicates that the checkpoint write has been completed.
Fixes: 608514deba ("f2fs: set fsync mark only for the last dnode")
Cc: stable@kernel.org
Signed-off-by: Sheng Yong <shengyong1@xiaomi.com>
Signed-off-by: Jinbao Liu <liujinbao1@xiaomi.com>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
{struct file_ra_state}->ra_pages and {struct bio}->bi_iter.bi_size is
defined as unsigned int, so values of seq_file_ra_mul and max_io_bytes
exceeding UINT_MAX are meaningless.
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Some f2fs sysfs attributes suffer from out-of-bounds memory access and
incorrect handling of integer values whose size is not 4 bytes.
For example:
vm:~# echo 65537 > /sys/fs/f2fs/vde/carve_out
vm:~# cat /sys/fs/f2fs/vde/carve_out
65537
vm:~# echo 4294967297 > /sys/fs/f2fs/vde/atgc_age_threshold
vm:~# cat /sys/fs/f2fs/vde/atgc_age_threshold
1
carve_out maps to {struct f2fs_sb_info}->carve_out, which is a 8-bit
integer. However, the sysfs interface allows setting it to a value
larger than 255, resulting in an out-of-range update.
atgc_age_threshold maps to {struct atgc_management}->age_threshold,
which is a 64-bit integer, but its sysfs interface cannot correctly set
values larger than UINT_MAX.
The root causes are:
1. __sbi_store() treats all default values as unsigned int, which
prevents updating integers larger than 4 bytes and causes out-of-bounds
writes for integers smaller than 4 bytes.
2. f2fs_sbi_show() also assumes all default values are unsigned int,
leading to out-of-bounds reads and incorrect access to integers larger
than 4 bytes.
This patch introduces {struct f2fs_attr}->size to record the actual size
of the integer associated with each sysfs attribute. With this
information, sysfs read and write operations can correctly access and
update values according to their real data size, avoiding memory
corruption and truncation.
Fixes: b59d0bae6c ("f2fs: add sysfs support for controlling the gc_thread")
Cc: stable@kernel.org
Signed-off-by: Jinbao Liu <liujinbao1@xiaomi.com>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_read_data_large_folio(), read_pages_pending is incremented only
after the subpage has been added to the BIO. With a heavily fragmented
file, each new subpage can force submission of the previous BIO.
If the BIO completes quickly, f2fs_finish_read_bio() may decrement
read_pages_pending to zero and call folio_end_read() while the read loop
is still processing other subpages of the same large folio.
Fix the ordering by incrementing read_pages_pending before any possible
BIO submission for the current subpage, matching the iomap ordering and
preventing premature folio_end_read().
Signed-off-by: Nanzhe Zhao <nzzhao@126.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_folio_state is attached to folio->private and is expected to start
with read_pages_pending == 0. However, the structure was allocated from
ffs_entry_slab without being fully initialized, which can leave
read_pages_pending with stale values.
Allocate the object with __GFP_ZERO so all fields are reliably zeroed at
creation time.
Signed-off-by: Nanzhe Zhao <nzzhao@126.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a new sysfs node in /sys/fs/f2fs/<disk>/inject_lock_timeout,
it relies on CONFIG_F2FS_FAULT_INJECTION kernel config.
It can be used to simulate different type of timeout in lock duration.
========== ===============================
Flag_Value Flag_Description
========== ===============================
0x00000000 No timeout (default)
0x00000001 Simulate running time
0x00000002 Simulate IO type sleep time
0x00000003 Simulate Non-IO type sleep time
0x00000004 Simulate runnable time
========== ===============================
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduce a new fault type FAULT_LOCK_TIMEOUT, it can
be used to inject timeout into lock duration.
Timeout type can be set via /sys/fs/f2fs/<disk>/inject_timeout_type
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Sometimes, f2fs_io_schedule_timeout_killable(HZ) may delay for about 2
seconds, this is because __f2fs_schedule_timeout(DEFAULT_SCHEDULE_TIMEOUT)
may delay for about 2 * DEFAULT_SCHEDULE_TIMEOUT due to its precision, but
we only account the delay as DEFAULT_SCHEDULE_TIMEOUT as below, fix it.
f2fs_io_schedule_timeout_killable()
..
timeout -= DEFAULT_SCHEDULE_TIMEOUT;
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Due to timeout parameter in {io,}_schedule_timeout() is based on jiffies
unit precision. It will lose precision when using msecs_to_jiffies(x)
for conversion.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use f2fs_{down,up}_{read,write}_trace for io_rwsem to trace lock elapsed time.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use f2fs_{down,up}_write_trace for cp_global_sem to trace lock elapsed time.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use f2fs_{down,up}_write_trace for gc_lock to trace lock elapsed time.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use f2fs_{down,up}_read_trace for node_write to trace lock elapsed time.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use f2fs_{down,up}_read_trace for node_change to trace lock elapsed time.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use f2fs_{down,up}_read_trace for cp_rwsem to trace lock elapsed time.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch add a new sysfs node in /sys/fs/f2fs/<device>/max_lock_elapsed_time.
This is a threshold, once a thread enters critical region that lock covers,
total elapsed time exceeds this threshold, f2fs will print tracepoint to dump
information of related context. This sysfs entry can be used to control the
value of threshold, by default, the value is 500 ms.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds lock elapsed time trace facility for f2fs rwsemphore.
If total elapsed time of critical region covered by lock exceeds a
threshold, it will print tracepoint to dump information of lock related
context, including:
- thread information
- CPU/IO priority
- lock information
- elapsed time
- total time
- running time (depend on CONFIG_64BIT)
- runnable time (depend on CONFIG_SCHED_INFO and CONFIG_SCHEDSTATS)
- io sleep time (depend on CONFIG_TASK_DELAY_ACCT and
/proc/sys/kernel/task_delayacct)
- other time (by default other time will account nonio sleep time,
but, if above kconfig is not defined, other time will
include runnable time and/or io sleep time as wll)
output:
f2fs_lock_elapsed_time: dev = (254,52), comm: sh, pid: 13855, prio: 120, ioprio_class: 2, ioprio_data: 4, lock_name: cp_rwsem, lock_type: rlock, total: 1000, running: 993, runnable: 2, io_sleep: 0, other: 5
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
During the garbage collection process, F2FS submits readahead I/Os for
valid blocks. However, since the GC loop runs within a single plug scope
without intermediate flushing, these readahead I/Os often accumulate in
the block layer's plug list instead of being dispatched to the device
immediately.
Consequently, when the GC thread attempts to lock the page later, the
I/O might not have completed (or even started), leading to a "read try
and wait" scenario. This negates the benefit of readahead and causes
unnecessary delays in GC latency.
This patch addresses this issue by introducing an intermediate
blk_finish_plug() and blk_start_plug() pair within the GC loop. This
forces the dispatch of pending I/Os, ensuring that readahead pages are
fetched in time, thereby reducing GC latency.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The commit 8a2d9f00d has been updated to set its default value to "rt,3",
fixing the outdated default value in the F2FS documentation.
Signed-off-by: ZhaoYueNan <amktiao030215@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
During data movement, move_data_block acquires file folio without
triggering a file read. Such folio are typically not uptodate, they need
to be removed from the page cache after gc complete. This patch marks
folio with the PG_dropbehind flag and uses folio_end_dropbehind to
remove folio from the page cache.
Signed-off-by: Yunlei He <heyunlei@xiaomi.com>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_folio_wait_writeback ensures the folio write is submitted to the
block layer via __submit_merged_write_cond, then waits for the folio to
complete. Other I/O submissions are irrelevant to
f2fs_folio_wait_writeback. Thus, if the folio write bio is already
submitted, the function can return immediately. This patch adds a
writeback parameter to __submit_merged_write_cond(), which signals an
immediate return after submitting the target folio, and waitting
writeback can use this parameter.
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The force parameter in __submit_merged_write_cond is redundant, where
`force == true` implies `inode == NULL && folio == NULL && ino == 0` is
true, and `force == false` implies `inode != NULL || folio != NULL ||
ino != 0` is true. Thus, this patch replaces the force parameter with
a stack variable force.
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Zhiguo reported, nat entry of quota inode could be corrupted:
"ino/block_addr=NULL_ADDR in nid=4 entry"
We'd better to do sanity check on quota inode to detect and record
nat.blk_addr inconsistency, so that we can have a chance to repair
it w/ later fsck.
Reported-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
1. qf_inum has been got and checked in its caller f2fs_enable_quotas
2. f2fs_sb_has_quota_ino has bee checked in its all callers
3. use sbi cleanup F2FS_SB(sb)
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The check for enough sections in segment.h has the following issues:
1. has_not_enough_free_secs() should return "enough secs" when "free_secs
>= upper_secs", not just structly greater. Conversely, it should only
return "not enough secs" when "free_secs < lower_secs", not when they are
equal. This accounts for the possibility that blocks can fit within
curseg without requiring an additional free section.
2. __get_secs_required() currently separates the needed space to section
and block parts, checking them against free sections and curseg,
respectively. This does not consider the case where curseg cannot hold
the whole block part, but excess free sections beyond the section part
can accommodate some of the block part.
3. has_curseg_enough_space() only checks CURSEG_HOT_DATA for dentry
blocks, but when active_logs=6, they may be placed in WARM and COLD
sections. Also, the current logic does not consider that dentry and data
blocks can be put in the same section when active_logs=2 or 6.
This patch modifies the three functions to address the above issues:
1. Rename has_curseg_enough_space() to get_additional_blocks_required().
Calculate the minimum node, dentry, and data blocks curseg can
accommodate. Then subtract these from the total required blocks of
respective type to determine the worst-case number of blocks that must
be placed in free sections.
2. In __get_secs_required(), get the number of blocks needing new
sections from the new get_additional_blocks_required(). Return the upper
bound of necessary free sections for these blocks. For active_logs=2 or
6, dentry blocks are combined with data blocks.
3. In has_not_enough_free_secs(), get the required sections from
__get_secs_required(), and return “not enough secs” if “free_secs <
required_secs”.
Signed-off-by: Joanne Chang <joannechien@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch enables large folio for limited case where we can get the high-order
memory allocation. It supports the encrypted and fsverity files, which are
essential for Android environment.
How to test:
- dd if=/dev/zero of=/mnt/test/test bs=1G count=4
- f2fs_io setflags immutable /mnt/test/test
- echo 3 > /proc/sys/vm/drop_caches
: to reload inode with large folio
- f2fs_io read 32 0 1024 mmap 0 0 /mnt/test/test
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>