Remove bcachefs core code

bcachefs was marked 'externally maintained' in 6.17 but the code
remained to make the transition smoother.

It's now a DKMS module, making the in-kernel code stale, so remove
it to avoid any version confusion.

Link: https://lore.kernel.org/linux-bcachefs/yokpt2d2g2lluyomtqrdvmkl3amv3kgnipmenobkpgx537kay7@xgcgjviv3n7x/T/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds 2025-09-29 13:43:52 -07:00
parent ee916dccd4
commit f2c61db29f
284 changed files with 0 additions and 117483 deletions

@ -1,186 +0,0 @@
.. SPDX-License-Identifier: GPL-2.0
bcachefs coding style
=====================
Good development is like gardening, and codebases are our gardens. Tend to them
every day; look for little things that are out of place or in need of tidying.
A little weeding here and there goes a long way; don't wait until things have
spiraled out of control.
Things don't always have to be perfect - nitpicking often does more harm than
good. But appreciate beauty when you see it - and let people know.
The code that you are afraid to touch is the code most in need of refactoring.
A little organizing here and there goes a long way.
Put real thought into how you organize things.
Good code is readable code, where the structure is simple and leaves nowhere
for bugs to hide.
Assertions are one of our most important tools for writing reliable code. If in
the course of writing a patchset you encounter a condition that shouldn't
happen (and will have unpredictable or undefined behaviour if it does), or
you're not sure if it can happen and not sure how to handle it yet - make it a
BUG_ON(). Don't leave undefined or unspecified behavior lurking in the codebase.
By the time you finish the patchset, you should understand better which
assertions need to be handled and turned into checks with error paths, and
which should be logically impossible. Leave the BUG_ON()s in for the ones which
are logically impossible. (Or, make them debug mode assertions if they're
expensive - but don't turn everything into a debug mode assertion, so that
we're not stuck debugging undefined behaviour should it turn out that you were
wrong).
Assertions are documentation that can't go out of date. Good assertions are
wonderful.
Good assertions drastically and dramatically reduce the amount of testing
required to shake out bugs.
Good assertions are based on state, not logic. To write good assertions, you
have to think about what the invariants on your state are.
Good invariants and assertions will hold everywhere in your codebase. This
means that you can run them in only a few places in the checked in version, but
should you need to debug something that caused the assertion to fail, you can
quickly shotgun them everywhere to find the codepath that broke the invariant.
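As a rough illustration of asserting on state rather than logic, here is a minimal userspace C sketch. The `struct ring` type and its helpers are hypothetical and are not bcachefs code, and `assert()` stands in for the kernel's `BUG_ON()`. The single invariant checker depends only on the object's state, so it is valid on any code path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-size ring buffer, invented for this example. */
#define RING_SIZE 8u

struct ring {
	unsigned head;		/* next slot to write; always < RING_SIZE */
	unsigned tail;		/* next slot to read; always < RING_SIZE */
	unsigned count;		/* number of occupied slots */
	int	 data[RING_SIZE];
};

/* Checks state only, so it holds before and after every operation. */
static void ring_check_invariants(const struct ring *r)
{
	assert(r->head < RING_SIZE);
	assert(r->tail < RING_SIZE);
	assert(r->count <= RING_SIZE);
	assert((r->tail + r->count) % RING_SIZE == r->head);
}

static int ring_push(struct ring *r, int v)
{
	ring_check_invariants(r);
	if (r->count == RING_SIZE)
		return -1;	/* full is an expected condition: error path, not an assert */
	r->data[r->head] = v;
	r->head = (r->head + 1) % RING_SIZE;
	r->count++;
	ring_check_invariants(r);
	return 0;
}

static int ring_pop(struct ring *r, int *v)
{
	ring_check_invariants(r);
	if (!r->count)
		return -1;	/* empty is also an expected condition */
	*v = r->data[r->tail];
	r->tail = (r->tail + 1) % RING_SIZE;
	r->count--;
	ring_check_invariants(r);
	return 0;
}
```

Because `ring_check_invariants()` makes no assumptions about its caller, it can be dropped temporarily into any function that touches a `struct ring` when hunting for the code path that broke the invariant.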
A good assertion checks something that the compiler could check for us, and
elide - if we were working in a language with embedded correctness proofs that
the compiler could check. This is something that exists today, but it'll likely
still be a few decades before it comes to systems programming languages. But we
can still incorporate that kind of thinking into our code and document the
invariants with runtime checks - much like the way people working in
dynamically typed languages may add type annotations, gradually making their
code statically typed.
Looking for ways to make your assertions simpler - and higher level - will
often nudge you towards making the entire system simpler and more robust.
Good code is code where you can poke around and see what it's doing -
introspection. We can't debug anything if we can't see what's going on.
Whenever we're debugging, and the solution isn't immediately obvious, if the
issue is that we don't know where the issue is because we can't see what's
going on - fix that first.
We have the tools to make anything visible at runtime, efficiently - RCU and
percpu data structures among them. Don't let things stay hidden.
The most important tool for introspection is the humble pretty printer - in
bcachefs, this means `*_to_text()` functions, which output to printbufs.
Pretty printers are wonderful, because they compose and you can use them
everywhere. Having functions to print whatever object you're working with will
make your error messages much easier to write (therefore they will actually
exist) and much more informative. And they can be used from sysfs/debugfs, as
well as tracepoints.
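The composability described above might look roughly like this in userspace C. This is a sketch only: `struct textbuf` and `textbuf_printf()` are hypothetical stand-ins for printbufs, and the two object types are invented for the example; none of these names come from bcachefs.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a printbuf: a fixed buffer that truncates safely. */
struct textbuf {
	char	buf[128];
	size_t	pos;		/* always <= sizeof(buf) - 1 */
};

static void textbuf_printf(struct textbuf *out, const char *fmt, ...)
{
	size_t room = sizeof(out->buf) - out->pos;
	va_list args;
	int n;

	va_start(args, fmt);
	n = vsnprintf(out->buf + out->pos, room, fmt, args);
	va_end(args);

	if (n < 0)
		return;
	/* vsnprintf returns the untruncated length; advance by what was stored. */
	out->pos += (size_t) n < room ? (size_t) n : room - 1;
}

/* Invented object types for the example. */
struct extent {
	unsigned long long start, len;
};

struct file_info {
	unsigned	inum;
	struct extent	e;
};

static void extent_to_text(struct textbuf *out, const struct extent *e)
{
	textbuf_printf(out, "%llu+%llu", e->start, e->len);
}

/* The outer printer reuses the inner one instead of duplicating its format. */
static void file_info_to_text(struct textbuf *out, const struct file_info *f)
{
	textbuf_printf(out, "inum %u extent ", f->inum);
	extent_to_text(out, &f->e);
}
```

The same `file_info_to_text()` call could feed an error message, a debug file, or a trace event, which is the property that makes pretty printers cheap to reuse.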
Runtime info and debugging tools should come with clear descriptions and
labels, and good structure - we don't want files with a list of bare integers,
like in procfs. Part of the job of the debugging tools is to educate users and
new developers as to how the system works.
Error messages should, whenever possible, tell you everything you need to debug
the issue. It's worth putting effort into them.
Tracepoints shouldn't be the first thing you reach for. They're an important
tool, but always look for more immediate ways to make things visible. When we
have to rely on tracing, we have to know which tracepoints we're looking for,
and then we have to run the troublesome workload, and then we have to sift
through logs. This is a lot of steps to go through when a user is hitting
something, and if it's intermittent it may not even be possible.
The humble counter is an incredibly useful tool. They're cheap and simple to
use, and many complicated internal operations with lots of things that can
behave weirdly (anything involving memory reclaim, for example) become
shockingly easy to debug once you have counters on every distinct codepath.
Persistent counters are even better.
When debugging, try to get the most out of every bug you come across; don't
rush to fix the initial issue. Look for things that will make related bugs
easier the next time around - introspection, new assertions, better error
messages, new debug tools, and do those first. Look for ways to make the system
better behaved; often one bug will uncover several other bugs through
downstream effects.
Fix all that first, and then the original bug last - even if that means keeping
a user waiting. They'll thank you in the long run, and when they understand
what you're doing you'll be amazed at how patient they're happy to be. Users
like to help - otherwise they wouldn't be reporting the bug in the first place.
Talk to your users. Don't isolate yourself.
Users notice all sorts of interesting things, and by just talking to them and
interacting with them you can benefit from their experience.
Spend time doing support and helpdesk stuff. Don't just write code - code isn't
finished until it's being used trouble free.
This will also motivate you to make your debugging tools as good as possible,
and perhaps even your documentation, too. Like anything else in life, the more
time you spend at it the better you'll get, and you the developer are the
person most able to improve the tools to make debugging quick and easy.
Be wary of how you take on and commit to big projects. Don't let development
become product-manager focused. Oftentimes an idea is a good one but needs to
wait for its proper time - but you won't know if it's the proper time for an
idea until you start writing code.
Expect to throw a lot of things away, or leave them half finished for later.
Nobody writes all perfect code that all gets shipped, and you'll be much more
productive in the long run if you notice this early and shift to something
else. The experience gained and lessons learned will be valuable for all the
other work you do.
But don't be afraid to tackle projects that require significant rework of
existing code. Sometimes these can be the best projects, because they can lead
us to make existing code more general, more flexible, more multipurpose and
perhaps more robust. Just don't hesitate to abandon the idea if it looks like
it's going to make a mess of things.
Complicated features can often be done as a series of refactorings, with the
final change that actually implements the feature as a quite small patch at the
end. It's wonderful when this happens, especially when those refactorings are
things that improve the codebase in their own right. When that happens there's
much less risk of wasted effort if the feature you were going for doesn't work
out.
Always strive to work incrementally. Always strive to turn the big projects
into little bite sized projects that can prove their own merits.
Instead of always tackling those big projects, look for little things that
will be useful, and make the big projects easier.
The question of what's likely to be useful is where junior developers most
often go astray - doing something because it seems like it'll be useful often
leads to overengineering. Knowing what's useful comes from many years of
experience, or talking with people who have that experience - or from simply
reading lots of code and looking for common patterns and issues. Don't be
afraid to throw things away and do something simpler.
Talk about your ideas with your fellow developers; oftentimes the best things
come from relaxed conversations where people aren't afraid to say "what if?".
Don't neglect your tools.
The most important tools (besides the compiler and our text editor) are the
tools we use for testing. The shortest possible edit/test/debug cycle is
essential for working productively. We learn, gain experience, and discover the
errors in our thinking by running our code and seeing what happens. If your
time is being wasted because your tools are bad or too slow - don't accept it,
fix it.
Put effort into your documentation, commit messages, and code comments - but
don't go overboard. A good commit message is wonderful - but if the information
was important enough to go in a commit message, ask yourself if it would be
even better as a code comment.
A good code comment is wonderful, but even better is the comment that didn't
need to exist because the code was so straightforward as to be obvious;
organized into small clean and tidy modules, with clear and descriptive names
for functions and variables, where every line of code has a clear purpose.

@ -1,105 +0,0 @@
Submitting patches to bcachefs
==============================
Here are suggestions for submitting patches to the bcachefs subsystem.
Submission checklist
--------------------
Patches must be tested before being submitted, either with the xfstests suite
[0]_, or the full bcachefs test suite in ktest [1]_, depending on what's being
touched. Note that ktest wraps xfstests and will be an easier way to run
it for most users; it includes single-command wrappers for all the mainstream
in-kernel local filesystems.
Patches will undergo more testing after being merged (including
lockdep/kasan/preempt/etc. variants); these are not generally required to be
run by the submitter - but do put some thought into what you're changing and
which tests might be relevant. If you're dealing with tricky memory layout
work, run kasan; if you're doing locking work, run lockdep. ktest includes
single-command variants for the debug build types you'll most likely need.
The exception to this rule is incomplete WIP/RFC patches: if you're working on
something nontrivial, it's encouraged to send out a WIP patch to let people
know what you're doing and make sure you're on the right track. Just make sure
it includes a brief note as to what's done and what's incomplete, to avoid
confusion.
Rigorous checkpatch.pl adherence is not required (many of its warnings are
considered out of date), but try not to deviate too much without reason.
Focus on writing code that reads well and is organized well; code should be
aesthetically pleasing.
CI
--
Instead of running your tests locally, when running the full test suite it's
preferable to let a server farm do it in parallel, and then have the results
in a nice test dashboard (which can tell you which failures are new, and
presents results in a git log view, avoiding the need for most bisecting).
That exists [2]_, and community members may request an account. If you work for
a big tech company, you'll need to help out with server costs to get access -
but the CI is not restricted to running bcachefs tests: it runs any ktest test
(which generally makes it easy to wrap other tests that can run in qemu).
Other things to think about
---------------------------
- How will we debug this code? Is there sufficient introspection to diagnose
when something starts acting wonky on a user machine?
We don't necessarily need every single field of every data structure visible
with introspection, but having the important fields of all the core data
types wired up makes debugging drastically easier - a bit of thoughtful
foresight greatly reduces the need to have people build custom kernels with
debug patches.
More broadly, think about all the debug tooling that might be needed.
- Does it make the codebase more or less of a mess? Can we also try to do some
organizing, too?
- Do new tests need to be written? New assertions? How do we know and verify
that the code is correct, and what happens if something goes wrong?
We don't yet have automated code coverage analysis or easy fault injection -
but for now, pretend we did and ask what they might tell us.
Assertions are hugely important, given that we don't yet have a systems
language that can do ergonomic embedded correctness proofs. Hitting an assert
in testing is much better than wandering off into undefined behaviour la-la
land - use them. Use them judiciously, and not as a replacement for proper
error handling, but use them.
- Does it need to be performance tested? Should we add new performance counters?
bcachefs has a set of persistent runtime counters which can be viewed with
the 'bcachefs fs top' command; this should give users a basic idea of what
their filesystem is currently doing. If you're doing a new feature or looking
at old code, think if anything should be added.
- If it's a new on disk format feature - have upgrades and downgrades been
tested? (Automated tests exist but aren't in the CI, due to the hassle of
disk image management; coordinate to have them run.)
Mailing list, IRC
-----------------
Patches should hit the list [3]_, but much discussion and code review happens
on IRC as well [4]_; many people appreciate the more conversational approach
and quicker feedback.
Additionally, we have a lively user community doing excellent QA work, which
exists primarily on IRC. Please make use of that resource; user feedback is
important for any nontrivial feature, and documenting it in commit messages
would be a good idea.
.. rubric:: References
.. [0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
.. [1] https://evilpiepirate.org/git/ktest.git/
.. [2] https://evilpiepirate.org/~testdashboard/ci/
.. [3] linux-bcachefs@vger.kernel.org
.. [4] irc.oftc.net#bcache, #bcachefs-dev

@ -1,108 +0,0 @@
.. SPDX-License-Identifier: GPL-2.0
Casefolding
===========
bcachefs has support for case-insensitive file and directory
lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
casefolding attributes.
The main usecase for casefolding is compatibility with software written
against other filesystems that rely on casefolded lookups
(e.g. NTFS and Wine/Proton).
Taking advantage of file-system level casefolding can lead to great
loading time gains in many applications and games.
Casefolding support requires a kernel with `CONFIG_UNICODE` enabled.
Once a directory has been flagged for casefolding, a feature bit
is enabled on the superblock which marks the filesystem as using
casefolding.
When the feature bit for casefolding is enabled, it is no longer possible
to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
On the lookup/query side: casefolding is implemented by allocating a new
string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
casefold the query string.
On the dirent side: casefolding is implemented by ensuring the `bkey`'s
hash is made from the casefolded string and storing the cached casefolded
name with the regular name in the dirent.
The structure looks like this:
* Regular: [dirent data][regular name][nul][nul]...
* Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
(Do note, the number of NULs here is merely for illustration; their count can
vary per-key, and they may not even be present if the key is aligned to
`sizeof(u64)`.)
This is efficient as it means that for all file lookups that require casefolding,
it has identical performance to a regular lookup:
a hash comparison and a `memcmp` of the name.
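To make the casefolded layout above concrete, here is a small userspace C sketch that packs and queries such a name block. Several things are assumptions for illustration: the two lengths are written as little-endian 16-bit values, `SKETCH_NAME_MAX` stands in for `BCH_NAME_MAX`, and an ASCII-only `tolower()` loop stands in for `utf8_casefold`. The hash comparison and trailing NUL padding are left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NAME_MAX 64	/* hypothetical; stands in for BCH_NAME_MAX */

/* ASCII-only stand-in for utf8_casefold(); real casefolding is Unicode-aware. */
static size_t ascii_casefold(const char *src, size_t len, char *dst)
{
	for (size_t i = 0; i < len; i++)
		dst[i] = (char) tolower((unsigned char) src[i]);
	return len;
}

/* Pack [reg len][cf len][regular name][casefolded name]; returns bytes used. */
static size_t cf_block_pack(uint8_t *block, const char *name, size_t len)
{
	char cf[SKETCH_NAME_MAX];
	size_t cf_len;

	assert(len <= SKETCH_NAME_MAX);
	cf_len = ascii_casefold(name, len, cf);

	/* Assumed encoding: two little-endian u16 lengths in the first 4 bytes. */
	block[0] = (uint8_t) (len & 0xff);
	block[1] = (uint8_t) (len >> 8);
	block[2] = (uint8_t) (cf_len & 0xff);
	block[3] = (uint8_t) (cf_len >> 8);
	memcpy(block + 4, name, len);
	memcpy(block + 4 + len, cf, cf_len);
	return 4 + len + cf_len;
}

/* Lookup side: fold the query once, then memcmp against the cached folded name. */
static int cf_block_matches(const uint8_t *block, const char *query, size_t qlen)
{
	char folded[SKETCH_NAME_MAX];
	size_t reg_len = (size_t) block[0] | ((size_t) block[1] << 8);
	size_t cf_len  = (size_t) block[2] | ((size_t) block[3] << 8);
	size_t folded_len;

	if (qlen > SKETCH_NAME_MAX)
		return 0;
	folded_len = ascii_casefold(query, qlen, folded);

	return folded_len == cf_len &&
	       !memcmp(block + 4 + reg_len, folded, folded_len);
}
```

The stored name is never folded again at lookup time; only the query string is, which is why the comparison cost matches a regular lookup.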
Rationale
---------
Several designs were considered for this system:
One was to introduce a dirent_v2, however that would be painful especially as
the hash system only has support for a single key type. This would also need
`BCH_NAME_MAX` to change between versions, and a new feature bit.
Another option was to store without the two lengths, and just take the length of
the regular name and casefolded name contiguously / 2 as the length. This would
assume that the regular length == casefolded length, but that could potentially
not be true, if the uppercase unicode glyph had a different UTF-8 encoded length than
the lowercase unicode glyph.
It would be possible to disregard the casefold cache for those cases, but it was
decided to simply encode the two string lengths in the key to avoid random
performance issues if this edgecase was ever hit.
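One concrete instance of the length mismatch discussed above is U+212A KELVIN SIGN, which Unicode case folding maps to a plain ASCII `k`. The short C sketch below only demonstrates the byte counts; `guessed_name_len()` is a hypothetical helper modelling the rejected "total / 2" layout, not a bcachefs function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* U+212A KELVIN SIGN, encoded in UTF-8 as three bytes. */
static const char kelvin_sign[] = "\xE2\x84\xAA";

/* Its Unicode case folding is U+006B LATIN SMALL LETTER K, a single byte. */
static const char kelvin_folded[] = "k";

/* The rejected layout would recover each name length as half the total. */
static size_t guessed_name_len(size_t reg_len, size_t cf_len)
{
	return (reg_len + cf_len) / 2;
}
```

With a regular name of 3 bytes and a folded name of 1 byte, the guess yields 2, which matches neither string; storing both lengths avoids the problem.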
The option settled on was to use a free-bit in d_type to mark a dirent as having
a casefold cache, and then treat the first 4 bytes of the name block as lengths.
You can see this in the `d_cf_name_block` member of the union in `bch_dirent`.
The feature bit was used to allow casefolding support to be enabled for the majority
of users, while still allowing users who have no need for the feature to use bcachefs:
`CONFIG_UNICODE` can increase the kernel size by a significant amount due to the tables used,
which may be the deciding factor in whether bcachefs can be used on e.g. embedded platforms.
Other filesystems like ext4 and f2fs have a super-block level option for casefolding
encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
any encodings other than a single UTF-8 version. When future encodings are desirable,
they will be added trivially using the opts mechanism.
dentry/dcache considerations
----------------------------
Currently, in casefolded directories, bcachefs (like other filesystems) will not cache
negative dentries.
This is because currently doing so presents a problem in the following scenario:
- Lookup file "blAH" in a casefolded directory
- Creation of file "BLAH" in a casefolded directory
- Lookup file "blAH" in a casefolded directory
This would fail if negative dentries were cached.
This is slightly suboptimal, but could be fixed in future with some vfs work.
References
----------
(from Peter Anvin, on the list)
It is worth noting that Microsoft has basically declared their
"recommended" case folding (upcase) table to be permanently frozen (for
new filesystem instances in the case where they use an on-disk
translation table created at format time.) As far as I know they have
never supported anything other than 1:1 conversion of BMP code points,
nor normalization.
The exFAT specification enumerates the full recommended upcase table,
although in a somewhat annoying format (basically a hex dump of
compressed data):
https://learn.microsoft.com/en-us/windows/win32/fileio/exfat-specification

@ -1,30 +0,0 @@
.. SPDX-License-Identifier: GPL-2.0
bcachefs private error codes
----------------------------
In bcachefs, as a hard rule we do not throw or directly use standard error
codes (-EINVAL, -EBUSY, etc.). Instead, we define private error codes as needed
in fs/bcachefs/errcode.h.
This gives us much better error messages and makes debugging much easier. Any
direct uses of standard error codes you see in the source code are simply old
code that has yet to be converted - feel free to clean it up!
Private error codes may subtype another error code; this allows for grouping of
related errors that should be handled similarly (e.g. transaction restart
errors), as well as specifying which standard error code should be returned at
the bcachefs module boundary.
At the module boundary, we use bch2_err_class() to convert to a standard error
code; this also emits a trace event so that the original error code can be
recovered even if it wasn't logged.
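The subtyping scheme can be sketched in a few lines of userspace C. Everything here is hypothetical: the code names, the numbering, the parent assignments and the helper functions are invented for illustration (the real table lives in fs/bcachefs/errcode.h), positive values are used for simplicity where the kernel returns negative ones, and the trace event emitted by `bch2_err_class()` is not modelled.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical private codes, numbered above the standard errno range. */
enum {
	ERR_START = 2048,
	ERR_restart,			/* group: transaction must be retried */
	ERR_restart_lock_contended,	/* subtype of ERR_restart */
	ERR_journal_full,
	ERR_MAX,
};

/* Parent of each private code: another private code, or a standard errno. */
static const int err_parent[ERR_MAX - ERR_START] = {
	[ERR_restart - ERR_START]		 = EINTR,
	[ERR_restart_lock_contended - ERR_START] = ERR_restart,
	[ERR_journal_full - ERR_START]		 = ENOSPC,
};

/* True if err is class itself or a (transitive) subtype of it. */
static int err_matches(int err, int class)
{
	while (err > ERR_START && err != class)
		err = err_parent[err - ERR_START];
	return err == class;
}

/* Walk up the parents to the standard code returned at the module boundary. */
static int err_class(int err)
{
	while (err > ERR_START)
		err = err_parent[err - ERR_START];
	return err;
}
```

A caller that only cares about "some kind of restart" would test `err_matches(err, ERR_restart)`, while the log message can still name the specific subtype.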
Do not reuse error codes! Generally speaking, a private error code should only
be thrown in one place. That means that when we see it in a log message we can
see, unambiguously, exactly which file and line number it was returned from.
Try to give error codes names that are as reasonably descriptive of the error
as possible. Frequently, the error will be logged at a place far removed from
where the error was generated; good names for error codes mean much more
descriptive and useful error messages.

@ -1,78 +0,0 @@
Idle/background work classes design doc:
Right now, our behaviour at idle isn't ideal: it was designed for servers that
would be under sustained load, to keep pending work at a "medium" level, to
let work build up so we can process it in more efficient batches, while also
giving headroom for bursts in load.
But for desktops or mobile - scenarios where work is less sustained and power
usage is more important - we want to operate differently, with a "rush to
idle" so the system can go to sleep. We don't want to be dribbling out
background work while the system should be idle.
The complicating factor is that there are a number of background tasks, which
form a hierarchy (or a digraph, depending on how you divide it up) - one
background task may generate work for another.
Thus proper idle detection needs to model this hierarchy.
- Foreground writes
- Page cache writeback
- Copygc, rebalance
- Journal reclaim
When we implement idle detection and rush to idle, we need to be careful not
to disturb too much the existing behaviour that works reasonably well when the
system is under sustained load (or perhaps improve it in the case of
rebalance, which currently does not actively attempt to let work batch up).
SUSTAINED LOAD REGIME
---------------------
When the system is under continuous load, we want these jobs to run
continuously - this is perhaps best modelled with a P/D controller, where
they'll be trying to keep a target value (i.e. fragmented disk space,
available journal space) roughly in the middle of some range.
The goal under sustained load is to balance our ability to handle load spikes
without running out of x resource (free disk space, free space in the
journal), while also letting some work accumulate to be batched (or become
unnecessary).
For example, we don't want to run copygc too aggressively, because then it
will be evacuating buckets that would have become empty (been overwritten or
deleted) anyways, and we don't want to wait until we're almost out of free
space because then the system will behave unpredictably - suddenly we're doing
a lot more work to service each write and the system becomes much slower.
IDLE REGIME
-----------
When the system becomes idle, we should start flushing our pending work
quicker so the system can go to sleep.
Note that the definition of "idle" depends on where in the hierarchy a task
is - a task should start flushing work more quickly when the task above it has
stopped generating new work.
e.g. rebalance should start flushing more quickly when page cache writeback is
idle, and journal reclaim should only start flushing more quickly when both
copygc and rebalance are idle.
It's important to let work accumulate when more work is still incoming and we
still have room, because flushing is always more efficient if we let it batch
up. New writes may overwrite data before rebalance moves it, and tasks may be
generating more updates for the btree nodes that journal reclaim needs to flush.
On idle, how much work we do at each interval should be proportional to the
length of time we have been idle for. If we're idle only for a short duration,
we shouldn't flush everything right away; the system might wake up and start
generating new work soon, and flushing immediately might end up doing a lot of
work that would have been unnecessary if we'd allowed things to batch more.
To summarize, we will need:
- A list of classes for background tasks that generate work, which will
include one "foreground" class.
- Tracking for each class - "Am I doing work, or have I gone to sleep?"
- And each class should check the class above it when deciding how much work to issue.
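One possible shape for the three requirements summarized above is sketched below in userspace C. This is not an existing bcachefs interface: the class names mirror the hierarchy listed earlier, but the struct, the tick-based clock, the budget constants and the linear ramp are all assumptions chosen to keep the example small. It also simplifies the rule to "every class above must be idle".

```c
#include <assert.h>

/* Hypothetical work classes, ordered from the top of the hierarchy down. */
enum work_class {
	WORK_FOREGROUND,
	WORK_WRITEBACK,
	WORK_COPYGC_REBALANCE,
	WORK_JOURNAL_RECLAIM,
	WORK_CLASS_NR,
};

struct class_state {
	int	 busy;		/* "am I doing work, or have I gone to sleep?" */
	unsigned idle_since;	/* tick at which this class last went idle */
};

#define BASE_BUDGET	 1u	/* work units per interval under load (made up) */
#define MAX_BUDGET	64u	/* cap on the flush rate (made up) */

/* Ticks for which every class above c has been idle; 0 if any is still busy. */
static unsigned idle_ticks_above(const struct class_state *s,
				 enum work_class c, unsigned now)
{
	unsigned idle = 0;

	for (int i = 0; i < (int) c; i++) {
		unsigned t;

		if (s[i].busy)
			return 0;
		t = now - s[i].idle_since;
		if (i == 0 || t < idle)
			idle = t;	/* the most recently idle class decides */
	}
	return idle;
}

/* Work to issue this interval: grows with idle time, up to a cap. */
static unsigned work_budget(const struct class_state *s,
			    enum work_class c, unsigned now)
{
	unsigned idle = idle_ticks_above(s, c, now);

	return idle >= MAX_BUDGET - BASE_BUDGET ? MAX_BUDGET
						: BASE_BUDGET + idle;
}
```

A short idle period therefore adds only a little extra flushing, while a long one ramps up to the cap, matching the "proportional to the length of time we have been idle" goal.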

@ -1,38 +0,0 @@
.. SPDX-License-Identifier: GPL-2.0
======================
bcachefs Documentation
======================
Subsystem-specific development process notes
--------------------------------------------
Development notes specific to bcachefs. These are intended to supplement
:doc:`general kernel development handbook </process/index>`.
.. toctree::
:maxdepth: 1
:numbered:
CodingStyle
SubmittingPatches
Filesystem implementation
-------------------------
Documentation for filesystem features and their implementation details.
At this moment, only a few of these are described here.
.. toctree::
:maxdepth: 1
:numbered:
casefolding
errorcodes
Future design
-------------
.. toctree::
:maxdepth: 1
future/idle_work

@ -72,7 +72,6 @@ Documentation for filesystem implementations.
afs
autofs
autofs-mount-control
bcachefs/index
befs
bfs
btrfs

@ -4217,10 +4217,7 @@ M: Kent Overstreet <kent.overstreet@linux.dev>
L: linux-bcachefs@vger.kernel.org
S: Externally maintained
C: irc://irc.oftc.net/bcache
P: Documentation/filesystems/bcachefs/SubmittingPatches.rst
T: git https://evilpiepirate.org/git/bcachefs.git
F: fs/bcachefs/
F: Documentation/filesystems/bcachefs/
BDISP ST MEDIA DRIVER
M: Fabien Dessenne <fabien.dessenne@foss.st.com>

@ -454,7 +454,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@ -411,7 +411,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@ -431,7 +431,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@ -403,7 +403,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@ -413,7 +413,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@ -430,7 +430,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@ -517,7 +517,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@ -403,7 +403,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@ -404,7 +404,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@ -420,7 +420,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@ -401,7 +401,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@ -401,7 +401,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@ -658,9 +658,6 @@ CONFIG_BTRFS_FS_POSIX_ACL=y
CONFIG_BTRFS_DEBUG=y
CONFIG_BTRFS_ASSERT=y
CONFIG_NILFS2_FS=m
CONFIG_BCACHEFS_FS=y
CONFIG_BCACHEFS_QUOTA=y
CONFIG_BCACHEFS_POSIX_ACL=y
CONFIG_FS_DAX=y
CONFIG_EXPORTFS_BLOCK_OPS=y
CONFIG_FS_ENCRYPTION=y

@ -645,9 +645,6 @@ CONFIG_OCFS2_FS=m
CONFIG_BTRFS_FS=y
CONFIG_BTRFS_FS_POSIX_ACL=y
CONFIG_NILFS2_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_BCACHEFS_QUOTA=y
CONFIG_BCACHEFS_POSIX_ACL=y
CONFIG_FS_DAX=y
CONFIG_EXPORTFS_BLOCK_OPS=y
CONFIG_FS_ENCRYPTION=y

@ -51,7 +51,6 @@ source "fs/ocfs2/Kconfig"
source "fs/btrfs/Kconfig"
source "fs/nilfs2/Kconfig"
source "fs/f2fs/Kconfig"
source "fs/bcachefs/Kconfig"
source "fs/zonefs/Kconfig"
endif # BLOCK

@ -121,7 +121,6 @@ obj-$(CONFIG_OCFS2_FS) += ocfs2/
obj-$(CONFIG_BTRFS_FS) += btrfs/
obj-$(CONFIG_GFS2_FS) += gfs2/
obj-$(CONFIG_F2FS_FS) += f2fs/
obj-$(CONFIG_BCACHEFS_FS) += bcachefs/
obj-$(CONFIG_CEPH_FS) += ceph/
obj-$(CONFIG_PSTORE) += pstore/
obj-$(CONFIG_EFIVAR_FS) += efivarfs/

@ -1,121 +0,0 @@
config BCACHEFS_FS
tristate "bcachefs filesystem support (EXPERIMENTAL)"
depends on BLOCK
select EXPORTFS
select CLOSURES
select CRC32
select CRC64
select FS_POSIX_ACL
select LZ4_COMPRESS
select LZ4_DECOMPRESS
select LZ4HC_COMPRESS
select LZ4HC_DECOMPRESS
select ZLIB_DEFLATE
select ZLIB_INFLATE
select ZSTD_COMPRESS
select ZSTD_DECOMPRESS
select CRYPTO_LIB_SHA256
select CRYPTO_LIB_CHACHA
select CRYPTO_LIB_POLY1305
select KEYS
select RAID6_PQ
select XOR_BLOCKS
select XXHASH
select SRCU
select SYMBOLIC_ERRNAME
select MIN_HEAP
select XARRAY_MULTI
help
The bcachefs filesystem - a modern, copy on write filesystem, with
support for multiple devices, compression, checksumming, etc.
config BCACHEFS_QUOTA
bool "bcachefs quota support"
depends on BCACHEFS_FS
select QUOTACTL
config BCACHEFS_ERASURE_CODING
bool "bcachefs erasure coding (RAID5/6) support (EXPERIMENTAL)"
depends on BCACHEFS_FS
select QUOTACTL
help
This enables the "erasure_code" filesystem and inode option, which
organizes data into reed-solomon stripes instead of ordinary
replication.
WARNING: this feature is still undergoing on disk format changes, and
should only be enabled for testing purposes.
config BCACHEFS_POSIX_ACL
bool "bcachefs POSIX ACL support"
depends on BCACHEFS_FS
select FS_POSIX_ACL
config BCACHEFS_DEBUG
bool "bcachefs debugging"
depends on BCACHEFS_FS
help
Enables many extra debugging checks and assertions.
The resulting code will be significantly slower than normal; you
probably shouldn't select this option unless you're a developer.
config BCACHEFS_INJECT_TRANSACTION_RESTARTS
bool "Randomly inject transaction restarts"
depends on BCACHEFS_DEBUG
help
Randomly inject transaction restarts in a few core paths - may have a
significant performance penalty
config BCACHEFS_TESTS
bool "bcachefs unit and performance tests"
depends on BCACHEFS_FS
help
Include some unit and performance tests for the core btree code
config BCACHEFS_LOCK_TIME_STATS
bool "bcachefs lock time statistics"
depends on BCACHEFS_FS
help
Expose statistics for how long we held a lock in debugfs
config BCACHEFS_NO_LATENCY_ACCT
bool "disable latency accounting and time stats"
depends on BCACHEFS_FS
help
This disables device latency tracking and time stats, only for performance testing
config BCACHEFS_SIX_OPTIMISTIC_SPIN
bool "Optimistic spinning for six locks"
depends on BCACHEFS_FS
depends on SMP
default y
help
Instead of immediately sleeping when attempting to take a six lock that
is held by another thread, spin for a short while, as long as the
thread owning the lock is running.
config BCACHEFS_PATH_TRACEPOINTS
bool "Extra btree_path tracepoints"
depends on BCACHEFS_FS && TRACING
help
Enable extra tracepoints for debugging btree_path operations; we don't
normally want these enabled because they happen at very high rates.
config BCACHEFS_TRANS_KMALLOC_TRACE
bool "Trace bch2_trans_kmalloc() calls"
depends on BCACHEFS_FS
config BCACHEFS_ASYNC_OBJECT_LISTS
bool "Keep async objects on fast_lists for debugfs visibility"
depends on BCACHEFS_FS && DEBUG_FS
config MEAN_AND_VARIANCE_UNIT_TEST
tristate "mean_and_variance unit tests" if !KUNIT_ALL_TESTS
depends on KUNIT
depends on BCACHEFS_FS
default KUNIT_ALL_TESTS
help
This option enables the kunit tests for mean_and_variance module.
If unsure, say N.

@ -1,107 +0,0 @@
obj-$(CONFIG_BCACHEFS_FS) += bcachefs.o
bcachefs-y := \
acl.o \
alloc_background.o \
alloc_foreground.o \
backpointers.o \
bkey.o \
bkey_methods.o \
bkey_sort.o \
bset.o \
btree_cache.o \
btree_gc.o \
btree_io.o \
btree_iter.o \
btree_journal_iter.o \
btree_key_cache.o \
btree_locking.o \
btree_node_scan.o \
btree_trans_commit.o \
btree_update.o \
btree_update_interior.o \
btree_write_buffer.o \
buckets.o \
buckets_waiting_for_journal.o \
chardev.o \
checksum.o \
clock.o \
compress.o \
darray.o \
data_update.o \
debug.o \
dirent.o \
disk_accounting.o \
disk_groups.o \
ec.o \
enumerated_ref.o \
errcode.o \
error.o \
extents.o \
extent_update.o \
eytzinger.o \
fast_list.o \
fs.o \
fs-ioctl.o \
fs-io.o \
fs-io-buffered.o \
fs-io-direct.o \
fs-io-pagecache.o \
fsck.o \
inode.o \
io_read.o \
io_misc.o \
io_write.o \
journal.o \
journal_io.o \
journal_reclaim.o \
journal_sb.o \
journal_seq_blacklist.o \
keylist.o \
logged_ops.o \
lru.o \
mean_and_variance.o \
migrate.o \
move.o \
movinggc.o \
namei.o \
nocow_locking.o \
opts.o \
printbuf.o \
progress.o \
quota.o \
rebalance.o \
rcu_pending.o \
recovery.o \
recovery_passes.o \
reflink.o \
replicas.o \
sb-clean.o \
sb-counters.o \
sb-downgrade.o \
sb-errors.o \
sb-members.o \
siphash.o \
six.o \
snapshot.o \
str_hash.o \
subvolume.o \
super.o \
super-io.o \
sysfs.o \
tests.o \
time_stats.o \
thread_with_file.o \
trace.o \
two_state_shared_lock.o \
util.o \
varint.o \
xattr.o
bcachefs-$(CONFIG_BCACHEFS_ASYNC_OBJECT_LISTS) += async_objs.o
obj-$(CONFIG_MEAN_AND_VARIANCE_UNIT_TEST) += mean_and_variance_test.o
# Silence "note: xyz changed in GCC X.X" messages
subdir-ccflags-y += $(call cc-disable-warning, psabi)

@ -1,445 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
#include "bcachefs.h"
#include "acl.h"
#include "xattr.h"
#include <linux/posix_acl.h>
static const char * const acl_types[] = {
[ACL_USER_OBJ] = "user_obj",
[ACL_USER] = "user",
[ACL_GROUP_OBJ] = "group_obj",
[ACL_GROUP] = "group",
[ACL_MASK] = "mask",
[ACL_OTHER] = "other",
NULL,
};
void bch2_acl_to_text(struct printbuf *out, const void *value, size_t size)
{
const void *p, *end = value + size;
if (!value ||
size < sizeof(bch_acl_header) ||
((bch_acl_header *)value)->a_version != cpu_to_le32(BCH_ACL_VERSION))
return;
p = value + sizeof(bch_acl_header);
while (p < end) {
const bch_acl_entry *in = p;
unsigned tag = le16_to_cpu(in->e_tag);
prt_str(out, acl_types[tag]);
switch (tag) {
case ACL_USER_OBJ:
case ACL_GROUP_OBJ:
case ACL_MASK:
case ACL_OTHER:
p += sizeof(bch_acl_entry_short);
break;
case ACL_USER:
prt_printf(out, " uid %u", le32_to_cpu(in->e_id));
p += sizeof(bch_acl_entry);
break;
case ACL_GROUP:
prt_printf(out, " gid %u", le32_to_cpu(in->e_id));
p += sizeof(bch_acl_entry);
break;
}
prt_printf(out, " %o", le16_to_cpu(in->e_perm));
if (p != end)
prt_char(out, ' ');
}
}
#ifdef CONFIG_BCACHEFS_POSIX_ACL
#include "fs.h"
#include <linux/fs.h>
#include <linux/posix_acl_xattr.h>
#include <linux/sched.h>
#include <linux/slab.h>
static inline size_t bch2_acl_size(unsigned nr_short, unsigned nr_long)
{
return sizeof(bch_acl_header) +
sizeof(bch_acl_entry_short) * nr_short +
sizeof(bch_acl_entry) * nr_long;
}
static inline int acl_to_xattr_type(int type)
{
switch (type) {
case ACL_TYPE_ACCESS:
return KEY_TYPE_XATTR_INDEX_POSIX_ACL_ACCESS;
case ACL_TYPE_DEFAULT:
return KEY_TYPE_XATTR_INDEX_POSIX_ACL_DEFAULT;
default:
BUG();
}
}
/*
* Convert from filesystem to in-memory representation.
*/
static struct posix_acl *bch2_acl_from_disk(struct btree_trans *trans,
const void *value, size_t size)
{
const void *p, *end = value + size;
struct posix_acl *acl;
struct posix_acl_entry *out;
unsigned count = 0;
int ret;
if (!value)
return NULL;
if (size < sizeof(bch_acl_header))
goto invalid;
if (((bch_acl_header *)value)->a_version !=
cpu_to_le32(BCH_ACL_VERSION))
goto invalid;
p = value + sizeof(bch_acl_header);
while (p < end) {
const bch_acl_entry *entry = p;
if (p + sizeof(bch_acl_entry_short) > end)
goto invalid;
switch (le16_to_cpu(entry->e_tag)) {
case ACL_USER_OBJ:
case ACL_GROUP_OBJ:
case ACL_MASK:
case ACL_OTHER:
p += sizeof(bch_acl_entry_short);
break;
case ACL_USER:
case ACL_GROUP:
p += sizeof(bch_acl_entry);
break;
default:
goto invalid;
}
count++;
}
if (p > end)
goto invalid;
if (!count)
return NULL;
acl = allocate_dropping_locks(trans, ret,
posix_acl_alloc(count, _gfp));
if (!acl)
return ERR_PTR(-ENOMEM);
if (ret) {
kfree(acl);
return ERR_PTR(ret);
}
out = acl->a_entries;
p = value + sizeof(bch_acl_header);
while (p < end) {
const bch_acl_entry *in = p;
out->e_tag = le16_to_cpu(in->e_tag);
out->e_perm = le16_to_cpu(in->e_perm);
switch (out->e_tag) {
case ACL_USER_OBJ:
case ACL_GROUP_OBJ:
case ACL_MASK:
case ACL_OTHER:
p += sizeof(bch_acl_entry_short);
break;
case ACL_USER:
out->e_uid = make_kuid(&init_user_ns,
le32_to_cpu(in->e_id));
p += sizeof(bch_acl_entry);
break;
case ACL_GROUP:
out->e_gid = make_kgid(&init_user_ns,
le32_to_cpu(in->e_id));
p += sizeof(bch_acl_entry);
break;
}
out++;
}
BUG_ON(out != acl->a_entries + acl->a_count);
return acl;
invalid:
pr_err("invalid acl entry");
return ERR_PTR(-EINVAL);
}
/*
* Convert from in-memory to filesystem representation.
*/
static struct bkey_i_xattr *
bch2_acl_to_xattr(struct btree_trans *trans,
const struct posix_acl *acl,
int type)
{
struct bkey_i_xattr *xattr;
bch_acl_header *acl_header;
const struct posix_acl_entry *acl_e, *pe;
void *outptr;
unsigned nr_short = 0, nr_long = 0, acl_len, u64s;
FOREACH_ACL_ENTRY(acl_e, acl, pe) {
switch (acl_e->e_tag) {
case ACL_USER:
case ACL_GROUP:
nr_long++;
break;
case ACL_USER_OBJ:
case ACL_GROUP_OBJ:
case ACL_MASK:
case ACL_OTHER:
nr_short++;
break;
default:
return ERR_PTR(-EINVAL);
}
}
acl_len = bch2_acl_size(nr_short, nr_long);
u64s = BKEY_U64s + xattr_val_u64s(0, acl_len);
if (u64s > U8_MAX)
return ERR_PTR(-E2BIG);
xattr = bch2_trans_kmalloc(trans, u64s * sizeof(u64));
if (IS_ERR(xattr))
return xattr;
bkey_xattr_init(&xattr->k_i);
xattr->k.u64s = u64s;
xattr->v.x_type = acl_to_xattr_type(type);
xattr->v.x_name_len = 0;
xattr->v.x_val_len = cpu_to_le16(acl_len);
acl_header = xattr_val(&xattr->v);
acl_header->a_version = cpu_to_le32(BCH_ACL_VERSION);
outptr = (void *) acl_header + sizeof(*acl_header);
FOREACH_ACL_ENTRY(acl_e, acl, pe) {
bch_acl_entry *entry = outptr;
entry->e_tag = cpu_to_le16(acl_e->e_tag);
entry->e_perm = cpu_to_le16(acl_e->e_perm);
switch (acl_e->e_tag) {
case ACL_USER:
entry->e_id = cpu_to_le32(
from_kuid(&init_user_ns, acl_e->e_uid));
outptr += sizeof(bch_acl_entry);
break;
case ACL_GROUP:
entry->e_id = cpu_to_le32(
from_kgid(&init_user_ns, acl_e->e_gid));
outptr += sizeof(bch_acl_entry);
break;
case ACL_USER_OBJ:
case ACL_GROUP_OBJ:
case ACL_MASK:
case ACL_OTHER:
outptr += sizeof(bch_acl_entry_short);
break;
}
}
BUG_ON(outptr != xattr_val(&xattr->v) + acl_len);
return xattr;
}
struct posix_acl *bch2_get_acl(struct inode *vinode, int type, bool rcu)
{
struct bch_inode_info *inode = to_bch_ei(vinode);
struct bch_fs *c = inode->v.i_sb->s_fs_info;
struct bch_hash_info hash = bch2_hash_info_init(c, &inode->ei_inode);
struct xattr_search_key search = X_SEARCH(acl_to_xattr_type(type), "", 0);
struct btree_iter iter = {};
struct posix_acl *acl = NULL;
if (rcu)
return ERR_PTR(-ECHILD);
struct btree_trans *trans = bch2_trans_get(c);
retry:
bch2_trans_begin(trans);
struct bkey_s_c k = bch2_hash_lookup(trans, &iter, bch2_xattr_hash_desc,
&hash, inode_inum(inode), &search, 0);
int ret = bkey_err(k);
if (ret)
goto err;
struct bkey_s_c_xattr xattr = bkey_s_c_to_xattr(k);
acl = bch2_acl_from_disk(trans, xattr_val(xattr.v),
le16_to_cpu(xattr.v->x_val_len));
ret = PTR_ERR_OR_ZERO(acl);
err:
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
goto retry;
if (ret)
acl = !bch2_err_matches(ret, ENOENT) ? ERR_PTR(ret) : NULL;
if (!IS_ERR_OR_NULL(acl))
set_cached_acl(&inode->v, type, acl);
bch2_trans_iter_exit(trans, &iter);
bch2_trans_put(trans);
return acl;
}
int bch2_set_acl_trans(struct btree_trans *trans, subvol_inum inum,
struct bch_inode_unpacked *inode_u,
struct posix_acl *acl, int type)
{
struct bch_hash_info hash_info = bch2_hash_info_init(trans->c, inode_u);
int ret;
if (type == ACL_TYPE_DEFAULT &&
!S_ISDIR(inode_u->bi_mode))
return acl ? -EACCES : 0;
if (acl) {
struct bkey_i_xattr *xattr =
bch2_acl_to_xattr(trans, acl, type);
if (IS_ERR(xattr))
return PTR_ERR(xattr);
ret = bch2_hash_set(trans, bch2_xattr_hash_desc, &hash_info,
inum, &xattr->k_i, 0);
} else {
struct xattr_search_key search =
X_SEARCH(acl_to_xattr_type(type), "", 0);
ret = bch2_hash_delete(trans, bch2_xattr_hash_desc, &hash_info,
inum, &search);
}
return bch2_err_matches(ret, ENOENT) ? 0 : ret;
}
int bch2_set_acl(struct mnt_idmap *idmap,
struct dentry *dentry,
struct posix_acl *_acl, int type)
{
struct bch_inode_info *inode = to_bch_ei(dentry->d_inode);
struct bch_fs *c = inode->v.i_sb->s_fs_info;
struct btree_iter inode_iter = {};
struct bch_inode_unpacked inode_u;
struct posix_acl *acl;
umode_t mode;
int ret;
mutex_lock(&inode->ei_update_lock);
struct btree_trans *trans = bch2_trans_get(c);
retry:
bch2_trans_begin(trans);
acl = _acl;
ret = bch2_subvol_is_ro_trans(trans, inode->ei_inum.subvol) ?:
bch2_inode_peek(trans, &inode_iter, &inode_u, inode_inum(inode),
BTREE_ITER_intent);
if (ret)
goto btree_err;
mode = inode_u.bi_mode;
if (type == ACL_TYPE_ACCESS) {
ret = posix_acl_update_mode(idmap, &inode->v, &mode, &acl);
if (ret)
goto btree_err;
}
ret = bch2_set_acl_trans(trans, inode_inum(inode), &inode_u, acl, type);
if (ret)
goto btree_err;
inode_u.bi_ctime = bch2_current_time(c);
inode_u.bi_mode = mode;
ret = bch2_inode_write(trans, &inode_iter, &inode_u) ?:
bch2_trans_commit(trans, NULL, NULL, 0);
btree_err:
bch2_trans_iter_exit(trans, &inode_iter);
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
goto retry;
if (unlikely(ret))
goto err;
bch2_inode_update_after_write(trans, inode, &inode_u,
ATTR_CTIME|ATTR_MODE);
set_cached_acl(&inode->v, type, acl);
err:
bch2_trans_put(trans);
mutex_unlock(&inode->ei_update_lock);
return ret;
}
int bch2_acl_chmod(struct btree_trans *trans, subvol_inum inum,
struct bch_inode_unpacked *inode,
umode_t mode,
struct posix_acl **new_acl)
{
struct bch_hash_info hash_info = bch2_hash_info_init(trans->c, inode);
struct xattr_search_key search = X_SEARCH(KEY_TYPE_XATTR_INDEX_POSIX_ACL_ACCESS, "", 0);
struct btree_iter iter;
struct posix_acl *acl = NULL;
struct bkey_s_c k = bch2_hash_lookup(trans, &iter, bch2_xattr_hash_desc,
&hash_info, inum, &search, BTREE_ITER_intent);
int ret = bkey_err(k);
if (ret)
return bch2_err_matches(ret, ENOENT) ? 0 : ret;
struct bkey_s_c_xattr xattr = bkey_s_c_to_xattr(k);
acl = bch2_acl_from_disk(trans, xattr_val(xattr.v),
le16_to_cpu(xattr.v->x_val_len));
ret = PTR_ERR_OR_ZERO(acl);
if (ret)
goto err;
ret = allocate_dropping_locks_errcode(trans, __posix_acl_chmod(&acl, _gfp, mode));
if (ret)
goto err;
struct bkey_i_xattr *new = bch2_acl_to_xattr(trans, acl, ACL_TYPE_ACCESS);
ret = PTR_ERR_OR_ZERO(new);
if (ret)
goto err;
new->k.p = iter.pos;
ret = bch2_trans_update(trans, &iter, &new->k_i, 0);
*new_acl = acl;
acl = NULL;
err:
bch2_trans_iter_exit(trans, &iter);
if (!IS_ERR_OR_NULL(acl))
kfree(acl);
return ret;
}
#endif /* CONFIG_BCACHEFS_POSIX_ACL */
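
The first pass of bch2_acl_from_disk() above walks a buffer of variable-length entries before allocating anything. A minimal userspace sketch of that walk follows. It is an illustration under stated assumptions, not the removed kernel code: count_acl_entries() and the *_SIZE constants are hypothetical names, the tag values are the ones from <linux/posix_acl.h>, and the header version check is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tag values as defined in <linux/posix_acl.h> */
enum {
	TAG_USER_OBJ	= 0x01,
	TAG_USER	= 0x02,
	TAG_GROUP_OBJ	= 0x04,
	TAG_GROUP	= 0x08,
	TAG_MASK	= 0x10,
	TAG_OTHER	= 0x20,
};

/* Hypothetical stand-ins for sizeof(bch_acl_header) and the entry types */
#define HDR_SIZE	4	/* __le32 a_version */
#define SHORT_SIZE	4	/* __le16 e_tag, __le16 e_perm */
#define LONG_SIZE	8	/* short entry plus __le32 e_id */

/*
 * Count the entries in an on-disk ACL, or return -1 if the buffer is
 * truncated or contains an unknown tag.
 */
static int count_acl_entries(const uint8_t *buf, size_t size)
{
	size_t off = HDR_SIZE;
	int count = 0;

	assert(buf != NULL);

	if (size < HDR_SIZE)
		return -1;

	while (off < size) {
		/* Every entry is at least as large as a short entry */
		if (off + SHORT_SIZE > size)
			return -1;

		/* e_tag is stored little endian on disk */
		uint16_t tag = (uint16_t) (buf[off] | (buf[off + 1] << 8));

		switch (tag) {
		case TAG_USER_OBJ:
		case TAG_GROUP_OBJ:
		case TAG_MASK:
		case TAG_OTHER:
			off += SHORT_SIZE;
			break;
		case TAG_USER:
		case TAG_GROUP:
			off += LONG_SIZE;
			break;
		default:
			return -1;
		}
		count++;
	}

	/* A long entry may have stepped past the end of the buffer */
	return off == size ? count : -1;
}
```

A header-only buffer yields zero entries, which the kernel code treats as "no ACL"; a buffer that ends partway through an entry is rejected.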

@ -1,60 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_ACL_H
#define _BCACHEFS_ACL_H
struct bch_inode_unpacked;
struct bch_hash_info;
struct bch_inode_info;
struct posix_acl;
#define BCH_ACL_VERSION 0x0001
typedef struct {
__le16 e_tag;
__le16 e_perm;
__le32 e_id;
} bch_acl_entry;
typedef struct {
__le16 e_tag;
__le16 e_perm;
} bch_acl_entry_short;
typedef struct {
__le32 a_version;
} bch_acl_header;
void bch2_acl_to_text(struct printbuf *, const void *, size_t);
#ifdef CONFIG_BCACHEFS_POSIX_ACL
struct posix_acl *bch2_get_acl(struct inode *, int, bool);
int bch2_set_acl_trans(struct btree_trans *, subvol_inum,
struct bch_inode_unpacked *,
struct posix_acl *, int);
int bch2_set_acl(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
int bch2_acl_chmod(struct btree_trans *, subvol_inum,
struct bch_inode_unpacked *,
umode_t, struct posix_acl **);
#else
static inline int bch2_set_acl_trans(struct btree_trans *trans, subvol_inum inum,
struct bch_inode_unpacked *inode_u,
struct posix_acl *acl, int type)
{
return 0;
}
static inline int bch2_acl_chmod(struct btree_trans *trans, subvol_inum inum,
struct bch_inode_unpacked *inode,
umode_t mode,
struct posix_acl **new_acl)
{
return 0;
}
#endif /* CONFIG_BCACHEFS_POSIX_ACL */
#endif /* _BCACHEFS_ACL_H */

File diff suppressed because it is too large

@ -1,361 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_ALLOC_BACKGROUND_H
#define _BCACHEFS_ALLOC_BACKGROUND_H
#include "bcachefs.h"
#include "alloc_types.h"
#include "buckets.h"
#include "debug.h"
#include "super.h"
/* How out of date a pointer gen is allowed to be: */
#define BUCKET_GC_GEN_MAX 96U
static inline bool bch2_dev_bucket_exists(struct bch_fs *c, struct bpos pos)
{
guard(rcu)();
struct bch_dev *ca = bch2_dev_rcu_noerror(c, pos.inode);
return ca && bucket_valid(ca, pos.offset);
}
static inline u64 bucket_to_u64(struct bpos bucket)
{
return (bucket.inode << 48) | bucket.offset;
}
static inline struct bpos u64_to_bucket(u64 bucket)
{
return POS(bucket >> 48, bucket & ~(~0ULL << 48));
}
static inline u8 alloc_gc_gen(struct bch_alloc_v4 a)
{
return a.gen - a.oldest_gen;
}
static inline void alloc_to_bucket(struct bucket *dst, struct bch_alloc_v4 src)
{
dst->gen = src.gen;
dst->data_type = src.data_type;
dst->stripe_sectors = src.stripe_sectors;
dst->dirty_sectors = src.dirty_sectors;
dst->cached_sectors = src.cached_sectors;
dst->stripe = src.stripe;
}
static inline void __bucket_m_to_alloc(struct bch_alloc_v4 *dst, struct bucket src)
{
dst->gen = src.gen;
dst->data_type = src.data_type;
dst->stripe_sectors = src.stripe_sectors;
dst->dirty_sectors = src.dirty_sectors;
dst->cached_sectors = src.cached_sectors;
dst->stripe = src.stripe;
}
static inline struct bch_alloc_v4 bucket_m_to_alloc(struct bucket b)
{
struct bch_alloc_v4 ret = {};
__bucket_m_to_alloc(&ret, b);
return ret;
}
static inline enum bch_data_type bucket_data_type(enum bch_data_type data_type)
{
switch (data_type) {
case BCH_DATA_cached:
case BCH_DATA_stripe:
return BCH_DATA_user;
default:
return data_type;
}
}
static inline bool bucket_data_type_mismatch(enum bch_data_type bucket,
enum bch_data_type ptr)
{
return !data_type_is_empty(bucket) &&
bucket_data_type(bucket) != bucket_data_type(ptr);
}
/*
* It is my general preference to use unsigned types for unsigned quantities -
* however, these helpers are used in disk accounting calculations run by
* triggers where the output will be negated and added to an s64. unsigned is
* right out even though all these quantities will fit in 32 bits, since it
* won't be sign extended correctly; u64 will negate "correctly", but s64 is the
* simpler option here.
*/
static inline s64 bch2_bucket_sectors_total(struct bch_alloc_v4 a)
{
return a.stripe_sectors + a.dirty_sectors + a.cached_sectors;
}
static inline s64 bch2_bucket_sectors_dirty(struct bch_alloc_v4 a)
{
return a.stripe_sectors + a.dirty_sectors;
}
static inline s64 bch2_bucket_sectors(struct bch_alloc_v4 a)
{
return a.data_type == BCH_DATA_cached
? a.cached_sectors
: bch2_bucket_sectors_dirty(a);
}
static inline s64 bch2_bucket_sectors_fragmented(struct bch_dev *ca,
struct bch_alloc_v4 a)
{
int d = bch2_bucket_sectors(a);
return d ? max(0, ca->mi.bucket_size - d) : 0;
}
static inline s64 bch2_gc_bucket_sectors_fragmented(struct bch_dev *ca, struct bucket a)
{
int d = a.stripe_sectors + a.dirty_sectors;
return d ? max(0, ca->mi.bucket_size - d) : 0;
}
static inline s64 bch2_bucket_sectors_unstriped(struct bch_alloc_v4 a)
{
return a.data_type == BCH_DATA_stripe ? a.dirty_sectors : 0;
}
static inline enum bch_data_type alloc_data_type(struct bch_alloc_v4 a,
enum bch_data_type data_type)
{
if (a.stripe)
return data_type == BCH_DATA_parity ? data_type : BCH_DATA_stripe;
if (bch2_bucket_sectors_dirty(a))
return bucket_data_type(data_type);
if (a.cached_sectors)
return BCH_DATA_cached;
if (BCH_ALLOC_V4_NEED_DISCARD(&a))
return BCH_DATA_need_discard;
if (alloc_gc_gen(a) >= BUCKET_GC_GEN_MAX)
return BCH_DATA_need_gc_gens;
return BCH_DATA_free;
}
static inline void alloc_data_type_set(struct bch_alloc_v4 *a, enum bch_data_type data_type)
{
a->data_type = alloc_data_type(*a, data_type);
}
static inline u64 alloc_lru_idx_read(struct bch_alloc_v4 a)
{
return a.data_type == BCH_DATA_cached
? a.io_time[READ] & LRU_TIME_MAX
: 0;
}
#define DATA_TYPES_MOVABLE \
((1U << BCH_DATA_btree)| \
(1U << BCH_DATA_user)| \
(1U << BCH_DATA_stripe))
static inline bool data_type_movable(enum bch_data_type type)
{
return (1U << type) & DATA_TYPES_MOVABLE;
}
static inline u64 alloc_lru_idx_fragmentation(struct bch_alloc_v4 a,
struct bch_dev *ca)
{
if (a.data_type >= BCH_DATA_NR)
return 0;
if (!data_type_movable(a.data_type) ||
!bch2_bucket_sectors_fragmented(ca, a))
return 0;
/*
* avoid overflowing LRU_TIME_BITS on a corrupted fs, when
* bucket_sectors_dirty is (much) bigger than bucket_size
*/
u64 d = min_t(s64, bch2_bucket_sectors_dirty(a),
ca->mi.bucket_size);
return div_u64(d * (1ULL << 31), ca->mi.bucket_size);
}
static inline u64 alloc_freespace_genbits(struct bch_alloc_v4 a)
{
return ((u64) alloc_gc_gen(a) >> 4) << 56;
}
static inline struct bpos alloc_freespace_pos(struct bpos pos, struct bch_alloc_v4 a)
{
pos.offset |= alloc_freespace_genbits(a);
return pos;
}
static inline unsigned alloc_v4_u64s_noerror(const struct bch_alloc_v4 *a)
{
return (BCH_ALLOC_V4_BACKPOINTERS_START(a) ?:
BCH_ALLOC_V4_U64s_V0) +
BCH_ALLOC_V4_NR_BACKPOINTERS(a) *
(sizeof(struct bch_backpointer) / sizeof(u64));
}
static inline unsigned alloc_v4_u64s(const struct bch_alloc_v4 *a)
{
unsigned ret = alloc_v4_u64s_noerror(a);
BUG_ON(ret > U8_MAX - BKEY_U64s);
return ret;
}
static inline void set_alloc_v4_u64s(struct bkey_i_alloc_v4 *a)
{
set_bkey_val_u64s(&a->k, alloc_v4_u64s(&a->v));
}
struct bkey_i_alloc_v4 *
bch2_trans_start_alloc_update_noupdate(struct btree_trans *, struct btree_iter *, struct bpos);
struct bkey_i_alloc_v4 *
bch2_trans_start_alloc_update(struct btree_trans *, struct bpos,
enum btree_iter_update_trigger_flags);
void __bch2_alloc_to_v4(struct bkey_s_c, struct bch_alloc_v4 *);
static inline const struct bch_alloc_v4 *bch2_alloc_to_v4(struct bkey_s_c k, struct bch_alloc_v4 *convert)
{
const struct bch_alloc_v4 *ret;
if (unlikely(k.k->type != KEY_TYPE_alloc_v4))
goto slowpath;
ret = bkey_s_c_to_alloc_v4(k).v;
if (BCH_ALLOC_V4_BACKPOINTERS_START(ret) != BCH_ALLOC_V4_U64s)
goto slowpath;
return ret;
slowpath:
__bch2_alloc_to_v4(k, convert);
return convert;
}
struct bkey_i_alloc_v4 *bch2_alloc_to_v4_mut(struct btree_trans *, struct bkey_s_c);
int bch2_bucket_io_time_reset(struct btree_trans *, unsigned, size_t, int);
int bch2_alloc_v1_validate(struct bch_fs *, struct bkey_s_c,
struct bkey_validate_context);
int bch2_alloc_v2_validate(struct bch_fs *, struct bkey_s_c,
struct bkey_validate_context);
int bch2_alloc_v3_validate(struct bch_fs *, struct bkey_s_c,
struct bkey_validate_context);
int bch2_alloc_v4_validate(struct bch_fs *, struct bkey_s_c,
struct bkey_validate_context);
void bch2_alloc_v4_swab(struct bkey_s);
void bch2_alloc_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
void bch2_alloc_v4_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
#define bch2_bkey_ops_alloc ((struct bkey_ops) { \
.key_validate = bch2_alloc_v1_validate, \
.val_to_text = bch2_alloc_to_text, \
.trigger = bch2_trigger_alloc, \
.min_val_size = 8, \
})
#define bch2_bkey_ops_alloc_v2 ((struct bkey_ops) { \
.key_validate = bch2_alloc_v2_validate, \
.val_to_text = bch2_alloc_to_text, \
.trigger = bch2_trigger_alloc, \
.min_val_size = 8, \
})
#define bch2_bkey_ops_alloc_v3 ((struct bkey_ops) { \
.key_validate = bch2_alloc_v3_validate, \
.val_to_text = bch2_alloc_to_text, \
.trigger = bch2_trigger_alloc, \
.min_val_size = 16, \
})
#define bch2_bkey_ops_alloc_v4 ((struct bkey_ops) { \
.key_validate = bch2_alloc_v4_validate, \
.val_to_text = bch2_alloc_v4_to_text, \
.swab = bch2_alloc_v4_swab, \
.trigger = bch2_trigger_alloc, \
.min_val_size = 48, \
})
int bch2_bucket_gens_validate(struct bch_fs *, struct bkey_s_c,
struct bkey_validate_context);
void bch2_bucket_gens_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
#define bch2_bkey_ops_bucket_gens ((struct bkey_ops) { \
.key_validate = bch2_bucket_gens_validate, \
.val_to_text = bch2_bucket_gens_to_text, \
})
int bch2_bucket_gens_init(struct bch_fs *);
static inline bool bkey_is_alloc(const struct bkey *k)
{
return k->type == KEY_TYPE_alloc ||
k->type == KEY_TYPE_alloc_v2 ||
k->type == KEY_TYPE_alloc_v3;
}
int bch2_alloc_read(struct bch_fs *);
int bch2_alloc_key_to_dev_counters(struct btree_trans *, struct bch_dev *,
const struct bch_alloc_v4 *,
const struct bch_alloc_v4 *, unsigned);
int bch2_trigger_alloc(struct btree_trans *, enum btree_id, unsigned,
struct bkey_s_c, struct bkey_s,
enum btree_iter_update_trigger_flags);
int bch2_check_discard_freespace_key(struct btree_trans *, struct btree_iter *, u8 *, bool);
int bch2_check_alloc_info(struct bch_fs *);
int bch2_check_alloc_to_lru_refs(struct bch_fs *);
void bch2_dev_do_discards(struct bch_dev *);
void bch2_do_discards(struct bch_fs *);
static inline u64 should_invalidate_buckets(struct bch_dev *ca,
struct bch_dev_usage u)
{
u64 want_free = ca->mi.nbuckets >> 7;
u64 free = max_t(s64, 0,
u.buckets[BCH_DATA_free]
+ u.buckets[BCH_DATA_need_discard]
- bch2_dev_buckets_reserved(ca, BCH_WATERMARK_stripe));
return clamp_t(s64, want_free - free, 0, u.buckets[BCH_DATA_cached]);
}
void bch2_dev_do_invalidates(struct bch_dev *);
void bch2_do_invalidates(struct bch_fs *);
static inline struct bch_backpointer *alloc_v4_backpointers(struct bch_alloc_v4 *a)
{
return (void *) ((u64 *) &a->v +
(BCH_ALLOC_V4_BACKPOINTERS_START(a) ?:
BCH_ALLOC_V4_U64s_V0));
}
static inline const struct bch_backpointer *alloc_v4_backpointers_c(const struct bch_alloc_v4 *a)
{
return (void *) ((u64 *) &a->v + BCH_ALLOC_V4_BACKPOINTERS_START(a));
}
int bch2_dev_freespace_init(struct bch_fs *, struct bch_dev *, u64, u64);
int bch2_fs_freespace_init(struct bch_fs *);
int bch2_dev_remove_alloc(struct bch_fs *, struct bch_dev *);
void bch2_recalc_capacity(struct bch_fs *);
u64 bch2_min_rw_member_capacity(struct bch_fs *);
void bch2_dev_allocator_set_rw(struct bch_fs *, struct bch_dev *, bool);
void bch2_dev_allocator_remove(struct bch_fs *, struct bch_dev *);
void bch2_dev_allocator_add(struct bch_fs *, struct bch_dev *);
void bch2_dev_allocator_background_exit(struct bch_dev *);
void bch2_dev_allocator_background_init(struct bch_dev *);
void bch2_fs_allocator_background_init(struct bch_fs *);
#endif /* _BCACHEFS_ALLOC_BACKGROUND_H */
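
bucket_to_u64() and u64_to_bucket() near the top of this header pack a (device, bucket offset) pair into one 64-bit integer: 16 bits of device index above 48 bits of offset. A userspace sketch of the round trip follows. The names struct bucket_pos, pack_bucket() and unpack_bucket() are hypothetical, and the assert() is an addition for illustration that the removed code does not contain.

```c
#include <assert.h>
#include <stdint.h>

#define BUCKET_OFFSET_BITS 48

/* Hypothetical stand-in for the (inode, offset) part of struct bpos */
struct bucket_pos {
	uint64_t dev;
	uint64_t offset;
};

static uint64_t pack_bucket(struct bucket_pos b)
{
	/* An offset wider than 48 bits would bleed into the device index */
	assert(b.offset < (1ULL << BUCKET_OFFSET_BITS));

	return (b.dev << BUCKET_OFFSET_BITS) | b.offset;
}

static struct bucket_pos unpack_bucket(uint64_t v)
{
	struct bucket_pos b = {
		.dev	= v >> BUCKET_OFFSET_BITS,
		/* ~(~0ULL << 48) is a mask of the low 48 bits */
		.offset	= v & ~(~0ULL << BUCKET_OFFSET_BITS),
	};

	return b;
}
```

Packing and then unpacking returns the original pair as long as the device index fits in 16 bits and the offset in 48.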

@ -1,95 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_ALLOC_BACKGROUND_FORMAT_H
#define _BCACHEFS_ALLOC_BACKGROUND_FORMAT_H
struct bch_alloc {
struct bch_val v;
__u8 fields;
__u8 gen;
__u8 data[];
} __packed __aligned(8);
#define BCH_ALLOC_FIELDS_V1() \
x(read_time, 16) \
x(write_time, 16) \
x(data_type, 8) \
x(dirty_sectors, 16) \
x(cached_sectors, 16) \
x(oldest_gen, 8) \
x(stripe, 32) \
x(stripe_redundancy, 8)
enum {
#define x(name, _bits) BCH_ALLOC_FIELD_V1_##name,
BCH_ALLOC_FIELDS_V1()
#undef x
};
struct bch_alloc_v2 {
struct bch_val v;
__u8 nr_fields;
__u8 gen;
__u8 oldest_gen;
__u8 data_type;
__u8 data[];
} __packed __aligned(8);
#define BCH_ALLOC_FIELDS_V2() \
x(read_time, 64) \
x(write_time, 64) \
x(dirty_sectors, 32) \
x(cached_sectors, 32) \
x(stripe, 32) \
x(stripe_redundancy, 8)
struct bch_alloc_v3 {
struct bch_val v;
__le64 journal_seq;
__le32 flags;
__u8 nr_fields;
__u8 gen;
__u8 oldest_gen;
__u8 data_type;
__u8 data[];
} __packed __aligned(8);
LE32_BITMASK(BCH_ALLOC_V3_NEED_DISCARD,struct bch_alloc_v3, flags, 0, 1)
LE32_BITMASK(BCH_ALLOC_V3_NEED_INC_GEN,struct bch_alloc_v3, flags, 1, 2)
struct bch_alloc_v4 {
struct bch_val v;
__u64 journal_seq_nonempty;
__u32 flags;
__u8 gen;
__u8 oldest_gen;
__u8 data_type;
__u8 stripe_redundancy;
__u32 dirty_sectors;
__u32 cached_sectors;
__u64 io_time[2];
__u32 stripe;
__u32 nr_external_backpointers;
/* end of fields in original version of alloc_v4 */
__u64 journal_seq_empty;
__u32 stripe_sectors;
__u32 pad;
} __packed __aligned(8);
#define BCH_ALLOC_V4_U64s_V0 6
#define BCH_ALLOC_V4_U64s (sizeof(struct bch_alloc_v4) / sizeof(__u64))
BITMASK(BCH_ALLOC_V4_NEED_DISCARD, struct bch_alloc_v4, flags, 0, 1)
BITMASK(BCH_ALLOC_V4_NEED_INC_GEN, struct bch_alloc_v4, flags, 1, 2)
BITMASK(BCH_ALLOC_V4_BACKPOINTERS_START,struct bch_alloc_v4, flags, 2, 8)
BITMASK(BCH_ALLOC_V4_NR_BACKPOINTERS, struct bch_alloc_v4, flags, 8, 14)
#define KEY_TYPE_BUCKET_GENS_BITS 8
#define KEY_TYPE_BUCKET_GENS_NR (1U << KEY_TYPE_BUCKET_GENS_BITS)
#define KEY_TYPE_BUCKET_GENS_MASK (KEY_TYPE_BUCKET_GENS_NR - 1)
struct bch_bucket_gens {
struct bch_val v;
u8 gens[KEY_TYPE_BUCKET_GENS_NR];
} __packed __aligned(8);
#endif /* _BCACHEFS_ALLOC_BACKGROUND_FORMAT_H */
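
The BITMASK() lines above carve the 32-bit flags word of struct bch_alloc_v4 into sub-fields: bit 0 for NEED_DISCARD, bit 1 for NEED_INC_GEN, bits 2..7 for BACKPOINTERS_START and bits 8..13 for NR_BACKPOINTERS. The macro itself is defined elsewhere and is not part of this diff, so the sketch below assumes it generates accessors for the half-open bit range [lo, hi). field_get() and field_set() are hypothetical helpers written for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Read the bits [lo, hi) of a 32-bit flags word */
static uint64_t field_get(uint32_t flags, unsigned lo, unsigned hi)
{
	assert(lo < hi && hi <= 32);

	return ((uint64_t) flags >> lo) & ~(~0ULL << (hi - lo));
}

/* Return flags with the bits [lo, hi) replaced by v */
static uint32_t field_set(uint32_t flags, unsigned lo, unsigned hi, uint64_t v)
{
	assert(lo < hi && hi <= 32);

	uint64_t mask = ~(~0ULL << (hi - lo));

	/* The new value must fit in the field */
	assert(v <= mask);

	return (uint32_t) (((uint64_t) flags & ~(mask << lo)) | (v << lo));
}
```

With these helpers, field_get(flags, 2, 8) would play the role of BCH_ALLOC_V4_BACKPOINTERS_START() and field_get(flags, 8, 14) that of BCH_ALLOC_V4_NR_BACKPOINTERS().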

File diff suppressed because it is too large

@ -1,318 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_ALLOC_FOREGROUND_H
#define _BCACHEFS_ALLOC_FOREGROUND_H
#include "bcachefs.h"
#include "buckets.h"
#include "alloc_types.h"
#include "extents.h"
#include "io_write_types.h"
#include "sb-members.h"
#include <linux/hash.h>
struct bkey;
struct bch_dev;
struct bch_fs;
struct bch_devs_list;
extern const char * const bch2_watermarks[];
void bch2_reset_alloc_cursors(struct bch_fs *);
struct dev_alloc_list {
unsigned nr;
u8 data[BCH_SB_MEMBERS_MAX];
};
struct alloc_request {
unsigned nr_replicas;
unsigned target;
bool ec;
enum bch_watermark watermark;
enum bch_write_flags flags;
enum bch_data_type data_type;
struct bch_devs_list *devs_have;
struct write_point *wp;
/* These fields are used primarily by open_bucket_add_buckets */
struct open_buckets ptrs;
unsigned nr_effective; /* sum of @ptrs durability */
bool have_cache; /* have we allocated from a 0 durability dev */
struct bch_devs_mask devs_may_alloc;
/* bch2_bucket_alloc_set_trans(): */
struct dev_alloc_list devs_sorted;
struct bch_dev_usage usage;
/* bch2_bucket_alloc_trans(): */
struct bch_dev *ca;
enum {
BTREE_BITMAP_NO,
BTREE_BITMAP_YES,
BTREE_BITMAP_ANY,
} btree_bitmap;
struct {
u64 buckets_seen;
u64 skipped_open;
u64 skipped_need_journal_commit;
u64 need_journal_commit;
u64 skipped_nocow;
u64 skipped_nouse;
u64 skipped_mi_btree_bitmap;
} counters;
unsigned scratch_nr_replicas;
unsigned scratch_nr_effective;
bool scratch_have_cache;
enum bch_data_type scratch_data_type;
struct open_buckets scratch_ptrs;
struct bch_devs_mask scratch_devs_may_alloc;
};
void bch2_dev_alloc_list(struct bch_fs *,
struct dev_stripe_state *,
struct bch_devs_mask *,
struct dev_alloc_list *);
void bch2_dev_stripe_increment(struct bch_dev *, struct dev_stripe_state *);
static inline struct bch_dev *ob_dev(struct bch_fs *c, struct open_bucket *ob)
{
return bch2_dev_have_ref(c, ob->dev);
}
static inline unsigned bch2_open_buckets_reserved(enum bch_watermark watermark)
{
switch (watermark) {
case BCH_WATERMARK_interior_updates:
return 0;
case BCH_WATERMARK_reclaim:
return OPEN_BUCKETS_COUNT / 6;
case BCH_WATERMARK_btree:
case BCH_WATERMARK_btree_copygc:
return OPEN_BUCKETS_COUNT / 4;
case BCH_WATERMARK_copygc:
return OPEN_BUCKETS_COUNT / 3;
default:
return OPEN_BUCKETS_COUNT / 2;
}
}
struct open_bucket *bch2_bucket_alloc(struct bch_fs *, struct bch_dev *,
enum bch_watermark, enum bch_data_type,
struct closure *);
static inline void ob_push(struct bch_fs *c, struct open_buckets *obs,
struct open_bucket *ob)
{
BUG_ON(obs->nr >= ARRAY_SIZE(obs->v));
obs->v[obs->nr++] = ob - c->open_buckets;
}
#define open_bucket_for_each(_c, _obs, _ob, _i) \
for ((_i) = 0; \
(_i) < (_obs)->nr && \
((_ob) = (_c)->open_buckets + (_obs)->v[_i], true); \
(_i)++)
static inline struct open_bucket *ec_open_bucket(struct bch_fs *c,
struct open_buckets *obs)
{
struct open_bucket *ob;
unsigned i;
open_bucket_for_each(c, obs, ob, i)
if (ob->ec)
return ob;
return NULL;
}
void bch2_open_bucket_write_error(struct bch_fs *,
struct open_buckets *, unsigned, int);
void __bch2_open_bucket_put(struct bch_fs *, struct open_bucket *);
static inline void bch2_open_bucket_put(struct bch_fs *c, struct open_bucket *ob)
{
if (atomic_dec_and_test(&ob->pin))
__bch2_open_bucket_put(c, ob);
}
static inline void bch2_open_buckets_put(struct bch_fs *c,
struct open_buckets *ptrs)
{
struct open_bucket *ob;
unsigned i;
open_bucket_for_each(c, ptrs, ob, i)
bch2_open_bucket_put(c, ob);
ptrs->nr = 0;
}
static inline void bch2_alloc_sectors_done_inlined(struct bch_fs *c, struct write_point *wp)
{
struct open_buckets ptrs = { .nr = 0 }, keep = { .nr = 0 };
struct open_bucket *ob;
unsigned i;
open_bucket_for_each(c, &wp->ptrs, ob, i)
ob_push(c, ob->sectors_free < block_sectors(c)
? &ptrs
: &keep, ob);
wp->ptrs = keep;
mutex_unlock(&wp->lock);
bch2_open_buckets_put(c, &ptrs);
}
static inline void bch2_open_bucket_get(struct bch_fs *c,
struct write_point *wp,
struct open_buckets *ptrs)
{
struct open_bucket *ob;
unsigned i;
open_bucket_for_each(c, &wp->ptrs, ob, i) {
ob->data_type = wp->data_type;
atomic_inc(&ob->pin);
ob_push(c, ptrs, ob);
}
}
static inline open_bucket_idx_t *open_bucket_hashslot(struct bch_fs *c,
unsigned dev, u64 bucket)
{
return c->open_buckets_hash +
(jhash_3words(dev, bucket, bucket >> 32, 0) &
(OPEN_BUCKETS_COUNT - 1));
}
static inline bool bch2_bucket_is_open(struct bch_fs *c, unsigned dev, u64 bucket)
{
open_bucket_idx_t slot = *open_bucket_hashslot(c, dev, bucket);
while (slot) {
struct open_bucket *ob = &c->open_buckets[slot];
if (ob->dev == dev && ob->bucket == bucket)
return true;
slot = ob->hash;
}
return false;
}
static inline bool bch2_bucket_is_open_safe(struct bch_fs *c, unsigned dev, u64 bucket)
{
bool ret;
if (bch2_bucket_is_open(c, dev, bucket))
return true;
spin_lock(&c->freelist_lock);
ret = bch2_bucket_is_open(c, dev, bucket);
spin_unlock(&c->freelist_lock);
return ret;
}
enum bch_write_flags;
int bch2_bucket_alloc_set_trans(struct btree_trans *, struct alloc_request *,
struct dev_stripe_state *, struct closure *);
int bch2_alloc_sectors_start_trans(struct btree_trans *,
unsigned, unsigned,
struct write_point_specifier,
struct bch_devs_list *,
unsigned, unsigned,
enum bch_watermark,
enum bch_write_flags,
struct closure *,
struct write_point **);
static inline struct bch_extent_ptr bch2_ob_ptr(struct bch_fs *c, struct open_bucket *ob)
{
struct bch_dev *ca = ob_dev(c, ob);
return (struct bch_extent_ptr) {
.type = 1 << BCH_EXTENT_ENTRY_ptr,
.gen = ob->gen,
.dev = ob->dev,
.offset = bucket_to_sector(ca, ob->bucket) +
ca->mi.bucket_size -
ob->sectors_free,
};
}
/*
* Append pointers to the space we just allocated to @k, and mark @sectors space
* as allocated out of @ob
*/
static inline void
bch2_alloc_sectors_append_ptrs_inlined(struct bch_fs *c, struct write_point *wp,
struct bkey_i *k, unsigned sectors,
bool cached)
{
struct open_bucket *ob;
unsigned i;
BUG_ON(sectors > wp->sectors_free);
wp->sectors_free -= sectors;
wp->sectors_allocated += sectors;
open_bucket_for_each(c, &wp->ptrs, ob, i) {
struct bch_dev *ca = ob_dev(c, ob);
struct bch_extent_ptr ptr = bch2_ob_ptr(c, ob);
ptr.cached = cached ||
(!ca->mi.durability &&
wp->data_type == BCH_DATA_user);
bch2_bkey_append_ptr(k, ptr);
BUG_ON(sectors > ob->sectors_free);
ob->sectors_free -= sectors;
}
}
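/*
 * Illustration only: bch2_ob_ptr() and the append helper above treat each open
 * bucket as a bump allocator. The next free sector is the bucket's start plus
 * however much of the bucket has already been handed out. The sketch below
 * assumes bucket_to_sector() is a plain multiplication by the bucket size; the
 * toy_* names are hypothetical.
 */

```c
#include <assert.h>
#include <stdint.h>

struct toy_open_bucket {
	uint64_t bucket;	/* bucket index on the device */
	uint32_t sectors_free;	/* sectors not yet handed out */
};

/* First unallocated sector: start of bucket + sectors already used. */
static uint64_t toy_next_sector(uint32_t bucket_size, const struct toy_open_bucket *ob)
{
	return ob->bucket * bucket_size + bucket_size - ob->sectors_free;
}

/* Hand out @sectors sectors and return where they start. */
static uint64_t toy_alloc(uint32_t bucket_size, struct toy_open_bucket *ob, uint32_t sectors)
{
	uint64_t start;

	assert(sectors <= ob->sectors_free);	/* stands in for BUG_ON() */
	start = toy_next_sector(bucket_size, ob);
	ob->sectors_free -= sectors;
	return start;
}
```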
void bch2_alloc_sectors_append_ptrs(struct bch_fs *, struct write_point *,
struct bkey_i *, unsigned, bool);
void bch2_alloc_sectors_done(struct bch_fs *, struct write_point *);
void bch2_open_buckets_stop(struct bch_fs *c, struct bch_dev *, bool);
static inline struct write_point_specifier writepoint_hashed(unsigned long v)
{
return (struct write_point_specifier) { .v = v | 1 };
}
static inline struct write_point_specifier writepoint_ptr(struct write_point *wp)
{
return (struct write_point_specifier) { .v = (unsigned long) wp };
}
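/*
 * Illustration only: the two constructors above pack either a hashed value or
 * a pointer into one unsigned long. Hashed values always get the low bit set,
 * while a pointer to a suitably aligned struct has it clear, so the low bit
 * can tell the two apart. The sketch below shows that tagging scheme in
 * userspace; toy_is_hashed() and the other toy_* names are hypothetical and do
 * not appear in the code shown here.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_specifier { uintptr_t v; };
struct toy_write_point { long payload; };

static struct toy_specifier toy_hashed(uintptr_t v)
{
	return (struct toy_specifier) { .v = v | 1 };
}

static struct toy_specifier toy_ptr(struct toy_write_point *wp)
{
	/* Only works because the struct's alignment keeps bit 0 clear. */
	assert(!((uintptr_t) wp & 1));
	return (struct toy_specifier) { .v = (uintptr_t) wp };
}

static bool toy_is_hashed(struct toy_specifier s)
{
	return s.v & 1;
}
```

/*
 * One consequence of "v | 1": two inputs that differ only in bit 0 produce the
 * same specifier.
 */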
void bch2_fs_allocator_foreground_init(struct bch_fs *);
void bch2_open_bucket_to_text(struct printbuf *, struct bch_fs *, struct open_bucket *);
void bch2_open_buckets_to_text(struct printbuf *, struct bch_fs *, struct bch_dev *);
void bch2_open_buckets_partial_to_text(struct printbuf *, struct bch_fs *);
void bch2_write_points_to_text(struct printbuf *, struct bch_fs *);
void bch2_fs_alloc_debug_to_text(struct printbuf *, struct bch_fs *);
void bch2_dev_alloc_debug_to_text(struct printbuf *, struct bch_dev *);
void __bch2_wait_on_allocator(struct bch_fs *, struct closure *);
static inline void bch2_wait_on_allocator(struct bch_fs *c, struct closure *cl)
{
if (cl->closure_get_happened)
__bch2_wait_on_allocator(c, cl);
}
#endif /* _BCACHEFS_ALLOC_FOREGROUND_H */

@ -1,121 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_ALLOC_TYPES_H
#define _BCACHEFS_ALLOC_TYPES_H
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include "clock_types.h"
#include "fifo.h"
#define BCH_WATERMARKS() \
x(stripe) \
x(normal) \
x(copygc) \
x(btree) \
x(btree_copygc) \
x(reclaim) \
x(interior_updates)
enum bch_watermark {
#define x(name) BCH_WATERMARK_##name,
BCH_WATERMARKS()
#undef x
BCH_WATERMARK_NR,
};
#define BCH_WATERMARK_BITS 3
#define BCH_WATERMARK_MASK ~(~0U << BCH_WATERMARK_BITS)
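/*
 * Illustration only: BCH_WATERMARKS() is an "X-macro": the list is written
 * once and expanded several times with different definitions of x(). The
 * sketch below expands a shortened, hypothetical list into both an enum and a
 * matching name table, and checks that every value fits in the bit mask.
 */

```c
#include <assert.h>
#include <string.h>

#define TOY_WATERMARKS()	\
	x(stripe)		\
	x(normal)		\
	x(copygc)

/* First expansion: one enumerator per list entry. */
enum toy_watermark {
#define x(name)	TOY_WATERMARK_##name,
	TOY_WATERMARKS()
#undef x
	TOY_WATERMARK_NR,
};

/* Second expansion: one string per list entry, in the same order. */
static const char * const toy_watermark_names[] = {
#define x(name)	#name,
	TOY_WATERMARKS()
#undef x
	NULL,
};

#define TOY_WATERMARK_BITS	2
#define TOY_WATERMARK_MASK	(~(~0U << TOY_WATERMARK_BITS))

/* Every enumerator, including the _NR sentinel, must survive masking. */
_Static_assert((TOY_WATERMARK_NR & TOY_WATERMARK_MASK) == TOY_WATERMARK_NR,
	       "watermark does not fit in TOY_WATERMARK_BITS");
```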
#define OPEN_BUCKETS_COUNT 1024
#define WRITE_POINT_HASH_NR 32
#define WRITE_POINT_MAX 32
/*
* 0 is never a valid open_bucket_idx_t:
*/
typedef u16 open_bucket_idx_t;
struct open_bucket {
spinlock_t lock;
atomic_t pin;
open_bucket_idx_t freelist;
open_bucket_idx_t hash;
/*
* When an open bucket has an ec_stripe attached, this is the index of
* the block in the stripe this open_bucket corresponds to:
*/
u8 ec_idx;
enum bch_data_type data_type:6;
unsigned valid:1;
unsigned on_partial_list:1;
u8 dev;
u8 gen;
u32 sectors_free;
u64 bucket;
struct ec_stripe_new *ec;
};
#define OPEN_BUCKET_LIST_MAX 15
struct open_buckets {
open_bucket_idx_t nr;
open_bucket_idx_t v[OPEN_BUCKET_LIST_MAX];
};
struct dev_stripe_state {
u64 next_alloc[BCH_SB_MEMBERS_MAX];
};
#define WRITE_POINT_STATES() \
x(stopped) \
x(waiting_io) \
x(waiting_work) \
x(runnable) \
x(running)
enum write_point_state {
#define x(n) WRITE_POINT_##n,
WRITE_POINT_STATES()
#undef x
WRITE_POINT_STATE_NR
};
struct write_point {
struct {
struct hlist_node node;
struct mutex lock;
u64 last_used;
unsigned long write_point;
enum bch_data_type data_type;
/* calculated based on how many pointers we're actually going to use: */
unsigned sectors_free;
struct open_buckets ptrs;
struct dev_stripe_state stripe;
u64 sectors_allocated;
} __aligned(SMP_CACHE_BYTES);
struct {
struct work_struct index_update_work;
struct list_head writes;
spinlock_t writes_lock;
enum write_point_state state;
u64 last_state_change;
u64 time[WRITE_POINT_STATE_NR];
u64 last_runtime;
} __aligned(SMP_CACHE_BYTES);
};
struct write_point_specifier {
unsigned long v;
};
#endif /* _BCACHEFS_ALLOC_TYPES_H */

@ -1,132 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
/*
* Async obj debugging: keep asynchronous objects on (very fast) lists, make
* them visible in debugfs:
*/
#include "bcachefs.h"
#include "async_objs.h"
#include "btree_io.h"
#include "debug.h"
#include "io_read.h"
#include "io_write.h"
#include <linux/debugfs.h>
static void promote_obj_to_text(struct printbuf *out, void *obj)
{
bch2_promote_op_to_text(out, obj);
}
static void rbio_obj_to_text(struct printbuf *out, void *obj)
{
bch2_read_bio_to_text(out, obj);
}
static void write_op_obj_to_text(struct printbuf *out, void *obj)
{
bch2_write_op_to_text(out, obj);
}
static void btree_read_bio_obj_to_text(struct printbuf *out, void *obj)
{
struct btree_read_bio *rbio = obj;
bch2_btree_read_bio_to_text(out, rbio);
}
static void btree_write_bio_obj_to_text(struct printbuf *out, void *obj)
{
struct btree_write_bio *wbio = obj;
bch2_bio_to_text(out, &wbio->wbio.bio);
}
static int bch2_async_obj_list_open(struct inode *inode, struct file *file)
{
struct async_obj_list *list = inode->i_private;
struct dump_iter *i;
i = kzalloc(sizeof(struct dump_iter), GFP_KERNEL);
if (!i)
return -ENOMEM;
file->private_data = i;
i->from = POS_MIN;
i->iter = 0;
i->c = container_of(list, struct bch_fs, async_objs[list->idx]);
i->list = list;
i->buf = PRINTBUF;
return 0;
}
static ssize_t bch2_async_obj_list_read(struct file *file, char __user *buf,
size_t size, loff_t *ppos)
{
struct dump_iter *i = file->private_data;
struct async_obj_list *list = i->list;
ssize_t ret = 0;
i->ubuf = buf;
i->size = size;
i->ret = 0;
struct genradix_iter iter;
void *obj;
fast_list_for_each_from(&list->list, iter, obj, i->iter) {
ret = bch2_debugfs_flush_buf(i);
if (ret)
return ret;
if (!i->size)
break;
list->obj_to_text(&i->buf, obj);
}
if (i->buf.allocation_failure)
ret = -ENOMEM;
else
i->iter = iter.pos;
if (!ret)
ret = bch2_debugfs_flush_buf(i);
return ret ?: i->ret;
}
static const struct file_operations async_obj_ops = {
.owner = THIS_MODULE,
.open = bch2_async_obj_list_open,
.release = bch2_dump_release,
.read = bch2_async_obj_list_read,
};
void bch2_fs_async_obj_debugfs_init(struct bch_fs *c)
{
c->async_obj_dir = debugfs_create_dir("async_objs", c->fs_debug_dir);
#define x(n) debugfs_create_file(#n, 0400, c->async_obj_dir, \
&c->async_objs[BCH_ASYNC_OBJ_LIST_##n], &async_obj_ops);
BCH_ASYNC_OBJ_LISTS()
#undef x
}
void bch2_fs_async_obj_exit(struct bch_fs *c)
{
for (unsigned i = 0; i < ARRAY_SIZE(c->async_objs); i++)
fast_list_exit(&c->async_objs[i].list);
}
int bch2_fs_async_obj_init(struct bch_fs *c)
{
for (unsigned i = 0; i < ARRAY_SIZE(c->async_objs); i++) {
if (fast_list_init(&c->async_objs[i].list))
return -BCH_ERR_ENOMEM_async_obj_init;
c->async_objs[i].idx = i;
}
#define x(n) c->async_objs[BCH_ASYNC_OBJ_LIST_##n].obj_to_text = n##_obj_to_text;
BCH_ASYNC_OBJ_LISTS()
#undef x
return 0;
}

@ -1,44 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_ASYNC_OBJS_H
#define _BCACHEFS_ASYNC_OBJS_H
#ifdef CONFIG_BCACHEFS_ASYNC_OBJECT_LISTS
static inline void __async_object_list_del(struct fast_list *head, unsigned idx)
{
fast_list_remove(head, idx);
}
static inline int __async_object_list_add(struct fast_list *head, void *obj, unsigned *idx)
{
int ret = fast_list_add(head, obj);
*idx = ret > 0 ? ret : 0;
return ret < 0 ? ret : 0;
}
#define async_object_list_del(_c, _list, idx) \
__async_object_list_del(&(_c)->async_objs[BCH_ASYNC_OBJ_LIST_##_list].list, idx)
#define async_object_list_add(_c, _list, obj, idx) \
__async_object_list_add(&(_c)->async_objs[BCH_ASYNC_OBJ_LIST_##_list].list, obj, idx)
void bch2_fs_async_obj_debugfs_init(struct bch_fs *);
void bch2_fs_async_obj_exit(struct bch_fs *);
int bch2_fs_async_obj_init(struct bch_fs *);
#else /* CONFIG_BCACHEFS_ASYNC_OBJECT_LISTS */
#define async_object_list_del(_c, _n, idx) do {} while (0)
static inline int __async_object_list_add(void)
{
return 0;
}
#define async_object_list_add(_c, _n, obj, idx) __async_object_list_add()
static inline void bch2_fs_async_obj_debugfs_init(struct bch_fs *c) {}
static inline void bch2_fs_async_obj_exit(struct bch_fs *c) {}
static inline int bch2_fs_async_obj_init(struct bch_fs *c) { return 0; }
#endif /* CONFIG_BCACHEFS_ASYNC_OBJECT_LISTS */
#endif /* _BCACHEFS_ASYNC_OBJS_H */

@ -1,25 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_ASYNC_OBJS_TYPES_H
#define _BCACHEFS_ASYNC_OBJS_TYPES_H
#define BCH_ASYNC_OBJ_LISTS() \
x(promote) \
x(rbio) \
x(write_op) \
x(btree_read_bio) \
x(btree_write_bio)
enum bch_async_obj_lists {
#define x(n) BCH_ASYNC_OBJ_LIST_##n,
BCH_ASYNC_OBJ_LISTS()
#undef x
BCH_ASYNC_OBJ_NR
};
struct async_obj_list {
struct fast_list list;
void (*obj_to_text)(struct printbuf *, void *);
unsigned idx;
};
#endif /* _BCACHEFS_ASYNC_OBJS_TYPES_H */

File diff suppressed because it is too large

@ -1,200 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BACKPOINTERS_H
#define _BCACHEFS_BACKPOINTERS_H
#include "btree_cache.h"
#include "btree_iter.h"
#include "btree_update.h"
#include "buckets.h"
#include "error.h"
#include "super.h"
static inline u64 swab40(u64 x)
{
return (((x & 0x00000000ffULL) << 32)|
((x & 0x000000ff00ULL) << 16)|
((x & 0x0000ff0000ULL) >> 0)|
((x & 0x00ff000000ULL) >> 16)|
((x & 0xff00000000ULL) >> 32));
}
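/*
 * Illustration only: swab40() reverses the byte order of a 40-bit (5-byte)
 * value, leaving the middle byte in place and discarding anything above bit
 * 39. The userspace copy below (renamed toy_swab40, with uint64_t in place of
 * u64) makes that easy to check by hand.
 */

```c
#include <assert.h>
#include <stdint.h>

static inline uint64_t toy_swab40(uint64_t x)
{
	return (((x & 0x00000000ffULL) << 32) |	/* byte 0 -> byte 4 */
		((x & 0x000000ff00ULL) << 16) |	/* byte 1 -> byte 3 */
		((x & 0x0000ff0000ULL) >>  0) |	/* byte 2 stays put */
		((x & 0x00ff000000ULL) >> 16) |	/* byte 3 -> byte 1 */
		((x & 0xff00000000ULL) >> 32));	/* byte 4 -> byte 0 */
}
```

/*
 * For inputs that fit in 40 bits, applying the function twice returns the
 * original value.
 */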
int bch2_backpointer_validate(struct bch_fs *, struct bkey_s_c k,
struct bkey_validate_context);
void bch2_backpointer_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
void bch2_backpointer_swab(struct bkey_s);
#define bch2_bkey_ops_backpointer ((struct bkey_ops) { \
.key_validate = bch2_backpointer_validate, \
.val_to_text = bch2_backpointer_to_text, \
.swab = bch2_backpointer_swab, \
.min_val_size = 32, \
})
#define MAX_EXTENT_COMPRESS_RATIO_SHIFT 10
/*
* Convert from pos in backpointer btree to pos of corresponding bucket in alloc
* btree:
*/
static inline struct bpos bp_pos_to_bucket(const struct bch_dev *ca, struct bpos bp_pos)
{
u64 bucket_sector = bp_pos.offset >> MAX_EXTENT_COMPRESS_RATIO_SHIFT;
return POS(bp_pos.inode, sector_to_bucket(ca, bucket_sector));
}
static inline struct bpos bp_pos_to_bucket_and_offset(const struct bch_dev *ca, struct bpos bp_pos,
u32 *bucket_offset)
{
u64 bucket_sector = bp_pos.offset >> MAX_EXTENT_COMPRESS_RATIO_SHIFT;
return POS(bp_pos.inode, sector_to_bucket_and_offset(ca, bucket_sector, bucket_offset));
}
static inline bool bp_pos_to_bucket_nodev_noerror(struct bch_fs *c, struct bpos bp_pos, struct bpos *bucket)
{
guard(rcu)();
struct bch_dev *ca = bch2_dev_rcu_noerror(c, bp_pos.inode);
if (ca)
*bucket = bp_pos_to_bucket(ca, bp_pos);
return ca != NULL;
}
static inline struct bpos bucket_pos_to_bp_noerror(const struct bch_dev *ca,
struct bpos bucket,
u64 bucket_offset)
{
return POS(bucket.inode,
(bucket_to_sector(ca, bucket.offset) <<
MAX_EXTENT_COMPRESS_RATIO_SHIFT) + bucket_offset);
}
/*
* Convert from pos in alloc btree + bucket offset to pos in backpointer btree:
*/
static inline struct bpos bucket_pos_to_bp(const struct bch_dev *ca,
struct bpos bucket,
u64 bucket_offset)
{
struct bpos ret = bucket_pos_to_bp_noerror(ca, bucket, bucket_offset);
EBUG_ON(!bkey_eq(bucket, bp_pos_to_bucket(ca, ret)));
return ret;
}
static inline struct bpos bucket_pos_to_bp_start(const struct bch_dev *ca, struct bpos bucket)
{
return bucket_pos_to_bp(ca, bucket, 0);
}
static inline struct bpos bucket_pos_to_bp_end(const struct bch_dev *ca, struct bpos bucket)
{
return bpos_nosnap_predecessor(bucket_pos_to_bp(ca, bpos_nosnap_successor(bucket), 0));
}
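/*
 * Illustration only: the helpers above map a bucket to a contiguous range of
 * backpointer offsets by shifting the bucket's starting sector left by
 * MAX_EXTENT_COMPRESS_RATIO_SHIFT, and map back by shifting right. The sketch
 * below models only the offset field and assumes bucket_to_sector() and
 * sector_to_bucket() are multiplication and division by a fixed bucket size;
 * the toy_* names are hypothetical.
 */

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RATIO_SHIFT 10

/* Bucket + offset within the bucket's range -> backpointer offset. */
static uint64_t toy_bucket_to_bp(uint64_t bucket_size, uint64_t bucket,
				 uint64_t bucket_offset)
{
	assert(bucket_offset < (bucket_size << TOY_RATIO_SHIFT));
	return ((bucket * bucket_size) << TOY_RATIO_SHIFT) + bucket_offset;
}

/* Backpointer offset -> owning bucket. */
static uint64_t toy_bp_to_bucket(uint64_t bucket_size, uint64_t bp_offset)
{
	return (bp_offset >> TOY_RATIO_SHIFT) / bucket_size;
}

/* Last backpointer offset of @bucket: one before the next bucket's start. */
static uint64_t toy_bucket_bp_end(uint64_t bucket_size, uint64_t bucket)
{
	return toy_bucket_to_bp(bucket_size, bucket + 1, 0) - 1;
}
```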
int bch2_bucket_backpointer_mod_nowritebuffer(struct btree_trans *,
struct bkey_s_c,
struct bkey_i_backpointer *,
bool);
static inline int bch2_bucket_backpointer_mod(struct btree_trans *trans,
struct bkey_s_c orig_k,
struct bkey_i_backpointer *bp,
bool insert)
{
if (static_branch_unlikely(&bch2_backpointers_no_use_write_buffer))
return bch2_bucket_backpointer_mod_nowritebuffer(trans, orig_k, bp, insert);
if (!insert) {
bp->k.type = KEY_TYPE_deleted;
set_bkey_val_u64s(&bp->k, 0);
}
return bch2_trans_update_buffered(trans, BTREE_ID_backpointers, &bp->k_i);
}
static inline enum bch_data_type bch2_bkey_ptr_data_type(struct bkey_s_c k,
struct extent_ptr_decoded p,
const union bch_extent_entry *entry)
{
switch (k.k->type) {
case KEY_TYPE_btree_ptr:
case KEY_TYPE_btree_ptr_v2:
return BCH_DATA_btree;
case KEY_TYPE_extent:
case KEY_TYPE_reflink_v:
if (p.has_ec)
return BCH_DATA_stripe;
if (p.ptr.cached)
return BCH_DATA_cached;
else
return BCH_DATA_user;
case KEY_TYPE_stripe: {
const struct bch_extent_ptr *ptr = &entry->ptr;
struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
BUG_ON(ptr < s.v->ptrs ||
ptr >= s.v->ptrs + s.v->nr_blocks);
return ptr >= s.v->ptrs + s.v->nr_blocks - s.v->nr_redundant
? BCH_DATA_parity
: BCH_DATA_user;
}
default:
BUG();
}
}
static inline void bch2_extent_ptr_to_bp(struct bch_fs *c,
enum btree_id btree_id, unsigned level,
struct bkey_s_c k, struct extent_ptr_decoded p,
const union bch_extent_entry *entry,
struct bkey_i_backpointer *bp)
{
bkey_backpointer_init(&bp->k_i);
bp->k.p.inode = p.ptr.dev;
if (k.k->type != KEY_TYPE_stripe)
bp->k.p.offset = ((u64) p.ptr.offset << MAX_EXTENT_COMPRESS_RATIO_SHIFT) + p.crc.offset;
else {
/*
* Put stripe backpointers where they won't collide with the
* extent backpointers within the stripe:
*/
struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
bp->k.p.offset = ((u64) (p.ptr.offset + le16_to_cpu(s.v->sectors)) <<
MAX_EXTENT_COMPRESS_RATIO_SHIFT) - 1;
}
bp->v = (struct bch_backpointer) {
.btree_id = btree_id,
.level = level,
.data_type = bch2_bkey_ptr_data_type(k, p, entry),
.bucket_gen = p.ptr.gen,
.bucket_len = ptr_disk_sectors(level ? btree_sectors(c) : k.k->size, p),
.pos = k.k->p,
};
}
struct bkey_buf;
struct bkey_s_c bch2_backpointer_get_key(struct btree_trans *, struct bkey_s_c_backpointer,
struct btree_iter *, unsigned, struct bkey_buf *);
struct btree *bch2_backpointer_get_node(struct btree_trans *, struct bkey_s_c_backpointer,
struct btree_iter *, struct bkey_buf *);
int bch2_check_bucket_backpointer_mismatch(struct btree_trans *, struct bch_dev *, u64,
bool, struct bkey_buf *);
int bch2_check_btree_backpointers(struct bch_fs *);
int bch2_check_extents_to_backpointers(struct bch_fs *);
int bch2_check_backpointers_to_extents(struct bch_fs *);
static inline bool bch2_bucket_bitmap_test(struct bucket_bitmap *b, u64 i)
{
unsigned long *bitmap = READ_ONCE(b->buckets);
return bitmap && test_bit(i, bitmap);
}
int bch2_bucket_bitmap_resize(struct bch_dev *, struct bucket_bitmap *, u64, u64);
void bch2_bucket_bitmap_free(struct bucket_bitmap *);
#endif /* _BCACHEFS_BACKPOINTERS_H */

@ -1,37 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BBPOS_H
#define _BCACHEFS_BBPOS_H
#include "bbpos_types.h"
#include "bkey_methods.h"
#include "btree_cache.h"
static inline int bbpos_cmp(struct bbpos l, struct bbpos r)
{
return cmp_int(l.btree, r.btree) ?: bpos_cmp(l.pos, r.pos);
}
static inline struct bbpos bbpos_successor(struct bbpos pos)
{
if (bpos_cmp(pos.pos, SPOS_MAX)) {
pos.pos = bpos_successor(pos.pos);
return pos;
}
if (pos.btree != BTREE_ID_NR) {
pos.btree++;
pos.pos = POS_MIN;
return pos;
}
BUG();
}
static inline void bch2_bbpos_to_text(struct printbuf *out, struct bbpos pos)
{
bch2_btree_id_to_text(out, pos.btree);
prt_char(out, ':');
bch2_bpos_to_text(out, pos.pos);
}
#endif /* _BCACHEFS_BBPOS_H */

@ -1,18 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BBPOS_TYPES_H
#define _BCACHEFS_BBPOS_TYPES_H
struct bbpos {
enum btree_id btree;
struct bpos pos;
};
static inline struct bbpos BBPOS(enum btree_id btree, struct bpos pos)
{
return (struct bbpos) { btree, pos };
}
#define BBPOS_MIN BBPOS(0, POS_MIN)
#define BBPOS_MAX BBPOS(BTREE_ID_NR - 1, SPOS_MAX)
#endif /* _BCACHEFS_BBPOS_TYPES_H */

File diff suppressed because it is too large

File diff suppressed because it is too large

@ -1,473 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_IOCTL_H
#define _BCACHEFS_IOCTL_H
#include <linux/uuid.h>
#include <asm/ioctl.h>
#include "bcachefs_format.h"
#include "bkey_types.h"
/*
* Flags common to multiple ioctls:
*/
#define BCH_FORCE_IF_DATA_LOST (1 << 0)
#define BCH_FORCE_IF_METADATA_LOST (1 << 1)
#define BCH_FORCE_IF_DATA_DEGRADED (1 << 2)
#define BCH_FORCE_IF_METADATA_DEGRADED (1 << 3)
#define BCH_FORCE_IF_LOST \
(BCH_FORCE_IF_DATA_LOST| \
BCH_FORCE_IF_METADATA_LOST)
#define BCH_FORCE_IF_DEGRADED \
(BCH_FORCE_IF_DATA_DEGRADED| \
BCH_FORCE_IF_METADATA_DEGRADED)
/*
* If cleared, ioctls that refer to a device pass it as a pointer to a pathname
* (e.g. /dev/sda1); if set, the dev field is the device's index within the
* filesystem:
*/
#define BCH_BY_INDEX (1 << 4)
/*
* For BCH_IOCTL_READ_SUPER: get superblock of a specific device, not filesystem
* wide superblock:
*/
#define BCH_READ_DEV (1 << 5)
/* global control dev: */
/* These are currently broken, and probably unnecessary: */
#if 0
#define BCH_IOCTL_ASSEMBLE _IOW(0xbc, 1, struct bch_ioctl_assemble)
#define BCH_IOCTL_INCREMENTAL _IOW(0xbc, 2, struct bch_ioctl_incremental)
struct bch_ioctl_assemble {
__u32 flags;
__u32 nr_devs;
__u64 pad;
__u64 devs[];
};
struct bch_ioctl_incremental {
__u32 flags;
__u64 pad;
__u64 dev;
};
#endif
/* filesystem ioctls: */
#define BCH_IOCTL_QUERY_UUID _IOR(0xbc, 1, struct bch_ioctl_query_uuid)
/* These only make sense when we also have incremental assembly */
#if 0
#define BCH_IOCTL_START _IOW(0xbc, 2, struct bch_ioctl_start)
#define BCH_IOCTL_STOP _IO(0xbc, 3)
#endif
#define BCH_IOCTL_DISK_ADD _IOW(0xbc, 4, struct bch_ioctl_disk)
#define BCH_IOCTL_DISK_REMOVE _IOW(0xbc, 5, struct bch_ioctl_disk)
#define BCH_IOCTL_DISK_ONLINE _IOW(0xbc, 6, struct bch_ioctl_disk)
#define BCH_IOCTL_DISK_OFFLINE _IOW(0xbc, 7, struct bch_ioctl_disk)
#define BCH_IOCTL_DISK_SET_STATE _IOW(0xbc, 8, struct bch_ioctl_disk_set_state)
#define BCH_IOCTL_DATA _IOW(0xbc, 10, struct bch_ioctl_data)
#define BCH_IOCTL_FS_USAGE _IOWR(0xbc, 11, struct bch_ioctl_fs_usage)
#define BCH_IOCTL_DEV_USAGE _IOWR(0xbc, 11, struct bch_ioctl_dev_usage)
#define BCH_IOCTL_READ_SUPER _IOW(0xbc, 12, struct bch_ioctl_read_super)
#define BCH_IOCTL_DISK_GET_IDX _IOW(0xbc, 13, struct bch_ioctl_disk_get_idx)
#define BCH_IOCTL_DISK_RESIZE _IOW(0xbc, 14, struct bch_ioctl_disk_resize)
#define BCH_IOCTL_DISK_RESIZE_JOURNAL _IOW(0xbc,15, struct bch_ioctl_disk_resize_journal)
#define BCH_IOCTL_SUBVOLUME_CREATE _IOW(0xbc, 16, struct bch_ioctl_subvolume)
#define BCH_IOCTL_SUBVOLUME_DESTROY _IOW(0xbc, 17, struct bch_ioctl_subvolume)
#define BCH_IOCTL_DEV_USAGE_V2 _IOWR(0xbc, 18, struct bch_ioctl_dev_usage_v2)
#define BCH_IOCTL_FSCK_OFFLINE _IOW(0xbc, 19, struct bch_ioctl_fsck_offline)
#define BCH_IOCTL_FSCK_ONLINE _IOW(0xbc, 20, struct bch_ioctl_fsck_online)
#define BCH_IOCTL_QUERY_ACCOUNTING _IOW(0xbc, 21, struct bch_ioctl_query_accounting)
#define BCH_IOCTL_QUERY_COUNTERS _IOW(0xbc, 21, struct bch_ioctl_query_counters)
/* ioctls below act on a particular file, not the filesystem as a whole: */
#define BCHFS_IOC_REINHERIT_ATTRS _IOR(0xbc, 64, const char __user *)
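/*
 * Illustration only: several definitions above reuse a command number (for
 * example 11 and 21). On Linux the _IOW()/_IOWR() macros also encode the size
 * of the argument struct into the request code, so two requests with the same
 * number still get distinct codes when their struct sizes differ. The sketch
 * below shows this with two hypothetical structs; it is Linux-specific and
 * relies on the UAPI header <linux/ioctl.h>.
 */

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>	/* _IOWR, _IOC_NR, _IOC_SIZE */

/* Hypothetical argument structs of different sizes (16 and 24 bytes). */
struct toy_fs_usage {
	uint64_t capacity;
	uint64_t used;
};

struct toy_dev_usage {
	uint64_t dev;
	uint32_t flags;
	uint32_t pad;
	uint64_t nr_buckets;
};

_Static_assert(sizeof(struct toy_fs_usage) != sizeof(struct toy_dev_usage),
	       "the sketch relies on the two sizes differing");

/* Same type byte (0xbc) and same number (11), different argument types. */
#define TOY_IOCTL_FS_USAGE	_IOWR(0xbc, 11, struct toy_fs_usage)
#define TOY_IOCTL_DEV_USAGE	_IOWR(0xbc, 11, struct toy_dev_usage)
```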
/*
* BCH_IOCTL_QUERY_UUID: get filesystem UUID
*
* Returns user visible UUID, not internal UUID (which may not ever be changed);
* the filesystem's sysfs directory may be found under /sys/fs/bcachefs with
* this UUID.
*/
struct bch_ioctl_query_uuid {
__uuid_t uuid;
};
#if 0
struct bch_ioctl_start {
__u32 flags;
__u32 pad;
};
#endif
/*
* BCH_IOCTL_DISK_ADD: add a new device to an existing filesystem
*
* The specified device must not be open or in use. On success, the new device
* will be an online member of the filesystem just like any other member.
*
* The device must first be prepared by userspace by formatting with a bcachefs
* superblock, which is only used for passing in superblock options/parameters
* for that device (in struct bch_member). The new device's superblock should
* not claim to be a member of any existing filesystem - UUIDs on it will be
* ignored.
*/
/*
* BCH_IOCTL_DISK_REMOVE: permanently remove a member device from a filesystem
*
* Any data present on @dev will be permanently deleted, and @dev will be
* removed from its slot in the filesystem's list of member devices. The device
* may be either online or offline.
*
* Will fail if removing @dev would leave us with insufficient read-write devices
* or degraded/unavailable data, unless the appropriate BCH_FORCE_IF_* flags are
* set.
*/
/*
* BCH_IOCTL_DISK_ONLINE: given a disk that is already a member of a filesystem
* but is not open (e.g. because we started in degraded mode), bring it online
*
* all existing data on @dev will be available once the device is online,
* exactly as if @dev was present when the filesystem was first mounted
*/
/*
* BCH_IOCTL_DISK_OFFLINE: offline a disk, causing the kernel to close that
* block device, without removing it from the filesystem (so it can be brought
* back online later)
*
* Data present on @dev will be unavailable while @dev is offline (unless
* replicated), but will still be intact and untouched if @dev is brought back
* online
*
* Will fail (similarly to BCH_IOCTL_DISK_SET_STATE) if offlining @dev would
* leave us with insufficient read write devices or degraded/unavailable data,
* unless the appropriate BCH_FORCE_IF_* flags are set.
*/
struct bch_ioctl_disk {
__u32 flags;
__u32 pad;
__u64 dev;
};
/*
* BCH_IOCTL_DISK_SET_STATE: modify state of a member device of a filesystem
*
* @new_state - one of the bch_member_state states (rw, ro, failed,
* spare)
*
* Will refuse to change member state if we would then have insufficient devices
* to write to, or if it would result in degraded data (when @new_state is
* failed or spare) unless the appropriate BCH_FORCE_IF_* flags are set.
*/
struct bch_ioctl_disk_set_state {
__u32 flags;
__u8 new_state;
__u8 pad[3];
__u64 dev;
};
#define BCH_DATA_OPS() \
x(scrub, 0) \
x(rereplicate, 1) \
x(migrate, 2) \
x(rewrite_old_nodes, 3) \
x(drop_extra_replicas, 4)
enum bch_data_ops {
#define x(t, n) BCH_DATA_OP_##t = n,
BCH_DATA_OPS()
#undef x
BCH_DATA_OP_NR
};
/*
* BCH_IOCTL_DATA: operations that walk and manipulate filesystem data (e.g.
* scrub, rereplicate, migrate).
*
* This ioctl kicks off a job in the background, and returns a file descriptor.
* Reading from the file descriptor returns a struct bch_ioctl_data_event,
* indicating current progress, and closing the file descriptor will stop the
* job. The file descriptor is O_CLOEXEC.
*/
struct bch_ioctl_data {
__u16 op;
__u8 start_btree;
__u8 end_btree;
__u32 flags;
struct bpos start_pos;
struct bpos end_pos;
union {
struct {
__u32 dev;
__u32 data_types;
} scrub;
struct {
__u32 dev;
__u32 pad;
} migrate;
struct {
__u64 pad[8];
};
};
} __packed __aligned(8);
enum bch_data_event {
BCH_DATA_EVENT_PROGRESS = 0,
/* XXX: add an event for reporting errors */
BCH_DATA_EVENT_NR = 1,
};
enum data_progress_data_type_special {
DATA_PROGRESS_DATA_TYPE_phys = 254,
DATA_PROGRESS_DATA_TYPE_done = 255,
};
struct bch_ioctl_data_progress {
__u8 data_type;
__u8 btree_id;
__u8 pad[2];
struct bpos pos;
__u64 sectors_done;
__u64 sectors_total;
__u64 sectors_error_corrected;
__u64 sectors_error_uncorrected;
} __packed __aligned(8);
enum bch_ioctl_data_event_ret {
BCH_IOCTL_DATA_EVENT_RET_done = 1,
BCH_IOCTL_DATA_EVENT_RET_device_offline = 2,
};
struct bch_ioctl_data_event {
__u8 type;
__u8 ret;
__u8 pad[6];
union {
struct bch_ioctl_data_progress p;
__u64 pad2[15];
};
} __packed __aligned(8);
struct bch_replicas_usage {
__u64 sectors;
struct bch_replicas_entry_v1 r;
} __packed;
static inline unsigned replicas_usage_bytes(struct bch_replicas_usage *u)
{
return offsetof(struct bch_replicas_usage, r) + replicas_entry_bytes(&u->r);
}
static inline struct bch_replicas_usage *
replicas_usage_next(struct bch_replicas_usage *u)
{
return (void *) u + replicas_usage_bytes(u);
}
/* Obsolete */
/*
* BCH_IOCTL_FS_USAGE: query filesystem disk space usage
*
* Returns disk space usage broken out by data type, number of replicas, and
* by component device
*
* @replica_entries_bytes - size, in bytes, allocated for replica usage entries
*
* On success, @replica_entries_bytes will be changed to indicate the number of
* bytes actually used.
*
* Returns -ERANGE if @replica_entries_bytes was too small
*/
struct bch_ioctl_fs_usage {
__u64 capacity;
__u64 used;
__u64 online_reserved;
__u64 persistent_reserved[BCH_REPLICAS_MAX];
__u32 replica_entries_bytes;
__u32 pad;
struct bch_replicas_usage replicas[];
};
/* Obsolete */
/*
* BCH_IOCTL_DEV_USAGE: query device disk space usage
*
* Returns disk space usage broken out by data type - both by buckets and
* sectors.
*/
struct bch_ioctl_dev_usage {
__u64 dev;
__u32 flags;
__u8 state;
__u8 pad[7];
__u32 bucket_size;
__u64 nr_buckets;
__u64 buckets_ec;
struct bch_ioctl_dev_usage_type {
__u64 buckets;
__u64 sectors;
__u64 fragmented;
} d[10];
};
/* Obsolete */
struct bch_ioctl_dev_usage_v2 {
__u64 dev;
__u32 flags;
__u8 state;
__u8 nr_data_types;
__u8 pad[6];
__u32 bucket_size;
__u64 nr_buckets;
struct bch_ioctl_dev_usage_type d[];
};
/*
* BCH_IOCTL_READ_SUPER: read filesystem superblock
*
* Equivalent to reading the superblock directly from the block device, except
* avoids racing with the kernel writing the superblock or having to figure out
* which block device to read
*
* @sb - buffer to read into
* @size - size of userspace allocated buffer
* @dev - device to read superblock for, if BCH_READ_DEV flag is
* specified
*
* Returns -ERANGE if buffer provided is too small
*/
struct bch_ioctl_read_super {
__u32 flags;
__u32 pad;
__u64 dev;
__u64 size;
__u64 sb;
};
/*
* BCH_IOCTL_DISK_GET_IDX: given a path to a block device, query the filesystem
* to determine whether the disk is an (online) member - if so, return its index
*
* Returns -ENOENT if not found
*/
struct bch_ioctl_disk_get_idx {
__u64 dev;
};
/*
* BCH_IOCTL_DISK_RESIZE: resize filesystem on a device
*
* @dev - member to resize
* @nbuckets - new number of buckets
*/
struct bch_ioctl_disk_resize {
__u32 flags;
__u32 pad;
__u64 dev;
__u64 nbuckets;
};
/*
* BCH_IOCTL_DISK_RESIZE_JOURNAL: resize journal on a device
*
* @dev - member to resize
* @nbuckets - new number of buckets
*/
struct bch_ioctl_disk_resize_journal {
__u32 flags;
__u32 pad;
__u64 dev;
__u64 nbuckets;
};
struct bch_ioctl_subvolume {
__u32 flags;
__u32 dirfd;
__u16 mode;
__u16 pad[3];
__u64 dst_ptr;
__u64 src_ptr;
};
#define BCH_SUBVOL_SNAPSHOT_CREATE (1U << 0)
#define BCH_SUBVOL_SNAPSHOT_RO (1U << 1)
/*
* BCH_IOCTL_FSCK_OFFLINE: run fsck from the 'bcachefs fsck' userspace command,
* but with the kernel's implementation of fsck:
*/
struct bch_ioctl_fsck_offline {
__u64 flags;
__u64 opts; /* string */
__u64 nr_devs;
__u64 devs[] __counted_by(nr_devs);
};
/*
* BCH_IOCTL_FSCK_ONLINE: run fsck from the 'bcachefs fsck' userspace command,
* but with the kernel's implementation of fsck:
*/
struct bch_ioctl_fsck_online {
__u64 flags;
__u64 opts; /* string */
};
/*
* BCH_IOCTL_QUERY_ACCOUNTING: query filesystem disk accounting
*
* Returns disk space usage broken out by data type, number of replicas, and
* by component device
*
* @replica_entries_bytes - size, in bytes, allocated for replica usage entries
*
* On success, @replica_entries_bytes will be changed to indicate the number of
* bytes actually used.
*
* Returns -ERANGE if @replica_entries_bytes was too small
*/
struct bch_ioctl_query_accounting {
__u64 capacity;
__u64 used;
__u64 online_reserved;
__u32 accounting_u64s; /* input parameter */
__u32 accounting_types_mask; /* input parameter */
struct bkey_i_accounting accounting[];
};
#define BCH_IOCTL_QUERY_COUNTERS_MOUNT (1 << 0)
struct bch_ioctl_query_counters {
__u16 nr;
__u16 flags;
__u32 pad;
__u64 d[];
};
#endif /* _BCACHEFS_IOCTL_H */

File diff suppressed because it is too large

@ -1,605 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BKEY_H
#define _BCACHEFS_BKEY_H
#include <linux/bug.h>
#include "bcachefs_format.h"
#include "bkey_types.h"
#include "btree_types.h"
#include "util.h"
#include "vstructs.h"
#if 0
/*
* compiled unpack functions are disabled, pending a new interface for
* dynamically allocating executable memory:
*/
#ifdef CONFIG_X86_64
#define HAVE_BCACHEFS_COMPILED_UNPACK 1
#endif
#endif
void bch2_bkey_packed_to_binary_text(struct printbuf *,
const struct bkey_format *,
const struct bkey_packed *);
enum bkey_lr_packed {
BKEY_PACKED_BOTH,
BKEY_PACKED_RIGHT,
BKEY_PACKED_LEFT,
BKEY_PACKED_NONE,
};
#define bkey_lr_packed(_l, _r) \
((_l)->format + ((_r)->format << 1))
static inline void bkey_p_copy(struct bkey_packed *dst, const struct bkey_packed *src)
{
memcpy_u64s_small(dst, src, src->u64s);
}
static inline void bkey_copy(struct bkey_i *dst, const struct bkey_i *src)
{
memcpy_u64s_small(dst, src, src->k.u64s);
}
struct btree;
__pure
unsigned bch2_bkey_greatest_differing_bit(const struct btree *,
const struct bkey_packed *,
const struct bkey_packed *);
__pure
unsigned bch2_bkey_ffs(const struct btree *, const struct bkey_packed *);
__pure
int __bch2_bkey_cmp_packed_format_checked(const struct bkey_packed *,
const struct bkey_packed *,
const struct btree *);
__pure
int __bch2_bkey_cmp_left_packed_format_checked(const struct btree *,
const struct bkey_packed *,
const struct bpos *);
__pure
int bch2_bkey_cmp_packed(const struct btree *,
const struct bkey_packed *,
const struct bkey_packed *);
__pure
int __bch2_bkey_cmp_left_packed(const struct btree *,
const struct bkey_packed *,
const struct bpos *);
static inline __pure
int bkey_cmp_left_packed(const struct btree *b,
const struct bkey_packed *l, const struct bpos *r)
{
return __bch2_bkey_cmp_left_packed(b, l, r);
}
/*
* The compiler generates better code when we pass bpos by ref, but it's often
* terribly convenient to pass it by val... as much as I hate C++, const ref
* would be nice here:
*/
__pure __flatten
static inline int bkey_cmp_left_packed_byval(const struct btree *b,
const struct bkey_packed *l,
struct bpos r)
{
return bkey_cmp_left_packed(b, l, &r);
}
static __always_inline bool bpos_eq(struct bpos l, struct bpos r)
{
return !((l.inode ^ r.inode) |
(l.offset ^ r.offset) |
(l.snapshot ^ r.snapshot));
}
static __always_inline bool bpos_lt(struct bpos l, struct bpos r)
{
return l.inode != r.inode ? l.inode < r.inode :
l.offset != r.offset ? l.offset < r.offset :
l.snapshot != r.snapshot ? l.snapshot < r.snapshot : false;
}
static __always_inline bool bpos_le(struct bpos l, struct bpos r)
{
return l.inode != r.inode ? l.inode < r.inode :
l.offset != r.offset ? l.offset < r.offset :
l.snapshot != r.snapshot ? l.snapshot < r.snapshot : true;
}
static __always_inline bool bpos_gt(struct bpos l, struct bpos r)
{
return bpos_lt(r, l);
}
static __always_inline bool bpos_ge(struct bpos l, struct bpos r)
{
return bpos_le(r, l);
}
static __always_inline int bpos_cmp(struct bpos l, struct bpos r)
{
return cmp_int(l.inode, r.inode) ?:
cmp_int(l.offset, r.offset) ?:
cmp_int(l.snapshot, r.snapshot);
}
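/*
 * Illustration only: bpos_cmp() chains three-way comparisons with the GNU
 * "a ?: b" extension, which yields a if it is nonzero and b otherwise, so the
 * first field that differs decides the result. The sketch below assumes
 * cmp_int() is the usual ((l > r) - (l < r)) idiom; struct toy_pos and the
 * toy_* names are hypothetical. It needs GCC or Clang because of "?:".
 */

```c
#include <assert.h>
#include <stdint.h>

/* Three-way compare: -1, 0 or 1, without overflow from subtraction. */
#define toy_cmp_int(l, r)	(((l) > (r)) - ((l) < (r)))

struct toy_pos {
	uint64_t inode;
	uint64_t offset;
	uint32_t snapshot;
};

/* Lexicographic order: inode first, then offset, then snapshot. */
static int toy_pos_cmp(struct toy_pos l, struct toy_pos r)
{
	return toy_cmp_int(l.inode, r.inode) ?:
	       toy_cmp_int(l.offset, r.offset) ?:
	       toy_cmp_int(l.snapshot, r.snapshot);
}
```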
static inline struct bpos bpos_min(struct bpos l, struct bpos r)
{
return bpos_lt(l, r) ? l : r;
}
static inline struct bpos bpos_max(struct bpos l, struct bpos r)
{
return bpos_gt(l, r) ? l : r;
}
static __always_inline bool bkey_eq(struct bpos l, struct bpos r)
{
return !((l.inode ^ r.inode) |
(l.offset ^ r.offset));
}
static __always_inline bool bkey_lt(struct bpos l, struct bpos r)
{
return l.inode != r.inode
? l.inode < r.inode
: l.offset < r.offset;
}
static __always_inline bool bkey_le(struct bpos l, struct bpos r)
{
return l.inode != r.inode
? l.inode < r.inode
: l.offset <= r.offset;
}
static __always_inline bool bkey_gt(struct bpos l, struct bpos r)
{
return bkey_lt(r, l);
}
static __always_inline bool bkey_ge(struct bpos l, struct bpos r)
{
return bkey_le(r, l);
}
static __always_inline int bkey_cmp(struct bpos l, struct bpos r)
{
return cmp_int(l.inode, r.inode) ?:
cmp_int(l.offset, r.offset);
}
static inline struct bpos bkey_min(struct bpos l, struct bpos r)
{
return bkey_lt(l, r) ? l : r;
}
static inline struct bpos bkey_max(struct bpos l, struct bpos r)
{
return bkey_gt(l, r) ? l : r;
}
static inline bool bkey_and_val_eq(struct bkey_s_c l, struct bkey_s_c r)
{
return bpos_eq(l.k->p, r.k->p) &&
l.k->size == r.k->size &&
bkey_bytes(l.k) == bkey_bytes(r.k) &&
!memcmp(l.v, r.v, bkey_val_bytes(l.k));
}
void bch2_bpos_swab(struct bpos *);
void bch2_bkey_swab_key(const struct bkey_format *, struct bkey_packed *);
static __always_inline int bversion_cmp(struct bversion l, struct bversion r)
{
return cmp_int(l.hi, r.hi) ?:
cmp_int(l.lo, r.lo);
}
#define ZERO_VERSION ((struct bversion) { .hi = 0, .lo = 0 })
#define MAX_VERSION ((struct bversion) { .hi = ~0, .lo = ~0ULL })
static __always_inline bool bversion_zero(struct bversion v)
{
return bversion_cmp(v, ZERO_VERSION) == 0;
}
#ifdef CONFIG_BCACHEFS_DEBUG
/* statement expressions confusing unlikely()? */
#define bkey_packed(_k) \
({ EBUG_ON((_k)->format > KEY_FORMAT_CURRENT); \
(_k)->format != KEY_FORMAT_CURRENT; })
#else
#define bkey_packed(_k) ((_k)->format != KEY_FORMAT_CURRENT)
#endif
/*
* It's safe to treat an unpacked bkey as a packed one, but not the reverse
*/
static inline struct bkey_packed *bkey_to_packed(struct bkey_i *k)
{
return (struct bkey_packed *) k;
}
static inline const struct bkey_packed *bkey_to_packed_c(const struct bkey_i *k)
{
return (const struct bkey_packed *) k;
}
static inline struct bkey_i *packed_to_bkey(struct bkey_packed *k)
{
return bkey_packed(k) ? NULL : (struct bkey_i *) k;
}
static inline const struct bkey *packed_to_bkey_c(const struct bkey_packed *k)
{
return bkey_packed(k) ? NULL : (const struct bkey *) k;
}
static inline unsigned bkey_format_key_bits(const struct bkey_format *format)
{
return format->bits_per_field[BKEY_FIELD_INODE] +
format->bits_per_field[BKEY_FIELD_OFFSET] +
format->bits_per_field[BKEY_FIELD_SNAPSHOT];
}
static inline struct bpos bpos_successor(struct bpos p)
{
if (!++p.snapshot &&
!++p.offset &&
!++p.inode)
BUG();
return p;
}
static inline struct bpos bpos_predecessor(struct bpos p)
{
if (!p.snapshot-- &&
!p.offset-- &&
!p.inode--)
BUG();
return p;
}
static inline struct bpos bpos_nosnap_successor(struct bpos p)
{
p.snapshot = 0;
if (!++p.offset &&
!++p.inode)
BUG();
return p;
}
static inline struct bpos bpos_nosnap_predecessor(struct bpos p)
{
p.snapshot = 0;
if (!p.offset-- &&
!p.inode--)
BUG();
return p;
}
static inline u64 bkey_start_offset(const struct bkey *k)
{
return k->p.offset - k->size;
}
static inline struct bpos bkey_start_pos(const struct bkey *k)
{
return (struct bpos) {
.inode = k->p.inode,
.offset = bkey_start_offset(k),
.snapshot = k->p.snapshot,
};
}
/* Packed helpers */
static inline unsigned bkeyp_key_u64s(const struct bkey_format *format,
const struct bkey_packed *k)
{
return bkey_packed(k) ? format->key_u64s : BKEY_U64s;
}
static inline bool bkeyp_u64s_valid(const struct bkey_format *f,
const struct bkey_packed *k)
{
return ((unsigned) k->u64s - bkeyp_key_u64s(f, k) <= U8_MAX - BKEY_U64s);
}
static inline unsigned bkeyp_key_bytes(const struct bkey_format *format,
const struct bkey_packed *k)
{
return bkeyp_key_u64s(format, k) * sizeof(u64);
}
static inline unsigned bkeyp_val_u64s(const struct bkey_format *format,
const struct bkey_packed *k)
{
return k->u64s - bkeyp_key_u64s(format, k);
}
static inline size_t bkeyp_val_bytes(const struct bkey_format *format,
const struct bkey_packed *k)
{
return bkeyp_val_u64s(format, k) * sizeof(u64);
}
static inline void set_bkeyp_val_u64s(const struct bkey_format *format,
struct bkey_packed *k, unsigned val_u64s)
{
k->u64s = bkeyp_key_u64s(format, k) + val_u64s;
}
#define bkeyp_val(_format, _k) \
((struct bch_val *) ((u64 *) (_k)->_data + bkeyp_key_u64s(_format, _k)))
extern const struct bkey_format bch2_bkey_format_current;
bool bch2_bkey_transform(const struct bkey_format *,
struct bkey_packed *,
const struct bkey_format *,
const struct bkey_packed *);
struct bkey __bch2_bkey_unpack_key(const struct bkey_format *,
const struct bkey_packed *);
#ifndef HAVE_BCACHEFS_COMPILED_UNPACK
struct bpos __bkey_unpack_pos(const struct bkey_format *,
const struct bkey_packed *);
#endif
bool bch2_bkey_pack_key(struct bkey_packed *, const struct bkey *,
const struct bkey_format *);
enum bkey_pack_pos_ret {
BKEY_PACK_POS_EXACT,
BKEY_PACK_POS_SMALLER,
BKEY_PACK_POS_FAIL,
};
enum bkey_pack_pos_ret bch2_bkey_pack_pos_lossy(struct bkey_packed *, struct bpos,
const struct btree *);
static inline bool bkey_pack_pos(struct bkey_packed *out, struct bpos in,
const struct btree *b)
{
return bch2_bkey_pack_pos_lossy(out, in, b) == BKEY_PACK_POS_EXACT;
}
void bch2_bkey_unpack(const struct btree *, struct bkey_i *,
const struct bkey_packed *);
bool bch2_bkey_pack(struct bkey_packed *, const struct bkey_i *,
const struct bkey_format *);
typedef void (*compiled_unpack_fn)(struct bkey *, const struct bkey_packed *);
static inline void
__bkey_unpack_key_format_checked(const struct btree *b,
struct bkey *dst,
const struct bkey_packed *src)
{
if (IS_ENABLED(HAVE_BCACHEFS_COMPILED_UNPACK)) {
compiled_unpack_fn unpack_fn = b->aux_data;
unpack_fn(dst, src);
if (static_branch_unlikely(&bch2_debug_check_bkey_unpack)) {
struct bkey dst2 = __bch2_bkey_unpack_key(&b->format, src);
BUG_ON(memcmp(dst, &dst2, sizeof(*dst)));
}
} else {
*dst = __bch2_bkey_unpack_key(&b->format, src);
}
}
static inline struct bkey
bkey_unpack_key_format_checked(const struct btree *b,
const struct bkey_packed *src)
{
struct bkey dst;
__bkey_unpack_key_format_checked(b, &dst, src);
return dst;
}
static inline void __bkey_unpack_key(const struct btree *b,
struct bkey *dst,
const struct bkey_packed *src)
{
if (likely(bkey_packed(src)))
__bkey_unpack_key_format_checked(b, dst, src);
else
*dst = *packed_to_bkey_c(src);
}
/**
* bkey_unpack_key -- unpack just the key, not the value
*/
static inline struct bkey bkey_unpack_key(const struct btree *b,
const struct bkey_packed *src)
{
return likely(bkey_packed(src))
? bkey_unpack_key_format_checked(b, src)
: *packed_to_bkey_c(src);
}
static inline struct bpos
bkey_unpack_pos_format_checked(const struct btree *b,
const struct bkey_packed *src)
{
#ifdef HAVE_BCACHEFS_COMPILED_UNPACK
return bkey_unpack_key_format_checked(b, src).p;
#else
return __bkey_unpack_pos(&b->format, src);
#endif
}
static inline struct bpos bkey_unpack_pos(const struct btree *b,
const struct bkey_packed *src)
{
return likely(bkey_packed(src))
? bkey_unpack_pos_format_checked(b, src)
: packed_to_bkey_c(src)->p;
}
/* Disassembled bkeys */
static inline struct bkey_s_c bkey_disassemble(const struct btree *b,
const struct bkey_packed *k,
struct bkey *u)
{
__bkey_unpack_key(b, u, k);
return (struct bkey_s_c) { u, bkeyp_val(&b->format, k), };
}
/* non const version: */
static inline struct bkey_s __bkey_disassemble(const struct btree *b,
struct bkey_packed *k,
struct bkey *u)
{
__bkey_unpack_key(b, u, k);
return (struct bkey_s) { .k = u, .v = bkeyp_val(&b->format, k), };
}
static inline u64 bkey_field_max(const struct bkey_format *f,
enum bch_bkey_fields nr)
{
return f->bits_per_field[nr] < 64
? (le64_to_cpu(f->field_offset[nr]) +
~(~0ULL << f->bits_per_field[nr]))
: U64_MAX;
}
#ifdef HAVE_BCACHEFS_COMPILED_UNPACK
int bch2_compile_bkey_format(const struct bkey_format *, void *);
#else
static inline int bch2_compile_bkey_format(const struct bkey_format *format,
void *out) { return 0; }
#endif
static inline void bkey_reassemble(struct bkey_i *dst,
struct bkey_s_c src)
{
dst->k = *src.k;
memcpy_u64s_small(&dst->v, src.v, bkey_val_u64s(src.k));
}
/* byte order helpers */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
static inline unsigned high_word_offset(const struct bkey_format *f)
{
return f->key_u64s - 1;
}
#define high_bit_offset 0
#define nth_word(p, n) ((p) - (n))
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
static inline unsigned high_word_offset(const struct bkey_format *f)
{
return 0;
}
#define high_bit_offset KEY_PACKED_BITS_START
#define nth_word(p, n) ((p) + (n))
#else
#error edit for your odd byteorder.
#endif
#define high_word(f, k) ((u64 *) (k)->_data + high_word_offset(f))
#define next_word(p) nth_word(p, 1)
#define prev_word(p) nth_word(p, -1)
#ifdef CONFIG_BCACHEFS_DEBUG
void bch2_bkey_pack_test(void);
#else
static inline void bch2_bkey_pack_test(void) {}
#endif
#define bkey_fields() \
x(BKEY_FIELD_INODE, p.inode) \
x(BKEY_FIELD_OFFSET, p.offset) \
x(BKEY_FIELD_SNAPSHOT, p.snapshot) \
x(BKEY_FIELD_SIZE, size) \
x(BKEY_FIELD_VERSION_HI, bversion.hi) \
x(BKEY_FIELD_VERSION_LO, bversion.lo)
struct bkey_format_state {
u64 field_min[BKEY_NR_FIELDS];
u64 field_max[BKEY_NR_FIELDS];
};
void bch2_bkey_format_init(struct bkey_format_state *);
static inline void __bkey_format_add(struct bkey_format_state *s, unsigned field, u64 v)
{
s->field_min[field] = min(s->field_min[field], v);
s->field_max[field] = max(s->field_max[field], v);
}
/*
* Widens the format state @s so that @k can be successfully packed with the
* format later built from @s
*/
static inline void bch2_bkey_format_add_key(struct bkey_format_state *s, const struct bkey *k)
{
#define x(id, field) __bkey_format_add(s, id, k->field);
bkey_fields()
#undef x
}
void bch2_bkey_format_add_pos(struct bkey_format_state *, struct bpos);
struct bkey_format bch2_bkey_format_done(struct bkey_format_state *);
static inline bool bch2_bkey_format_field_overflows(struct bkey_format *f, unsigned i)
{
unsigned f_bits = f->bits_per_field[i];
unsigned unpacked_bits = bch2_bkey_format_current.bits_per_field[i];
u64 unpacked_mask = ~((~0ULL << 1) << (unpacked_bits - 1));
u64 field_offset = le64_to_cpu(f->field_offset[i]);
if (f_bits > unpacked_bits)
return true;
if ((f_bits == unpacked_bits) && field_offset)
return true;
u64 f_mask = f_bits
? ~((~0ULL << (f_bits - 1)) << 1)
: 0;
if (((field_offset + f_mask) & unpacked_mask) < field_offset)
return true;
return false;
}
int bch2_bkey_format_invalid(struct bch_fs *, struct bkey_format *,
enum bch_validate_flags, struct printbuf *);
void bch2_bkey_format_to_text(struct printbuf *, const struct bkey_format *);
#endif /* _BCACHEFS_BKEY_H */
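
The split shifts in bch2_bkey_format_field_overflows() above exist because shifting a 64-bit value by 64 is undefined in C. A minimal userspace sketch of that idea follows, assuming plain <stdint.h> types. The names low_bits_mask() and field_overflows() are hypothetical and not part of bcachefs, and the little-endian conversion of field_offset is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Mask with the low @bits bits set, for bits in 0..64.
 * The shift is split into (bits - 1) and 1 so that no single step
 * shifts a 64-bit value by 64, which would be undefined behaviour.
 */
static inline uint64_t low_bits_mask(unsigned bits)
{
	assert(bits <= 64);
	return bits ? ~((~0ULL << (bits - 1)) << 1) : 0;
}

/*
 * Hypothetical stand-in for the overflow check: can a packed field of
 * @f_bits bits, biased by @field_offset, produce a value that does not
 * fit in an unpacked field of @unpacked_bits bits?
 */
static inline bool field_overflows(unsigned f_bits, uint64_t field_offset,
				   unsigned unpacked_bits)
{
	uint64_t unpacked_mask = low_bits_mask(unpacked_bits);

	if (f_bits > unpacked_bits)
		return true;
	if (f_bits == unpacked_bits && field_offset)
		return true;
	/* offset + largest packed value must not wrap past the unpacked width */
	return ((field_offset + low_bits_mask(f_bits)) & unpacked_mask) < field_offset;
}
```

For instance, an 8-bit packed field with offset 0xffffff00 still fits a 32-bit unpacked field, while the same field with offset 0xffffff01 wraps and is reported as overflowing.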

@ -1,61 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BKEY_BUF_H
#define _BCACHEFS_BKEY_BUF_H
#include "bcachefs.h"
#include "bkey.h"
struct bkey_buf {
struct bkey_i *k;
u64 onstack[12];
};
static inline void bch2_bkey_buf_realloc(struct bkey_buf *s,
struct bch_fs *c, unsigned u64s)
{
if (s->k == (void *) s->onstack &&
u64s > ARRAY_SIZE(s->onstack)) {
s->k = mempool_alloc(&c->large_bkey_pool, GFP_NOFS);
memcpy(s->k, s->onstack, sizeof(s->onstack));
}
}
static inline void bch2_bkey_buf_reassemble(struct bkey_buf *s,
struct bch_fs *c,
struct bkey_s_c k)
{
bch2_bkey_buf_realloc(s, c, k.k->u64s);
bkey_reassemble(s->k, k);
}
static inline void bch2_bkey_buf_copy(struct bkey_buf *s,
struct bch_fs *c,
struct bkey_i *src)
{
bch2_bkey_buf_realloc(s, c, src->k.u64s);
bkey_copy(s->k, src);
}
static inline void bch2_bkey_buf_unpack(struct bkey_buf *s,
struct bch_fs *c,
struct btree *b,
struct bkey_packed *src)
{
bch2_bkey_buf_realloc(s, c, BKEY_U64s +
bkeyp_val_u64s(&b->format, src));
bch2_bkey_unpack(b, s->k, src);
}
static inline void bch2_bkey_buf_init(struct bkey_buf *s)
{
s->k = (void *) s->onstack;
}
static inline void bch2_bkey_buf_exit(struct bkey_buf *s, struct bch_fs *c)
{
if (s->k != (void *) s->onstack)
mempool_free(s->k, &c->large_bkey_pool);
s->k = NULL;
}
#endif /* _BCACHEFS_BKEY_BUF_H */
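
struct bkey_buf above keeps small keys in an inline array and only switches to a larger allocation when a key does not fit. A minimal userspace sketch of that pattern follows. It is an illustration under assumptions: malloc() stands in for the mempool, LARGE_U64S is an invented size for the large buffer, and the small_buf names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ONSTACK_U64S	12	/* same inline capacity as struct bkey_buf */
#define LARGE_U64S	256	/* assumed size of the one large allocation */

struct small_buf {
	uint64_t *p;
	uint64_t onstack[ONSTACK_U64S];
};

static void small_buf_init(struct small_buf *s)
{
	s->p = s->onstack;
}

/* Make room for @u64s words; returns 0 on success, -1 if malloc() fails. */
static int small_buf_reserve(struct small_buf *s, size_t u64s)
{
	assert(u64s <= LARGE_U64S);

	if (s->p == s->onstack && u64s > ONSTACK_U64S) {
		uint64_t *n = malloc(LARGE_U64S * sizeof(*n));

		if (!n)
			return -1;
		/* carry over whatever was already stored inline */
		memcpy(n, s->onstack, sizeof(s->onstack));
		s->p = n;
	}
	return 0;
}

static void small_buf_exit(struct small_buf *s)
{
	if (s->p != s->onstack)
		free(s->p);
	s->p = NULL;
}
```

As in the original, the switch happens at most once: after the first large allocation the buffer is assumed big enough for any later key.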

@ -1,129 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BKEY_CMP_H
#define _BCACHEFS_BKEY_CMP_H
#include "bkey.h"
#ifdef CONFIG_X86_64
static inline int __bkey_cmp_bits(const u64 *l, const u64 *r,
unsigned nr_key_bits)
{
long d0, d1, d2, d3;
int cmp;
/* we shouldn't need asm for this, but gcc does a poor job with the C version: */
asm(".intel_syntax noprefix;"
"xor eax, eax;"
"xor edx, edx;"
"1:;"
"mov r8, [rdi];"
"mov r9, [rsi];"
"sub ecx, 64;"
"jl 2f;"
"cmp r8, r9;"
"jnz 3f;"
"lea rdi, [rdi - 8];"
"lea rsi, [rsi - 8];"
"jmp 1b;"
"2:;"
"not ecx;"
"shr r8, 1;"
"shr r9, 1;"
"shr r8, cl;"
"shr r9, cl;"
"cmp r8, r9;"
"3:\n"
"seta al;"
"setb dl;"
"sub eax, edx;"
".att_syntax prefix;"
: "=&D" (d0), "=&S" (d1), "=&d" (d2), "=&c" (d3), "=&a" (cmp)
: "0" (l), "1" (r), "3" (nr_key_bits)
: "r8", "r9", "cc", "memory");
return cmp;
}
#else
static inline int __bkey_cmp_bits(const u64 *l, const u64 *r,
unsigned nr_key_bits)
{
u64 l_v, r_v;
if (!nr_key_bits)
return 0;
/* for big endian, skip past header */
nr_key_bits += high_bit_offset;
l_v = *l & (~0ULL >> high_bit_offset);
r_v = *r & (~0ULL >> high_bit_offset);
while (1) {
if (nr_key_bits < 64) {
l_v >>= 64 - nr_key_bits;
r_v >>= 64 - nr_key_bits;
nr_key_bits = 0;
} else {
nr_key_bits -= 64;
}
if (!nr_key_bits || l_v != r_v)
break;
l = next_word(l);
r = next_word(r);
l_v = *l;
r_v = *r;
}
return cmp_int(l_v, r_v);
}
#endif
static inline __pure __flatten
int __bch2_bkey_cmp_packed_format_checked_inlined(const struct bkey_packed *l,
const struct bkey_packed *r,
const struct btree *b)
{
const struct bkey_format *f = &b->format;
int ret;
EBUG_ON(!bkey_packed(l) || !bkey_packed(r));
EBUG_ON(b->nr_key_bits != bkey_format_key_bits(f));
ret = __bkey_cmp_bits(high_word(f, l),
high_word(f, r),
b->nr_key_bits);
EBUG_ON(ret != bpos_cmp(bkey_unpack_pos(b, l),
bkey_unpack_pos(b, r)));
return ret;
}
static inline __pure __flatten
int bch2_bkey_cmp_packed_inlined(const struct btree *b,
const struct bkey_packed *l,
const struct bkey_packed *r)
{
struct bkey unpacked;
if (likely(bkey_packed(l) && bkey_packed(r)))
return __bch2_bkey_cmp_packed_format_checked_inlined(l, r, b);
if (bkey_packed(l)) {
__bkey_unpack_key_format_checked(b, &unpacked, l);
l = (void *) &unpacked;
} else if (bkey_packed(r)) {
__bkey_unpack_key_format_checked(b, &unpacked, r);
r = (void *) &unpacked;
}
return bpos_cmp(((struct bkey *) l)->p, ((struct bkey *) r)->p);
}
#endif /* _BCACHEFS_BKEY_CMP_H */
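
The generic __bkey_cmp_bits() above compares only the leading nr_key_bits of two packed keys, one 64-bit word at a time. A simplified sketch of that loop follows. It assumes the most significant word is stored first and that there is no header to skip, so it does not reproduce the endian-specific word walking of the original; cmp_key_bits() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compare the first @nr_bits bits of two bit strings stored as arrays of
 * 64-bit words, most significant word first. Returns -1, 0 or 1.
 */
static int cmp_key_bits(const uint64_t *l, const uint64_t *r, unsigned nr_bits)
{
	assert(l && r);

	while (nr_bits) {
		uint64_t l_v = *l++;
		uint64_t r_v = *r++;

		if (nr_bits < 64) {
			/* last, partial word: keep only its top nr_bits bits */
			l_v >>= 64 - nr_bits;
			r_v >>= 64 - nr_bits;
			nr_bits = 0;
		} else {
			nr_bits -= 64;
		}

		if (l_v != r_v)
			return (l_v > r_v) - (l_v < r_v);
	}
	return 0;
}
```

Bits past nr_bits never influence the result, which is what lets the comparison run directly on packed keys whose trailing bits hold non-key fields.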

@ -1,497 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
#include "bcachefs.h"
#include "backpointers.h"
#include "bkey_methods.h"
#include "btree_cache.h"
#include "btree_types.h"
#include "alloc_background.h"
#include "dirent.h"
#include "disk_accounting.h"
#include "ec.h"
#include "error.h"
#include "extents.h"
#include "inode.h"
#include "io_misc.h"
#include "lru.h"
#include "quota.h"
#include "reflink.h"
#include "snapshot.h"
#include "subvolume.h"
#include "xattr.h"
const char * const bch2_bkey_types[] = {
#define x(name, nr, ...) #name,
BCH_BKEY_TYPES()
#undef x
NULL
};
static int deleted_key_validate(struct bch_fs *c, struct bkey_s_c k,
struct bkey_validate_context from)
{
return 0;
}
#define bch2_bkey_ops_deleted ((struct bkey_ops) { \
.key_validate = deleted_key_validate, \
})
#define bch2_bkey_ops_whiteout ((struct bkey_ops) { \
.key_validate = deleted_key_validate, \
})
static int empty_val_key_validate(struct bch_fs *c, struct bkey_s_c k,
struct bkey_validate_context from)
{
int ret = 0;
bkey_fsck_err_on(bkey_val_bytes(k.k),
c, bkey_val_size_nonzero,
"incorrect value size (%zu != 0)",
bkey_val_bytes(k.k));
fsck_err:
return ret;
}
#define bch2_bkey_ops_error ((struct bkey_ops) { \
.key_validate = empty_val_key_validate, \
})
static int key_type_cookie_validate(struct bch_fs *c, struct bkey_s_c k,
struct bkey_validate_context from)
{
return 0;
}
static void key_type_cookie_to_text(struct printbuf *out, struct bch_fs *c,
struct bkey_s_c k)
{
struct bkey_s_c_cookie ck = bkey_s_c_to_cookie(k);
prt_printf(out, "%llu", le64_to_cpu(ck.v->cookie));
}
#define bch2_bkey_ops_cookie ((struct bkey_ops) { \
.key_validate = key_type_cookie_validate, \
.val_to_text = key_type_cookie_to_text, \
.min_val_size = 8, \
})
#define bch2_bkey_ops_hash_whiteout ((struct bkey_ops) {\
.key_validate = empty_val_key_validate, \
})
static int key_type_inline_data_validate(struct bch_fs *c, struct bkey_s_c k,
struct bkey_validate_context from)
{
return 0;
}
static void key_type_inline_data_to_text(struct printbuf *out, struct bch_fs *c,
struct bkey_s_c k)
{
struct bkey_s_c_inline_data d = bkey_s_c_to_inline_data(k);
unsigned datalen = bkey_inline_data_bytes(k.k);
prt_printf(out, "datalen %u: %*phN",
datalen, min(datalen, 32U), d.v->data);
}
#define bch2_bkey_ops_inline_data ((struct bkey_ops) { \
.key_validate = key_type_inline_data_validate, \
.val_to_text = key_type_inline_data_to_text, \
})
static bool key_type_set_merge(struct bch_fs *c, struct bkey_s l, struct bkey_s_c r)
{
bch2_key_resize(l.k, l.k->size + r.k->size);
return true;
}
#define bch2_bkey_ops_set ((struct bkey_ops) { \
.key_validate = empty_val_key_validate, \
.key_merge = key_type_set_merge, \
})
const struct bkey_ops bch2_bkey_ops[] = {
#define x(name, nr, ...) [KEY_TYPE_##name] = bch2_bkey_ops_##name,
BCH_BKEY_TYPES()
#undef x
};
const struct bkey_ops bch2_bkey_null_ops = {
};
int bch2_bkey_val_validate(struct bch_fs *c, struct bkey_s_c k,
struct bkey_validate_context from)
{
if (test_bit(BCH_FS_no_invalid_checks, &c->flags))
return 0;
const struct bkey_ops *ops = bch2_bkey_type_ops(k.k->type);
int ret = 0;
bkey_fsck_err_on(bkey_val_bytes(k.k) < ops->min_val_size,
c, bkey_val_size_too_small,
"bad val size (%zu < %u)",
bkey_val_bytes(k.k), ops->min_val_size);
if (!ops->key_validate)
return 0;
ret = ops->key_validate(c, k, from);
fsck_err:
return ret;
}
static u64 bch2_key_types_allowed[] = {
[BKEY_TYPE_btree] =
BIT_ULL(KEY_TYPE_deleted)|
BIT_ULL(KEY_TYPE_btree_ptr)|
BIT_ULL(KEY_TYPE_btree_ptr_v2),
#define x(name, nr, flags, keys) [BKEY_TYPE_##name] = BIT_ULL(KEY_TYPE_deleted)|keys,
BCH_BTREE_IDS()
#undef x
};
static const enum bch_bkey_type_flags bch2_bkey_type_flags[] = {
#define x(name, nr, flags) [KEY_TYPE_##name] = flags,
BCH_BKEY_TYPES()
#undef x
};
const char *bch2_btree_node_type_str(enum btree_node_type type)
{
return type == BKEY_TYPE_btree ? "internal btree node" : bch2_btree_id_str(type - 1);
}
int __bch2_bkey_validate(struct bch_fs *c, struct bkey_s_c k,
struct bkey_validate_context from)
{
enum btree_node_type type = __btree_node_type(from.level, from.btree);
if (test_bit(BCH_FS_no_invalid_checks, &c->flags))
return 0;
int ret = 0;
bkey_fsck_err_on(k.k->u64s < BKEY_U64s,
c, bkey_u64s_too_small,
"u64s too small (%u < %zu)", k.k->u64s, BKEY_U64s);
if (type >= BKEY_TYPE_NR)
return 0;
enum bch_bkey_type_flags bkey_flags = k.k->type < KEY_TYPE_MAX
? bch2_bkey_type_flags[k.k->type]
: 0;
bool strict_key_type_allowed =
(from.flags & BCH_VALIDATE_commit) ||
type == BKEY_TYPE_btree ||
(from.btree < BTREE_ID_NR &&
(bkey_flags & BKEY_TYPE_strict_btree_checks));
bkey_fsck_err_on(strict_key_type_allowed &&
k.k->type < KEY_TYPE_MAX &&
!(bch2_key_types_allowed[type] & BIT_ULL(k.k->type)),
c, bkey_invalid_type_for_btree,
"invalid key type for btree %s (%s)",
bch2_btree_node_type_str(type),
k.k->type < KEY_TYPE_MAX
? bch2_bkey_types[k.k->type]
: "(unknown)");
if (btree_node_type_is_extents(type) && !bkey_whiteout(k.k)) {
bkey_fsck_err_on(k.k->size == 0,
c, bkey_extent_size_zero,
"size == 0");
bkey_fsck_err_on(k.k->size > k.k->p.offset,
c, bkey_extent_size_greater_than_offset,
"size greater than offset (%u > %llu)",
k.k->size, k.k->p.offset);
} else {
bkey_fsck_err_on(k.k->size,
c, bkey_size_nonzero,
"size != 0");
}
if (type != BKEY_TYPE_btree) {
enum btree_id btree = type - 1;
if (btree_type_has_snapshots(btree)) {
bkey_fsck_err_on(!k.k->p.snapshot,
c, bkey_snapshot_zero,
"snapshot == 0");
} else if (!btree_type_has_snapshot_field(btree)) {
bkey_fsck_err_on(k.k->p.snapshot,
c, bkey_snapshot_nonzero,
"nonzero snapshot");
} else {
/*
* btree uses snapshot field but it's not required to be
* nonzero
*/
}
bkey_fsck_err_on(bkey_eq(k.k->p, POS_MAX),
c, bkey_at_pos_max,
"key at POS_MAX");
}
fsck_err:
return ret;
}
int bch2_bkey_validate(struct bch_fs *c, struct bkey_s_c k,
struct bkey_validate_context from)
{
return __bch2_bkey_validate(c, k, from) ?:
bch2_bkey_val_validate(c, k, from);
}
int bch2_bkey_in_btree_node(struct bch_fs *c, struct btree *b,
struct bkey_s_c k,
struct bkey_validate_context from)
{
int ret = 0;
bkey_fsck_err_on(bpos_lt(k.k->p, b->data->min_key),
c, bkey_before_start_of_btree_node,
"key before start of btree node");
bkey_fsck_err_on(bpos_gt(k.k->p, b->data->max_key),
c, bkey_after_end_of_btree_node,
"key past end of btree node");
fsck_err:
return ret;
}
void bch2_bpos_to_text(struct printbuf *out, struct bpos pos)
{
if (bpos_eq(pos, POS_MIN))
prt_printf(out, "POS_MIN");
else if (bpos_eq(pos, POS_MAX))
prt_printf(out, "POS_MAX");
else if (bpos_eq(pos, SPOS_MAX))
prt_printf(out, "SPOS_MAX");
else {
if (pos.inode == U64_MAX)
prt_printf(out, "U64_MAX");
else
prt_printf(out, "%llu", pos.inode);
prt_printf(out, ":");
if (pos.offset == U64_MAX)
prt_printf(out, "U64_MAX");
else
prt_printf(out, "%llu", pos.offset);
prt_printf(out, ":");
if (pos.snapshot == U32_MAX)
prt_printf(out, "U32_MAX");
else
prt_printf(out, "%u", pos.snapshot);
}
}
void bch2_bkey_to_text(struct printbuf *out, const struct bkey *k)
{
if (k) {
prt_printf(out, "u64s %u type ", k->u64s);
if (k->type < KEY_TYPE_MAX)
prt_printf(out, "%s ", bch2_bkey_types[k->type]);
else
prt_printf(out, "%u ", k->type);
bch2_bpos_to_text(out, k->p);
prt_printf(out, " len %u ver %llu", k->size, k->bversion.lo);
} else {
prt_printf(out, "(null)");
}
}
void bch2_val_to_text(struct printbuf *out, struct bch_fs *c,
struct bkey_s_c k)
{
const struct bkey_ops *ops = bch2_bkey_type_ops(k.k->type);
if (likely(ops->val_to_text))
ops->val_to_text(out, c, k);
}
void bch2_bkey_val_to_text(struct printbuf *out, struct bch_fs *c,
struct bkey_s_c k)
{
bch2_bkey_to_text(out, k.k);
if (bkey_val_bytes(k.k)) {
prt_printf(out, ": ");
bch2_val_to_text(out, c, k);
}
}
void bch2_bkey_swab_val(struct bkey_s k)
{
const struct bkey_ops *ops = bch2_bkey_type_ops(k.k->type);
if (ops->swab)
ops->swab(k);
}
bool bch2_bkey_normalize(struct bch_fs *c, struct bkey_s k)
{
const struct bkey_ops *ops = bch2_bkey_type_ops(k.k->type);
return ops->key_normalize
? ops->key_normalize(c, k)
: false;
}
bool bch2_bkey_merge(struct bch_fs *c, struct bkey_s l, struct bkey_s_c r)
{
const struct bkey_ops *ops = bch2_bkey_type_ops(l.k->type);
return ops->key_merge &&
bch2_bkey_maybe_mergable(l.k, r.k) &&
(u64) l.k->size + r.k->size <= KEY_SIZE_MAX &&
!static_branch_unlikely(&bch2_key_merging_disabled) &&
ops->key_merge(c, l, r);
}
static const struct old_bkey_type {
u8 btree_node_type;
u8 old;
u8 new;
} bkey_renumber_table[] = {
{BKEY_TYPE_btree, 128, KEY_TYPE_btree_ptr },
{BKEY_TYPE_extents, 128, KEY_TYPE_extent },
{BKEY_TYPE_extents, 129, KEY_TYPE_extent },
{BKEY_TYPE_extents, 130, KEY_TYPE_reservation },
{BKEY_TYPE_inodes, 128, KEY_TYPE_inode },
{BKEY_TYPE_inodes, 130, KEY_TYPE_inode_generation },
{BKEY_TYPE_dirents, 128, KEY_TYPE_dirent },
{BKEY_TYPE_dirents, 129, KEY_TYPE_hash_whiteout },
{BKEY_TYPE_xattrs, 128, KEY_TYPE_xattr },
{BKEY_TYPE_xattrs, 129, KEY_TYPE_hash_whiteout },
{BKEY_TYPE_alloc, 128, KEY_TYPE_alloc },
{BKEY_TYPE_quotas, 128, KEY_TYPE_quota },
};
void bch2_bkey_renumber(enum btree_node_type btree_node_type,
struct bkey_packed *k,
int write)
{
const struct old_bkey_type *i;
for (i = bkey_renumber_table;
i < bkey_renumber_table + ARRAY_SIZE(bkey_renumber_table);
i++)
if (btree_node_type == i->btree_node_type &&
k->type == (write ? i->new : i->old)) {
k->type = write ? i->old : i->new;
break;
}
}
void __bch2_bkey_compat(unsigned level, enum btree_id btree_id,
unsigned version, unsigned big_endian,
int write,
struct bkey_format *f,
struct bkey_packed *k)
{
const struct bkey_ops *ops;
struct bkey uk;
unsigned nr_compat = 5;
int i;
/*
* Do these operations in reverse order in the write path:
*/
for (i = 0; i < nr_compat; i++)
switch (!write ? i : nr_compat - 1 - i) {
case 0:
if (big_endian != CPU_BIG_ENDIAN) {
bch2_bkey_swab_key(f, k);
} else if (IS_ENABLED(CONFIG_BCACHEFS_DEBUG)) {
bch2_bkey_swab_key(f, k);
bch2_bkey_swab_key(f, k);
}
break;
case 1:
if (version < bcachefs_metadata_version_bkey_renumber)
bch2_bkey_renumber(__btree_node_type(level, btree_id), k, write);
break;
case 2:
if (version < bcachefs_metadata_version_inode_btree_change &&
btree_id == BTREE_ID_inodes) {
if (!bkey_packed(k)) {
struct bkey_i *u = packed_to_bkey(k);
swap(u->k.p.inode, u->k.p.offset);
} else if (f->bits_per_field[BKEY_FIELD_INODE] &&
f->bits_per_field[BKEY_FIELD_OFFSET]) {
struct bkey_format tmp = *f, *in = f, *out = &tmp;
swap(tmp.bits_per_field[BKEY_FIELD_INODE],
tmp.bits_per_field[BKEY_FIELD_OFFSET]);
swap(tmp.field_offset[BKEY_FIELD_INODE],
tmp.field_offset[BKEY_FIELD_OFFSET]);
if (!write)
swap(in, out);
uk = __bch2_bkey_unpack_key(in, k);
swap(uk.p.inode, uk.p.offset);
BUG_ON(!bch2_bkey_pack_key(k, &uk, out));
}
}
break;
case 3:
if (version < bcachefs_metadata_version_snapshot &&
(level || btree_type_has_snapshots(btree_id))) {
struct bkey_i *u = packed_to_bkey(k);
if (u) {
u->k.p.snapshot = write
? 0 : U32_MAX;
} else {
u64 min_packed = le64_to_cpu(f->field_offset[BKEY_FIELD_SNAPSHOT]);
u64 max_packed = min_packed +
~(~0ULL << f->bits_per_field[BKEY_FIELD_SNAPSHOT]);
uk = __bch2_bkey_unpack_key(f, k);
uk.p.snapshot = write
? min_packed : min_t(u64, U32_MAX, max_packed);
BUG_ON(!bch2_bkey_pack_key(k, &uk, f));
}
}
break;
case 4: {
struct bkey_s u;
if (!bkey_packed(k)) {
u = bkey_i_to_s(packed_to_bkey(k));
} else {
uk = __bch2_bkey_unpack_key(f, k);
u.k = &uk;
u.v = bkeyp_val(f, k);
}
if (big_endian != CPU_BIG_ENDIAN)
bch2_bkey_swab_val(u);
ops = bch2_bkey_type_ops(k->type);
if (ops->compat)
ops->compat(btree_id, version, big_endian, write, u);
break;
}
default:
BUG();
}
}
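
__bch2_bkey_compat() above runs its conversion steps in ascending order on the read path and in descending order on the write path, so that writing undoes exactly what reading did. A toy sketch of that ordering follows. The two steps are invented for illustration and have nothing to do with the real compat operations; toy_compat() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_STEPS 2

/* Apply the toy steps in order on read, in reverse order on write. */
static uint32_t toy_compat(uint32_t v, bool write)
{
	for (unsigned i = 0; i < NR_STEPS; i++)
		switch (!write ? i : NR_STEPS - 1 - i) {
		case 0:
			/* self-inverse step, comparable to a byte swap */
			v ^= 0xff;
			break;
		case 1:
			/* directional step: the write path undoes the read path */
			v = write ? v - 10 : v + 10;
			break;
		default:
			assert(0);
		}
	return v;
}
```

Because step 1 is applied last on read and first on write, a value that is read and then written comes back unchanged; running both directions in the same order would not round-trip.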

@ -1,139 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BKEY_METHODS_H
#define _BCACHEFS_BKEY_METHODS_H
#include "bkey.h"
struct bch_fs;
struct btree;
struct btree_trans;
struct bkey;
enum btree_node_type;
extern const char * const bch2_bkey_types[];
extern const struct bkey_ops bch2_bkey_null_ops;
/*
* key_validate: checks validity of @k, returns 0 if good or a nonzero error
* code if bad. If invalid, entire key will be deleted.
*
* @from describes the context the key is being validated in (btree, level and
* validation flags); stricter checks apply when BCH_VALIDATE_commit is set.
*/
struct bkey_ops {
int (*key_validate)(struct bch_fs *c, struct bkey_s_c k,
struct bkey_validate_context from);
void (*val_to_text)(struct printbuf *, struct bch_fs *,
struct bkey_s_c);
void (*swab)(struct bkey_s);
bool (*key_normalize)(struct bch_fs *, struct bkey_s);
bool (*key_merge)(struct bch_fs *, struct bkey_s, struct bkey_s_c);
int (*trigger)(struct btree_trans *, enum btree_id, unsigned,
struct bkey_s_c, struct bkey_s,
enum btree_iter_update_trigger_flags);
void (*compat)(enum btree_id id, unsigned version,
unsigned big_endian, int write,
struct bkey_s);
/* Size of value type when first created: */
unsigned min_val_size;
};
extern const struct bkey_ops bch2_bkey_ops[];
static inline const struct bkey_ops *bch2_bkey_type_ops(enum bch_bkey_type type)
{
return likely(type < KEY_TYPE_MAX)
? &bch2_bkey_ops[type]
: &bch2_bkey_null_ops;
}
int bch2_bkey_val_validate(struct bch_fs *, struct bkey_s_c,
struct bkey_validate_context);
int __bch2_bkey_validate(struct bch_fs *, struct bkey_s_c,
struct bkey_validate_context);
int bch2_bkey_validate(struct bch_fs *, struct bkey_s_c,
struct bkey_validate_context);
int bch2_bkey_in_btree_node(struct bch_fs *, struct btree *, struct bkey_s_c,
struct bkey_validate_context from);
void bch2_bpos_to_text(struct printbuf *, struct bpos);
void bch2_bkey_to_text(struct printbuf *, const struct bkey *);
void bch2_val_to_text(struct printbuf *, struct bch_fs *,
struct bkey_s_c);
void bch2_bkey_val_to_text(struct printbuf *, struct bch_fs *,
struct bkey_s_c);
void bch2_bkey_swab_val(struct bkey_s);
bool bch2_bkey_normalize(struct bch_fs *, struct bkey_s);
static inline bool bch2_bkey_maybe_mergable(const struct bkey *l, const struct bkey *r)
{
return l->type == r->type &&
!bversion_cmp(l->bversion, r->bversion) &&
bpos_eq(l->p, bkey_start_pos(r));
}
bool bch2_bkey_merge(struct bch_fs *, struct bkey_s, struct bkey_s_c);
static inline int bch2_key_trigger(struct btree_trans *trans,
enum btree_id btree, unsigned level,
struct bkey_s_c old, struct bkey_s new,
enum btree_iter_update_trigger_flags flags)
{
const struct bkey_ops *ops = bch2_bkey_type_ops(old.k->type ?: new.k->type);
return ops->trigger
? ops->trigger(trans, btree, level, old, new, flags)
: 0;
}
static inline int bch2_key_trigger_old(struct btree_trans *trans,
enum btree_id btree_id, unsigned level,
struct bkey_s_c old,
enum btree_iter_update_trigger_flags flags)
{
struct bkey_i deleted;
bkey_init(&deleted.k);
deleted.k.p = old.k->p;
return bch2_key_trigger(trans, btree_id, level, old, bkey_i_to_s(&deleted),
BTREE_TRIGGER_overwrite|flags);
}
static inline int bch2_key_trigger_new(struct btree_trans *trans,
enum btree_id btree_id, unsigned level,
struct bkey_s new,
enum btree_iter_update_trigger_flags flags)
{
struct bkey_i deleted;
bkey_init(&deleted.k);
deleted.k.p = new.k->p;
return bch2_key_trigger(trans, btree_id, level, bkey_i_to_s_c(&deleted), new,
BTREE_TRIGGER_insert|flags);
}
void bch2_bkey_renumber(enum btree_node_type, struct bkey_packed *, int);
void __bch2_bkey_compat(unsigned, enum btree_id, unsigned, unsigned,
int, struct bkey_format *, struct bkey_packed *);
static inline void bch2_bkey_compat(unsigned level, enum btree_id btree_id,
unsigned version, unsigned big_endian,
int write,
struct bkey_format *f,
struct bkey_packed *k)
{
if (version < bcachefs_metadata_version_current ||
big_endian != CPU_BIG_ENDIAN ||
IS_ENABLED(CONFIG_BCACHEFS_DEBUG))
__bch2_bkey_compat(level, btree_id, version,
big_endian, write, f, k);
}
#endif /* _BCACHEFS_BKEY_METHODS_H */
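
bch2_bkey_type_ops() above looks up a per-type operations table and falls back to an empty ops struct for unknown key types, so callers only have to test individual hooks for NULL. A self-contained sketch of that dispatch follows, with the table generated by an x-macro in the same way as bch2_bkey_ops[]. All toy_ names are hypothetical, and folding the 8-byte minimum into the validate hook is a simplification of min_val_size.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_KEY_TYPES()		\
	x(deleted,	0)	\
	x(cookie,	1)	\
	x(set,		2)

enum toy_key_type {
#define x(name, nr) TOY_KEY_##name = nr,
	TOY_KEY_TYPES()
#undef x
	TOY_KEY_MAX
};

struct toy_ops {
	int (*validate)(unsigned val_bytes);	/* NULL: nothing to check */
};

static int cookie_validate(unsigned val_bytes)
{
	return val_bytes >= 8 ? 0 : -1;
}

/* One initializer per key type, picked up by name in the table below. */
#define toy_ops_deleted	{ .validate = NULL }
#define toy_ops_cookie	{ .validate = cookie_validate }
#define toy_ops_set	{ .validate = NULL }

static const struct toy_ops toy_ops_table[] = {
#define x(name, nr) [TOY_KEY_##name] = toy_ops_##name,
	TOY_KEY_TYPES()
#undef x
};

static const struct toy_ops toy_null_ops = { .validate = NULL };

/* Unknown types get the empty ops struct instead of an out-of-bounds read. */
static const struct toy_ops *toy_type_ops(unsigned type)
{
	assert(sizeof(toy_ops_table) / sizeof(toy_ops_table[0]) == TOY_KEY_MAX);
	return type < TOY_KEY_MAX ? &toy_ops_table[type] : &toy_null_ops;
}

static int toy_validate(unsigned type, unsigned val_bytes)
{
	const struct toy_ops *ops = toy_type_ops(type);

	return ops->validate ? ops->validate(val_bytes) : 0;
}
```

Adding a key type then means adding one line to TOY_KEY_TYPES() and one toy_ops_ initializer; the enum and the table stay in sync automatically.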

@ -1,214 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
#include "bcachefs.h"
#include "bkey_buf.h"
#include "bkey_cmp.h"
#include "bkey_sort.h"
#include "bset.h"
#include "extents.h"
typedef int (*sort_cmp_fn)(const struct btree *,
const struct bkey_packed *,
const struct bkey_packed *);
static inline bool sort_iter_end(struct sort_iter *iter)
{
return !iter->used;
}
static inline void sort_iter_sift(struct sort_iter *iter, unsigned from,
sort_cmp_fn cmp)
{
unsigned i;
for (i = from;
i + 1 < iter->used &&
cmp(iter->b, iter->data[i].k, iter->data[i + 1].k) > 0;
i++)
swap(iter->data[i], iter->data[i + 1]);
}
static inline void sort_iter_sort(struct sort_iter *iter, sort_cmp_fn cmp)
{
unsigned i = iter->used;
while (i--)
sort_iter_sift(iter, i, cmp);
}
static inline struct bkey_packed *sort_iter_peek(struct sort_iter *iter)
{
return !sort_iter_end(iter) ? iter->data->k : NULL;
}
static inline void sort_iter_advance(struct sort_iter *iter, sort_cmp_fn cmp)
{
struct sort_iter_set *i = iter->data;
BUG_ON(!iter->used);
i->k = bkey_p_next(i->k);
BUG_ON(i->k > i->end);
if (i->k == i->end)
array_remove_item(iter->data, iter->used, 0);
else
sort_iter_sift(iter, 0, cmp);
}
static inline struct bkey_packed *sort_iter_next(struct sort_iter *iter,
sort_cmp_fn cmp)
{
struct bkey_packed *ret = sort_iter_peek(iter);
if (ret)
sort_iter_advance(iter, cmp);
return ret;
}
/*
* If keys compare equal, compare by pointer order:
*/
static inline int key_sort_fix_overlapping_cmp(const struct btree *b,
const struct bkey_packed *l,
const struct bkey_packed *r)
{
return bch2_bkey_cmp_packed(b, l, r) ?:
cmp_int((unsigned long) l, (unsigned long) r);
}
static inline bool should_drop_next_key(struct sort_iter *iter)
{
/*
* key_sort_fix_overlapping_cmp() ensures that when keys compare equal the
* older key comes first; so if iter->data[0].k compares equal to
* iter->data[1].k then iter->data[0].k is older and should be dropped.
*/
return iter->used >= 2 &&
!bch2_bkey_cmp_packed(iter->b,
iter->data[0].k,
iter->data[1].k);
}
struct btree_nr_keys
bch2_key_sort_fix_overlapping(struct bch_fs *c, struct bset *dst,
struct sort_iter *iter)
{
struct bkey_packed *out = dst->start;
struct bkey_packed *k;
struct btree_nr_keys nr;
memset(&nr, 0, sizeof(nr));
sort_iter_sort(iter, key_sort_fix_overlapping_cmp);
while ((k = sort_iter_peek(iter))) {
if (!bkey_deleted(k) &&
!should_drop_next_key(iter)) {
bkey_p_copy(out, k);
btree_keys_account_key_add(&nr, 0, out);
out = bkey_p_next(out);
}
sort_iter_advance(iter, key_sort_fix_overlapping_cmp);
}
dst->u64s = cpu_to_le16((u64 *) out - dst->_data);
return nr;
}
/* Sort + repack in a new format: */
struct btree_nr_keys
bch2_sort_repack(struct bset *dst, struct btree *src,
struct btree_node_iter *src_iter,
struct bkey_format *out_f,
bool filter_whiteouts)
{
struct bkey_format *in_f = &src->format;
struct bkey_packed *in, *out = vstruct_last(dst);
struct btree_nr_keys nr;
bool transform = memcmp(out_f, &src->format, sizeof(*out_f));
memset(&nr, 0, sizeof(nr));
while ((in = bch2_btree_node_iter_next_all(src_iter, src))) {
if (filter_whiteouts && bkey_deleted(in))
continue;
if (!transform)
bkey_p_copy(out, in);
else if (bch2_bkey_transform(out_f, out, bkey_packed(in)
? in_f : &bch2_bkey_format_current, in))
out->format = KEY_FORMAT_LOCAL_BTREE;
else
bch2_bkey_unpack(src, (void *) out, in);
out->needs_whiteout = false;
btree_keys_account_key_add(&nr, 0, out);
out = bkey_p_next(out);
}
dst->u64s = cpu_to_le16((u64 *) out - dst->_data);
return nr;
}
static inline int keep_unwritten_whiteouts_cmp(const struct btree *b,
const struct bkey_packed *l,
const struct bkey_packed *r)
{
return bch2_bkey_cmp_packed_inlined(b, l, r) ?:
(int) bkey_deleted(r) - (int) bkey_deleted(l) ?:
(long) l - (long) r;
}
#include "btree_update_interior.h"
/*
* For sorting in the btree node write path: whiteouts not in the unwritten
* whiteouts area are dropped, whiteouts in the unwritten whiteouts area are
* dropped if overwritten by real keys:
*/
unsigned bch2_sort_keys_keep_unwritten_whiteouts(struct bkey_packed *dst, struct sort_iter *iter)
{
struct bkey_packed *in, *next, *out = dst;
sort_iter_sort(iter, keep_unwritten_whiteouts_cmp);
while ((in = sort_iter_next(iter, keep_unwritten_whiteouts_cmp))) {
if (bkey_deleted(in) && in < unwritten_whiteouts_start(iter->b))
continue;
if ((next = sort_iter_peek(iter)) &&
!bch2_bkey_cmp_packed_inlined(iter->b, in, next))
continue;
bkey_p_copy(out, in);
out = bkey_p_next(out);
}
return (u64 *) out - (u64 *) dst;
}
/*
* Main sort routine for compacting a btree node in memory: we always drop
* whiteouts because any whiteouts that need to be written are in the unwritten
* whiteouts area:
*/
unsigned bch2_sort_keys(struct bkey_packed *dst, struct sort_iter *iter)
{
struct bkey_packed *in, *out = dst;
sort_iter_sort(iter, bch2_bkey_cmp_packed_inlined);
while ((in = sort_iter_next(iter, bch2_bkey_cmp_packed_inlined))) {
if (bkey_deleted(in))
continue;
bkey_p_copy(out, in);
out = bkey_p_next(out);
}
return (u64 *) out - (u64 *) dst;
}
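
The merge performed by bch2_key_sort_fix_overlapping() above (keep a small array of sorted runs ordered by their head key, emit the front key, advance, re-sift) can be sketched in userspace as below. This is a minimal illustration under stated assumptions, not the bcachefs code: keys are plain key/value ints, all runs live in one buffer so that comparing addresses is well defined and "lower address" stands in for "older", and every demo_ name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_RUNS 4

struct demo_kv { int key, val; };

/* One sorted run: [k, end). */
struct demo_run { const struct demo_kv *k, *end; };

struct demo_iter {
	size_t used;
	struct demo_run data[DEMO_MAX_RUNS];
};

/*
 * Order runs by the key at their head; break ties by address, so that of two
 * equal keys the one stored earlier in the buffer (the older one) sorts first.
 */
static int demo_run_cmp(const struct demo_run *l, const struct demo_run *r)
{
	if (l->k->key != r->k->key)
		return l->k->key < r->k->key ? -1 : 1;
	return (l->k > r->k) - (l->k < r->k);
}

static void demo_iter_add(struct demo_iter *it, const struct demo_kv *k,
			  const struct demo_kv *end)
{
	assert(it->used < DEMO_MAX_RUNS);
	if (k != end)	/* empty runs are never queued */
		it->data[it->used++] = (struct demo_run) { k, end };
}

/* Move entry 'from' towards the back until the array is ordered again. */
static void demo_iter_sift(struct demo_iter *it, size_t from)
{
	for (size_t i = from;
	     i + 1 < it->used && demo_run_cmp(&it->data[i], &it->data[i + 1]) > 0;
	     i++) {
		struct demo_run tmp = it->data[i];

		it->data[i] = it->data[i + 1];
		it->data[i + 1] = tmp;
	}
}

/* Step the front run; drop it once exhausted, otherwise restore ordering. */
static void demo_iter_advance(struct demo_iter *it)
{
	assert(it->used);
	if (++it->data[0].k == it->data[0].end) {
		it->used--;
		for (size_t i = 0; i < it->used; i++)
			it->data[i] = it->data[i + 1];
	} else {
		demo_iter_sift(it, 0);
	}
}

/* Merge every queued run into dst, keeping only the newest of equal keys. */
static size_t demo_merge_drop_older(struct demo_iter *it, struct demo_kv *dst)
{
	size_t n = 0;

	/* Insertion sort: sift each entry into the already sorted tail. */
	for (size_t i = it->used; i-- > 0;)
		demo_iter_sift(it, i);

	while (it->used) {
		bool older_dup = it->used >= 2 &&
			it->data[0].k->key == it->data[1].k->key;

		if (!older_dup)
			dst[n++] = *it->data[0].k;
		demo_iter_advance(it);
	}
	return n;
}
```

Because ties sort by address, the older copy of a duplicated key always reaches the front first and is the one that gets skipped, which is the same reasoning the comment in should_drop_next_key() relies on.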

@ -1,54 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BKEY_SORT_H
#define _BCACHEFS_BKEY_SORT_H
struct sort_iter {
struct btree *b;
unsigned used;
unsigned size;
struct sort_iter_set {
struct bkey_packed *k, *end;
} data[];
};
static inline void sort_iter_init(struct sort_iter *iter, struct btree *b, unsigned size)
{
iter->b = b;
iter->used = 0;
iter->size = size;
}
struct sort_iter_stack {
struct sort_iter iter;
struct sort_iter_set sets[MAX_BSETS + 1];
};
static inline void sort_iter_stack_init(struct sort_iter_stack *iter, struct btree *b)
{
sort_iter_init(&iter->iter, b, ARRAY_SIZE(iter->sets));
}
static inline void sort_iter_add(struct sort_iter *iter,
struct bkey_packed *k,
struct bkey_packed *end)
{
BUG_ON(iter->used >= iter->size);
if (k != end)
iter->data[iter->used++] = (struct sort_iter_set) { k, end };
}
struct btree_nr_keys
bch2_key_sort_fix_overlapping(struct bch_fs *, struct bset *,
struct sort_iter *);
struct btree_nr_keys
bch2_sort_repack(struct bset *, struct btree *,
struct btree_node_iter *,
struct bkey_format *, bool);
unsigned bch2_sort_keys_keep_unwritten_whiteouts(struct bkey_packed *, struct sort_iter *);
unsigned bch2_sort_keys(struct bkey_packed *, struct sort_iter *);
#endif /* _BCACHEFS_BKEY_SORT_H */

@ -1,241 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BKEY_TYPES_H
#define _BCACHEFS_BKEY_TYPES_H
#include "bcachefs_format.h"
/*
* bkey_i - bkey with inline value
* bkey_s - bkey with split value
* bkey_s_c - bkey with split value, const
*/
#define bkey_p_next(_k) vstruct_next(_k)
static inline struct bkey_i *bkey_next(struct bkey_i *k)
{
return (struct bkey_i *) ((u64 *) k->_data + k->k.u64s);
}
#define bkey_val_u64s(_k) ((_k)->u64s - BKEY_U64s)
static inline size_t bkey_val_bytes(const struct bkey *k)
{
return bkey_val_u64s(k) * sizeof(u64);
}
static inline void set_bkey_val_u64s(struct bkey *k, unsigned val_u64s)
{
unsigned u64s = BKEY_U64s + val_u64s;
BUG_ON(u64s > U8_MAX);
k->u64s = u64s;
}
static inline void set_bkey_val_bytes(struct bkey *k, unsigned bytes)
{
set_bkey_val_u64s(k, DIV_ROUND_UP(bytes, sizeof(u64)));
}
#define bkey_val_end(_k) ((void *) (((u64 *) (_k).v) + bkey_val_u64s((_k).k)))
#define bkey_deleted(_k) ((_k)->type == KEY_TYPE_deleted)
#define bkey_whiteout(_k) \
((_k)->type == KEY_TYPE_deleted || (_k)->type == KEY_TYPE_whiteout)
/* bkey with split value, const */
struct bkey_s_c {
const struct bkey *k;
const struct bch_val *v;
};
/* bkey with split value */
struct bkey_s {
union {
struct {
struct bkey *k;
struct bch_val *v;
};
struct bkey_s_c s_c;
};
};
#define bkey_s_null ((struct bkey_s) { .k = NULL })
#define bkey_s_c_null ((struct bkey_s_c) { .k = NULL })
#define bkey_s_err(err) ((struct bkey_s) { .k = ERR_PTR(err) })
#define bkey_s_c_err(err) ((struct bkey_s_c) { .k = ERR_PTR(err) })
static inline struct bkey_s bkey_to_s(struct bkey *k)
{
return (struct bkey_s) { .k = k, .v = NULL };
}
static inline struct bkey_s_c bkey_to_s_c(const struct bkey *k)
{
return (struct bkey_s_c) { .k = k, .v = NULL };
}
static inline struct bkey_s bkey_i_to_s(struct bkey_i *k)
{
return (struct bkey_s) { .k = &k->k, .v = &k->v };
}
static inline struct bkey_s_c bkey_i_to_s_c(const struct bkey_i *k)
{
return (struct bkey_s_c) { .k = &k->k, .v = &k->v };
}
/*
* For a given type of value (e.g. struct bch_extent), generates the types for
* bkey + bch_extent - inline, split, split const - and also all the conversion
* functions, which also check that the value is of the correct type.
*
* We use anonymous unions for upcasting - e.g. converting from e.g. a
* bkey_i_extent to a bkey_i - since that's always safe, instead of conversion
* functions.
*/
#define x(name, ...) \
struct bkey_i_##name { \
union { \
struct bkey k; \
struct bkey_i k_i; \
}; \
struct bch_##name v; \
}; \
\
struct bkey_s_c_##name { \
union { \
struct { \
const struct bkey *k; \
const struct bch_##name *v; \
}; \
struct bkey_s_c s_c; \
}; \
}; \
\
struct bkey_s_##name { \
union { \
struct { \
struct bkey *k; \
struct bch_##name *v; \
}; \
struct bkey_s_c_##name c; \
struct bkey_s s; \
struct bkey_s_c s_c; \
}; \
}; \
\
static inline struct bkey_i_##name *bkey_i_to_##name(struct bkey_i *k) \
{ \
EBUG_ON(!IS_ERR_OR_NULL(k) && k->k.type != KEY_TYPE_##name); \
return container_of(&k->k, struct bkey_i_##name, k); \
} \
\
static inline const struct bkey_i_##name * \
bkey_i_to_##name##_c(const struct bkey_i *k) \
{ \
EBUG_ON(!IS_ERR_OR_NULL(k) && k->k.type != KEY_TYPE_##name); \
return container_of(&k->k, struct bkey_i_##name, k); \
} \
\
static inline struct bkey_s_##name bkey_s_to_##name(struct bkey_s k) \
{ \
EBUG_ON(!IS_ERR_OR_NULL(k.k) && k.k->type != KEY_TYPE_##name); \
return (struct bkey_s_##name) { \
.k = k.k, \
.v = container_of(k.v, struct bch_##name, v), \
}; \
} \
\
static inline struct bkey_s_c_##name bkey_s_c_to_##name(struct bkey_s_c k)\
{ \
EBUG_ON(!IS_ERR_OR_NULL(k.k) && k.k->type != KEY_TYPE_##name); \
return (struct bkey_s_c_##name) { \
.k = k.k, \
.v = container_of(k.v, struct bch_##name, v), \
}; \
} \
\
static inline struct bkey_s_##name name##_i_to_s(struct bkey_i_##name *k)\
{ \
return (struct bkey_s_##name) { \
.k = &k->k, \
.v = &k->v, \
}; \
} \
\
static inline struct bkey_s_c_##name \
name##_i_to_s_c(const struct bkey_i_##name *k) \
{ \
return (struct bkey_s_c_##name) { \
.k = &k->k, \
.v = &k->v, \
}; \
} \
\
static inline struct bkey_s_##name bkey_i_to_s_##name(struct bkey_i *k) \
{ \
EBUG_ON(!IS_ERR_OR_NULL(k) && k->k.type != KEY_TYPE_##name); \
return (struct bkey_s_##name) { \
.k = &k->k, \
.v = container_of(&k->v, struct bch_##name, v), \
}; \
} \
\
static inline struct bkey_s_c_##name \
bkey_i_to_s_c_##name(const struct bkey_i *k) \
{ \
EBUG_ON(!IS_ERR_OR_NULL(k) && k->k.type != KEY_TYPE_##name); \
return (struct bkey_s_c_##name) { \
.k = &k->k, \
.v = container_of(&k->v, struct bch_##name, v), \
}; \
} \
\
static inline struct bkey_i_##name *bkey_##name##_init(struct bkey_i *_k)\
{ \
struct bkey_i_##name *k = \
container_of(&_k->k, struct bkey_i_##name, k); \
\
bkey_init(&k->k); \
memset(&k->v, 0, sizeof(k->v)); \
k->k.type = KEY_TYPE_##name; \
set_bkey_val_bytes(&k->k, sizeof(k->v)); \
\
return k; \
}
BCH_BKEY_TYPES();
#undef x
enum bch_validate_flags {
BCH_VALIDATE_write = BIT(0),
BCH_VALIDATE_commit = BIT(1),
BCH_VALIDATE_silent = BIT(2),
};
#define BKEY_VALIDATE_CONTEXTS() \
x(unknown) \
x(superblock) \
x(journal) \
x(btree_root) \
x(btree_node) \
x(commit)
struct bkey_validate_context {
enum {
#define x(n) BKEY_VALIDATE_##n,
BKEY_VALIDATE_CONTEXTS()
#undef x
} from:8;
enum bch_validate_flags flags:8;
u8 level;
enum btree_id btree;
bool root:1;
unsigned journal_offset;
u64 journal_seq;
};
#endif /* _BCACHEFS_BKEY_TYPES_H */
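
The x-macro in bkey_types.h above generates, for each value type, a typed key struct plus conversion helpers that check the type tag before downcasting. A much reduced sketch of that pattern follows. It assumes two made-up value types and a two-field key header; every demo_ and DEMO_ name is hypothetical, and only the inline ("bkey_i") flavour is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical value types standing in for struct bch_extent and friends. */
struct demo_val_extent { unsigned long long start, len; };
struct demo_val_inode  { unsigned long long size; };

/* The single list every generated definition is derived from. */
#define DEMO_KEY_TYPES()	\
	x(extent)		\
	x(inode)

enum demo_key_type {
#define x(name) DEMO_KEY_TYPE_##name,
	DEMO_KEY_TYPES()
#undef x
};

/* Generic key header shared by all typed keys. */
struct demo_key { unsigned char u64s, type; };

#define demo_container_of(ptr, type, member) \
	((type *) ((char *) (ptr) - offsetof(type, member)))

/* Per type: the typed struct, a checked downcast, and an initializer. */
#define x(name)								\
struct demo_key_i_##name {						\
	struct demo_key		k;					\
	struct demo_val_##name	v;					\
};									\
									\
static inline struct demo_key_i_##name *				\
demo_key_to_##name(struct demo_key *k)					\
{									\
	/* refuse to reinterpret a key of the wrong type */		\
	assert(k->type == DEMO_KEY_TYPE_##name);			\
	return demo_container_of(k, struct demo_key_i_##name, k);	\
}									\
									\
static inline void demo_key_##name##_init(struct demo_key_i_##name *k)	\
{									\
	memset(k, 0, sizeof(*k));					\
	k->k.type = DEMO_KEY_TYPE_##name;				\
	k->k.u64s = (sizeof(*k) + 7) / 8;				\
}
DEMO_KEY_TYPES()
#undef x
```

Adding a type then means adding one x() line to the list; the enum value, the struct, and the helpers all follow from it. Upcasting needs no helper here because the header is the first member, which plays the role the anonymous unions play in the original.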

File diff suppressed because it is too large

@ -1,536 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BSET_H
#define _BCACHEFS_BSET_H
#include <linux/kernel.h>
#include <linux/types.h>
#include "bcachefs.h"
#include "bkey.h"
#include "bkey_methods.h"
#include "btree_types.h"
#include "util.h" /* for time_stats */
#include "vstructs.h"
/*
* BKEYS:
*
* A bkey contains a key, a size field, a variable number of pointers, and some
* ancillary flag bits.
*
* We use two different functions for validating bkeys, bkey_invalid() and
* bkey_deleted().
*
* The one exception to the rule that ptr_invalid() filters out invalid keys is
* that it also filters out keys of size 0 - these are keys that have been
* completely overwritten. It'd be safe to delete these in memory while leaving
* them on disk, just unnecessary work - so we filter them out when resorting
* instead.
*
* We can't filter out stale keys when we're resorting, because garbage
* collection needs to find them to ensure bucket gens don't wrap around -
* unless we're rewriting the btree node those stale keys still exist on disk.
*
* We also implement functions here for removing some number of sectors from the
* front or the back of a bkey - this is mainly used for fixing overlapping
* extents, by removing the overlapping sectors from the older key.
*
* BSETS:
*
* A bset is an array of bkeys laid out contiguously in memory in sorted order,
* along with a header. A btree node is made up of a number of these, written at
* different times.
*
* There could be many of them on disk, but we never allow there to be more than
* 4 in memory - we lazily resort as needed.
*
* We implement code here for creating and maintaining auxiliary search trees
* (described below) for searching an individual bset, and on top of that we
* implement a btree iterator.
*
* BTREE ITERATOR:
*
* Most of the code in bcache doesn't care about an individual bset - it needs
* to search entire btree nodes and iterate over them in sorted order.
*
* The btree iterator code serves both functions; it iterates through the keys
* in a btree node in sorted order, starting from either keys after a specific
* point (if you pass it a search key) or the start of the btree node.
*
* AUXILIARY SEARCH TREES:
*
* Since keys are variable length, we can't use a binary search on a bset - we
* wouldn't be able to find the start of the next key. But binary searches are
* slow anyways, due to terrible cache behaviour; bcache originally used binary
* searches and that code topped out at under 50k lookups/second.
*
* So we need to construct some sort of lookup table. Since we only insert keys
* into the last (unwritten) set, most of the keys within a given btree node are
* usually in sets that are mostly constant. We use two different types of
* lookup tables to take advantage of this.
*
* Both lookup tables share in common that they don't index every key in the
* set; they index one key every BSET_CACHELINE bytes, and then a linear search
* is used for the rest.
*
* For sets that have been written to disk and are no longer being inserted
* into, we construct a binary search tree in an array - traversing a binary
* search tree in an array gives excellent locality of reference and is very
* fast, since both children of any node are adjacent to each other in memory
* (and their grandchildren, and great grandchildren...) - this means
* prefetching can be used to great effect.
*
* It's quite useful performance wise to keep these nodes small - not just
* because they're more likely to be in L2, but also because we can prefetch
* more nodes on a single cacheline and thus prefetch more iterations in advance
* when traversing this tree.
*
* Nodes in the auxiliary search tree must contain both a key to compare against
* (we don't want to fetch the key from the set, that would defeat the purpose),
* and a pointer to the key. We use a few tricks to compress both of these.
*
* To compress the pointer, we take advantage of the fact that one node in the
* search tree corresponds to precisely BSET_CACHELINE bytes in the set. We have
* a function (to_inorder()) that takes the index of a node in a binary tree and
* returns what its index would be in an inorder traversal, so we only have to
* store the low bits of the offset.
*
* The key is 84 bits (KEY_DEV + key->key, the offset on the device). To
* compress that, we take advantage of the fact that when we're traversing the
* search tree at every iteration we know that both our search key and the key
* we're looking for lie within some range - bounded by our previous
* comparisons. (We special case the start of a search so that this is true even
* at the root of the tree).
*
* So if we know the key we're looking for is between a and b, and a and b
* don't differ above bit 50, we don't need to check anything higher than bit
* 50.
*
* We don't usually need the rest of the bits, either; we only need enough bits
* to partition the key range we're currently checking. Consider key n - the
* key our auxiliary search tree node corresponds to, and key p, the key
* immediately preceding n. The lowest bit we need to store in the auxiliary
* search tree is the highest bit that differs between n and p.
*
* Note that this could be bit 0 - we might sometimes need all 80 bits to do the
* comparison. But we'd really like our nodes in the auxiliary search tree to be
* of fixed size.
*
* The solution is to make them fixed size, and when we're constructing a node
* check if p and n differed in the bits we needed them to. If they didn't, we
* flag that node, and when doing lookups we fall back to comparing against the
* real key. As long as this doesn't happen too often (and it seems to reliably
* happen a bit less than 1% of the time), we win - even on failures, that key
* is then more likely to be in cache than if we were doing binary searches all
* the way, since we're touching so much less memory.
*
* The keys in the auxiliary search tree are stored in (software) floating
* point, with an exponent and a mantissa. The exponent needs to be big enough
* to address all the bits in the original key, but the number of bits in the
* mantissa is somewhat arbitrary; more bits just gets us fewer failures.
*
* We need 7 bits for the exponent and 3 bits for the key's offset (since keys
* are 8 byte aligned); using 22 bits for the mantissa means a node is 4 bytes.
* We need one node per 128 bytes in the btree node, which means the auxiliary
* search trees take up 3% as much memory as the btree itself.
*
* Constructing these auxiliary search trees is moderately expensive, and we
* don't want to be constantly rebuilding the search tree for the last set
* whenever we insert another key into it. For the unwritten set, we use a much
* simpler lookup table - it's just a flat array, so index i in the lookup table
* corresponds to the i-th range of BSET_CACHELINE bytes in the set. Indexing
* within each byte range works the same as with the auxiliary search trees.
*
* These are much easier to keep up to date when we insert a key - we do it
* somewhat lazily; when we shift a key up we usually just increment the pointer
* to it, only when it would overflow do we go to the trouble of finding the
* first key in that range of bytes again.
*/
enum bset_aux_tree_type {
BSET_NO_AUX_TREE,
BSET_RO_AUX_TREE,
BSET_RW_AUX_TREE,
};
#define BSET_TREE_NR_TYPES 3
#define BSET_NO_AUX_TREE_VAL (U16_MAX)
#define BSET_RW_AUX_TREE_VAL (U16_MAX - 1)
static inline enum bset_aux_tree_type bset_aux_tree_type(const struct bset_tree *t)
{
switch (t->extra) {
case BSET_NO_AUX_TREE_VAL:
EBUG_ON(t->size);
return BSET_NO_AUX_TREE;
case BSET_RW_AUX_TREE_VAL:
EBUG_ON(!t->size);
return BSET_RW_AUX_TREE;
default:
EBUG_ON(!t->size);
return BSET_RO_AUX_TREE;
}
}
/*
* BSET_CACHELINE was originally intended to match the hardware cacheline size -
* it used to be 64, but I realized the lookup code would touch slightly less
* memory if it was 128.
*
* It defines the number of bytes (in struct bset) per struct bkey_float in
* the auxiliary search tree - when we're done searching the bset_float tree we
* have this many bytes left that we do a linear search over.
*
* Since (after level 5) every level of the bset_tree is on a new cacheline,
* we're touching one fewer cacheline in the bset tree in exchange for one more
* cacheline in the linear search - but the linear search might stop before it
* gets to the second cacheline.
*/
#define BSET_CACHELINE 256
static inline size_t btree_keys_cachelines(const struct btree *b)
{
return (1U << b->byte_order) / BSET_CACHELINE;
}
static inline size_t btree_aux_data_bytes(const struct btree *b)
{
return btree_keys_cachelines(b) * 8;
}
static inline size_t btree_aux_data_u64s(const struct btree *b)
{
return btree_aux_data_bytes(b) / sizeof(u64);
}
#define for_each_bset(_b, _t) \
for (struct bset_tree *_t = (_b)->set; _t < (_b)->set + (_b)->nsets; _t++)
#define for_each_bset_c(_b, _t) \
for (const struct bset_tree *_t = (_b)->set; _t < (_b)->set + (_b)->nsets; _t++)
#define bset_tree_for_each_key(_b, _t, _k) \
for (_k = btree_bkey_first(_b, _t); \
_k != btree_bkey_last(_b, _t); \
_k = bkey_p_next(_k))
static inline bool bset_has_ro_aux_tree(const struct bset_tree *t)
{
return bset_aux_tree_type(t) == BSET_RO_AUX_TREE;
}
static inline bool bset_has_rw_aux_tree(struct bset_tree *t)
{
return bset_aux_tree_type(t) == BSET_RW_AUX_TREE;
}
static inline void bch2_bset_set_no_aux_tree(struct btree *b,
struct bset_tree *t)
{
BUG_ON(t < b->set);
for (; t < b->set + ARRAY_SIZE(b->set); t++) {
t->size = 0;
t->extra = BSET_NO_AUX_TREE_VAL;
t->aux_data_offset = U16_MAX;
}
}
static inline void btree_node_set_format(struct btree *b,
struct bkey_format f)
{
int len;
b->format = f;
b->nr_key_bits = bkey_format_key_bits(&f);
len = bch2_compile_bkey_format(&b->format, b->aux_data);
BUG_ON(len < 0 || len > U8_MAX);
b->unpack_fn_len = len;
bch2_bset_set_no_aux_tree(b, b->set);
}
static inline struct bset *bset_next_set(struct btree *b,
unsigned block_bytes)
{
struct bset *i = btree_bset_last(b);
EBUG_ON(!is_power_of_2(block_bytes));
return ((void *) i) + round_up(vstruct_bytes(i), block_bytes);
}
void bch2_btree_keys_init(struct btree *);
void bch2_bset_init_first(struct btree *, struct bset *);
void bch2_bset_init_next(struct btree *, struct btree_node_entry *);
void bch2_bset_build_aux_tree(struct btree *, struct bset_tree *, bool);
void bch2_bset_insert(struct btree *, struct bkey_packed *, struct bkey_i *,
unsigned);
void bch2_bset_delete(struct btree *, struct bkey_packed *, unsigned);
/* Bkey utility code */
/* packed or unpacked */
static inline int bkey_cmp_p_or_unp(const struct btree *b,
const struct bkey_packed *l,
const struct bkey_packed *r_packed,
const struct bpos *r)
{
EBUG_ON(r_packed && !bkey_packed(r_packed));
if (unlikely(!bkey_packed(l)))
return bpos_cmp(packed_to_bkey_c(l)->p, *r);
if (likely(r_packed))
return __bch2_bkey_cmp_packed_format_checked(l, r_packed, b);
return __bch2_bkey_cmp_left_packed_format_checked(b, l, r);
}
static inline struct bset_tree *
bch2_bkey_to_bset_inlined(struct btree *b, struct bkey_packed *k)
{
unsigned offset = __btree_node_key_to_offset(b, k);
for_each_bset(b, t)
if (offset <= t->end_offset) {
EBUG_ON(offset < btree_bkey_first_offset(t));
return t;
}
BUG();
}
struct bset_tree *bch2_bkey_to_bset(struct btree *, struct bkey_packed *);
struct bkey_packed *bch2_bkey_prev_filter(struct btree *, struct bset_tree *,
struct bkey_packed *, unsigned);
static inline struct bkey_packed *
bch2_bkey_prev_all(struct btree *b, struct bset_tree *t, struct bkey_packed *k)
{
return bch2_bkey_prev_filter(b, t, k, 0);
}
static inline struct bkey_packed *
bch2_bkey_prev(struct btree *b, struct bset_tree *t, struct bkey_packed *k)
{
return bch2_bkey_prev_filter(b, t, k, 1);
}
/* Btree key iteration */
void bch2_btree_node_iter_push(struct btree_node_iter *, struct btree *,
const struct bkey_packed *,
const struct bkey_packed *);
void bch2_btree_node_iter_init(struct btree_node_iter *, struct btree *,
struct bpos *);
void bch2_btree_node_iter_init_from_start(struct btree_node_iter *,
struct btree *);
struct bkey_packed *bch2_btree_node_iter_bset_pos(struct btree_node_iter *,
struct btree *,
struct bset_tree *);
void bch2_btree_node_iter_sort(struct btree_node_iter *, struct btree *);
void bch2_btree_node_iter_set_drop(struct btree_node_iter *,
struct btree_node_iter_set *);
void bch2_btree_node_iter_advance(struct btree_node_iter *, struct btree *);
#define btree_node_iter_for_each(_iter, _set) \
for (_set = (_iter)->data; \
_set < (_iter)->data + ARRAY_SIZE((_iter)->data) && \
(_set)->k != (_set)->end; \
_set++)
static inline bool __btree_node_iter_set_end(struct btree_node_iter *iter,
unsigned i)
{
return iter->data[i].k == iter->data[i].end;
}
static inline bool bch2_btree_node_iter_end(struct btree_node_iter *iter)
{
return __btree_node_iter_set_end(iter, 0);
}
/*
* When keys compare equal, deleted keys compare first:
*
* XXX: only need to compare pointers for keys that are both within a
* btree_node_iterator - we need to break ties for prev() to work correctly
*/
static inline int bkey_iter_cmp(const struct btree *b,
const struct bkey_packed *l,
const struct bkey_packed *r)
{
return bch2_bkey_cmp_packed(b, l, r)
?: (int) bkey_deleted(r) - (int) bkey_deleted(l)
?: cmp_int(l, r);
}
static inline int btree_node_iter_cmp(const struct btree *b,
struct btree_node_iter_set l,
struct btree_node_iter_set r)
{
return bkey_iter_cmp(b,
__btree_node_offset_to_key(b, l.k),
__btree_node_offset_to_key(b, r.k));
}
/* These assume r (the search key) is not a deleted key: */
static inline int bkey_iter_pos_cmp(const struct btree *b,
const struct bkey_packed *l,
const struct bpos *r)
{
return bkey_cmp_left_packed(b, l, r)
?: -((int) bkey_deleted(l));
}
static inline int bkey_iter_cmp_p_or_unp(const struct btree *b,
const struct bkey_packed *l,
const struct bkey_packed *r_packed,
const struct bpos *r)
{
return bkey_cmp_p_or_unp(b, l, r_packed, r)
?: -((int) bkey_deleted(l));
}
static inline struct bkey_packed *
__bch2_btree_node_iter_peek_all(struct btree_node_iter *iter,
struct btree *b)
{
return __btree_node_offset_to_key(b, iter->data->k);
}
static inline struct bkey_packed *
bch2_btree_node_iter_peek_all(struct btree_node_iter *iter, struct btree *b)
{
return !bch2_btree_node_iter_end(iter)
? __btree_node_offset_to_key(b, iter->data->k)
: NULL;
}
static inline struct bkey_packed *
bch2_btree_node_iter_peek(struct btree_node_iter *iter, struct btree *b)
{
struct bkey_packed *k;
while ((k = bch2_btree_node_iter_peek_all(iter, b)) &&
bkey_deleted(k))
bch2_btree_node_iter_advance(iter, b);
return k;
}
static inline struct bkey_packed *
bch2_btree_node_iter_next_all(struct btree_node_iter *iter, struct btree *b)
{
struct bkey_packed *ret = bch2_btree_node_iter_peek_all(iter, b);
if (ret)
bch2_btree_node_iter_advance(iter, b);
return ret;
}
struct bkey_packed *bch2_btree_node_iter_prev_all(struct btree_node_iter *,
struct btree *);
struct bkey_packed *bch2_btree_node_iter_prev(struct btree_node_iter *,
struct btree *);
struct bkey_s_c bch2_btree_node_iter_peek_unpack(struct btree_node_iter *,
struct btree *,
struct bkey *);
#define for_each_btree_node_key(b, k, iter) \
for (bch2_btree_node_iter_init_from_start((iter), (b)); \
(k = bch2_btree_node_iter_peek((iter), (b))); \
bch2_btree_node_iter_advance(iter, b))
#define for_each_btree_node_key_unpack(b, k, iter, unpacked) \
for (bch2_btree_node_iter_init_from_start((iter), (b)); \
(k = bch2_btree_node_iter_peek_unpack((iter), (b), (unpacked))).k;\
bch2_btree_node_iter_advance(iter, b))
/* Accounting: */
struct btree_nr_keys bch2_btree_node_count_keys(struct btree *);
static inline void btree_keys_account_key(struct btree_nr_keys *n,
unsigned bset,
struct bkey_packed *k,
int sign)
{
n->live_u64s += k->u64s * sign;
n->bset_u64s[bset] += k->u64s * sign;
if (bkey_packed(k))
n->packed_keys += sign;
else
n->unpacked_keys += sign;
}
static inline void btree_keys_account_val_delta(struct btree *b,
struct bkey_packed *k,
int delta)
{
struct bset_tree *t = bch2_bkey_to_bset(b, k);
b->nr.live_u64s += delta;
b->nr.bset_u64s[t - b->set] += delta;
}
#define btree_keys_account_key_add(_nr, _bset_idx, _k) \
btree_keys_account_key(_nr, _bset_idx, _k, 1)
#define btree_keys_account_key_drop(_nr, _bset_idx, _k) \
btree_keys_account_key(_nr, _bset_idx, _k, -1)
#define btree_account_key_add(_b, _k) \
btree_keys_account_key(&(_b)->nr, \
bch2_bkey_to_bset(_b, _k) - (_b)->set, _k, 1)
#define btree_account_key_drop(_b, _k) \
btree_keys_account_key(&(_b)->nr, \
bch2_bkey_to_bset(_b, _k) - (_b)->set, _k, -1)
struct bset_stats {
struct {
size_t nr, bytes;
} sets[BSET_TREE_NR_TYPES];
size_t floats;
size_t failed;
};
void bch2_btree_keys_stats(const struct btree *, struct bset_stats *);
void bch2_bfloat_to_text(struct printbuf *, struct btree *,
struct bkey_packed *);
/* Debug stuff */
void bch2_dump_bset(struct bch_fs *, struct btree *, struct bset *, unsigned);
void bch2_dump_btree_node(struct bch_fs *, struct btree *);
void bch2_dump_btree_node_iter(struct btree *, struct btree_node_iter *);
void __bch2_verify_btree_nr_keys(struct btree *);
void __bch2_btree_node_iter_verify(struct btree_node_iter *, struct btree *);
static inline void bch2_btree_node_iter_verify(struct btree_node_iter *iter,
struct btree *b)
{
if (static_branch_unlikely(&bch2_debug_check_bset_lookups))
__bch2_btree_node_iter_verify(iter, b);
}
static inline void bch2_verify_btree_nr_keys(struct btree *b)
{
if (static_branch_unlikely(&bch2_debug_check_btree_accounting))
__bch2_verify_btree_nr_keys(b);
}
#endif /* _BCACHEFS_BSET_H */
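
The read-only auxiliary search tree described in the header comment of bset.h above (a binary search tree stored in an array, indexing one key per BSET_CACHELINE bytes, with a linear scan for the remainder) can be sketched in userspace as follows. This is a simplified illustration, not the bcachefs implementation: keys are fixed-size ints, each node stores the full key and its position instead of a compressed bkey_float, the stride is counted in keys rather than bytes, and prefetching is left out. All demo_ and DEMO_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_STRIDE	4	/* keys per indexed range; stands in for BSET_CACHELINE */
#define DEMO_MAX_NODES	64

struct demo_node { int key; size_t pos; };

struct demo_aux_tree {
	size_t nr;		/* nodes in use, stored at node[1..nr] */
	struct demo_node node[DEMO_MAX_NODES + 1];
};

/*
 * Fill the subtree rooted at array index i by an in-order walk, so that the
 * children of node i live at 2i and 2i + 1. 'next' is the next sample to place.
 */
static size_t demo_fill(struct demo_aux_tree *t, const int *keys,
			size_t i, size_t next)
{
	if (i <= t->nr) {
		next = demo_fill(t, keys, 2 * i, next);
		t->node[i] = (struct demo_node) {
			.key = keys[next * DEMO_STRIDE],
			.pos = next * DEMO_STRIDE,
		};
		next = demo_fill(t, keys, 2 * i + 1, next + 1);
	}
	return next;
}

/* Index the first key of every DEMO_STRIDE-sized range of a sorted array. */
static void demo_aux_tree_build(struct demo_aux_tree *t, const int *keys,
				size_t nr_keys)
{
	t->nr = (nr_keys + DEMO_STRIDE - 1) / DEMO_STRIDE;
	assert(t->nr <= DEMO_MAX_NODES);
	demo_fill(t, keys, 1, 0);
}

/* Return the index of the first key >= search, or nr_keys if there is none. */
static size_t demo_search(const struct demo_aux_tree *t, const int *keys,
			  size_t nr_keys, int search)
{
	size_t i = 1, start = 0;

	/* Tree walk: remember the last indexed key that is < search. */
	while (i <= t->nr) {
		if (t->node[i].key < search) {
			start = t->node[i].pos;
			i = 2 * i + 1;
		} else {
			i = 2 * i;
		}
	}

	/* Linear scan within (and, at most, just past) that range. */
	while (start < nr_keys && keys[start] < search)
		start++;
	return start;
}
```

Since a node's children and grandchildren sit at predictable, adjacent array slots, a real implementation can prefetch several levels ahead, which is the locality argument the comment makes; shrinking each node (the bkey_float trick) packs more of those levels into one cacheline.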

File diff suppressed because it is too large

@ -1,157 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_CACHE_H
#define _BCACHEFS_BTREE_CACHE_H
#include "bcachefs.h"
#include "btree_types.h"
#include "bkey_methods.h"
extern const char * const bch2_btree_node_flags[];
struct btree_iter;
void bch2_recalc_btree_reserve(struct bch_fs *);
void bch2_btree_node_to_freelist(struct bch_fs *, struct btree *);
void __bch2_btree_node_hash_remove(struct btree_cache *, struct btree *);
void bch2_btree_node_hash_remove(struct btree_cache *, struct btree *);
int __bch2_btree_node_hash_insert(struct btree_cache *, struct btree *);
int bch2_btree_node_hash_insert(struct btree_cache *, struct btree *,
unsigned, enum btree_id);
void bch2_node_pin(struct bch_fs *, struct btree *);
void bch2_btree_cache_unpin(struct bch_fs *);
void bch2_btree_node_update_key_early(struct btree_trans *, enum btree_id, unsigned,
struct bkey_s_c, struct bkey_i *);
void bch2_btree_cache_cannibalize_unlock(struct btree_trans *);
int bch2_btree_cache_cannibalize_lock(struct btree_trans *, struct closure *);
void __btree_node_data_free(struct btree *);
struct btree *__bch2_btree_node_mem_alloc(struct bch_fs *);
struct btree *bch2_btree_node_mem_alloc(struct btree_trans *, bool);
struct btree *bch2_btree_node_get(struct btree_trans *, struct btree_path *,
const struct bkey_i *, unsigned,
enum six_lock_type, unsigned long);
struct btree *bch2_btree_node_get_noiter(struct btree_trans *, const struct bkey_i *,
enum btree_id, unsigned, bool);
int bch2_btree_node_prefetch(struct btree_trans *, struct btree_path *,
const struct bkey_i *, enum btree_id, unsigned);
void bch2_btree_node_evict(struct btree_trans *, const struct bkey_i *);
void bch2_fs_btree_cache_exit(struct bch_fs *);
int bch2_fs_btree_cache_init(struct bch_fs *);
void bch2_fs_btree_cache_init_early(struct btree_cache *);
static inline u64 btree_ptr_hash_val(const struct bkey_i *k)
{
switch (k->k.type) {
case KEY_TYPE_btree_ptr:
return *((u64 *) bkey_i_to_btree_ptr_c(k)->v.start);
case KEY_TYPE_btree_ptr_v2:
/*
* The cast/deref is only necessary to avoid sparse endianness
* warnings:
*/
return *((u64 *) &bkey_i_to_btree_ptr_v2_c(k)->v.seq);
default:
return 0;
}
}
static inline struct btree *btree_node_mem_ptr(const struct bkey_i *k)
{
return k->k.type == KEY_TYPE_btree_ptr_v2
? (void *)(unsigned long)bkey_i_to_btree_ptr_v2_c(k)->v.mem_ptr
: NULL;
}
/* is btree node in hash table? */
static inline bool btree_node_hashed(struct btree *b)
{
return b->hash_val != 0;
}
#define for_each_cached_btree(_b, _c, _tbl, _iter, _pos) \
for ((_tbl) = rht_dereference_rcu((_c)->btree_cache.table.tbl, \
&(_c)->btree_cache.table), \
_iter = 0; _iter < (_tbl)->size; _iter++) \
rht_for_each_entry_rcu((_b), (_pos), _tbl, _iter, hash)
static inline size_t btree_buf_bytes(const struct btree *b)
{
return 1UL << b->byte_order;
}
static inline size_t btree_buf_max_u64s(const struct btree *b)
{
return (btree_buf_bytes(b) - sizeof(struct btree_node)) / sizeof(u64);
}
static inline size_t btree_max_u64s(const struct bch_fs *c)
{
return (c->opts.btree_node_size - sizeof(struct btree_node)) / sizeof(u64);
}
static inline size_t btree_sectors(const struct bch_fs *c)
{
return c->opts.btree_node_size >> SECTOR_SHIFT;
}
static inline unsigned btree_blocks(const struct bch_fs *c)
{
return btree_sectors(c) >> c->block_bits;
}
#define BTREE_SPLIT_THRESHOLD(c) (btree_max_u64s(c) * 2 / 3)
#define BTREE_FOREGROUND_MERGE_THRESHOLD(c) (btree_max_u64s(c) * 1 / 3)
#define BTREE_FOREGROUND_MERGE_HYSTERESIS(c) \
(BTREE_FOREGROUND_MERGE_THRESHOLD(c) + \
(BTREE_FOREGROUND_MERGE_THRESHOLD(c) >> 2))
static inline unsigned btree_id_nr_alive(struct bch_fs *c)
{
return BTREE_ID_NR + c->btree_roots_extra.nr;
}
static inline struct btree_root *bch2_btree_id_root(struct bch_fs *c, unsigned id)
{
if (likely(id < BTREE_ID_NR)) {
return &c->btree_roots_known[id];
} else {
unsigned idx = id - BTREE_ID_NR;
/* This can happen when we're called from btree_node_scan */
if (idx >= c->btree_roots_extra.nr)
return NULL;
return &c->btree_roots_extra.data[idx];
}
}
static inline struct btree *btree_node_root(struct bch_fs *c, struct btree *b)
{
struct btree_root *r = bch2_btree_id_root(c, b->c.btree_id);
return r ? r->b : NULL;
}
const char *bch2_btree_id_str(enum btree_id); /* avoid */
void bch2_btree_id_to_text(struct printbuf *, enum btree_id);
void bch2_btree_id_level_to_text(struct printbuf *, enum btree_id, unsigned);
void __bch2_btree_pos_to_text(struct printbuf *, struct bch_fs *,
enum btree_id, unsigned, struct bkey_s_c);
void bch2_btree_pos_to_text(struct printbuf *, struct bch_fs *, const struct btree *);
void bch2_btree_node_to_text(struct printbuf *, struct bch_fs *, const struct btree *);
void bch2_btree_cache_to_text(struct printbuf *, const struct btree_cache *);
#endif /* _BCACHEFS_BTREE_CACHE_H */
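
The size helpers and threshold macros in btree_cache.h above are plain integer arithmetic on the node size. The sketch below reproduces that arithmetic with the header size passed in as a parameter, since sizeof(struct btree_node) is not visible in this diff; the 128-byte header used in the usage note is an arbitrary stand-in, and the demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Usable u64s in a node: node size minus header, in 8-byte units. */
static size_t demo_max_u64s(size_t node_bytes, size_t header_bytes)
{
	return (node_bytes - header_bytes) / 8;
}

/* Mirrors BTREE_SPLIT_THRESHOLD: two thirds of the usable space. */
static size_t demo_split_threshold(size_t max_u64s)
{
	return max_u64s * 2 / 3;
}

/* Mirrors BTREE_FOREGROUND_MERGE_THRESHOLD: one third of the usable space. */
static size_t demo_merge_threshold(size_t max_u64s)
{
	return max_u64s * 1 / 3;
}

/* Mirrors BTREE_FOREGROUND_MERGE_HYSTERESIS: merge threshold plus a quarter. */
static size_t demo_merge_hysteresis(size_t max_u64s)
{
	size_t t = demo_merge_threshold(max_u64s);

	return t + (t >> 2);
}
```

For a 256 KiB node with the assumed 128-byte header this gives 32752 usable u64s, a split threshold of 21834, a merge threshold of 10917 and a hysteresis value of 13646. The gap between the merge threshold and the hysteresis value presumably keeps a node from flapping between merging and splitting, though how the callers use these values is not shown in this part of the diff.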

File diff suppressed because it is too large

@ -1,88 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_GC_H
#define _BCACHEFS_BTREE_GC_H
#include "bkey.h"
#include "btree_gc_types.h"
#include "btree_types.h"
int bch2_check_topology(struct bch_fs *);
int bch2_check_allocations(struct bch_fs *);
/*
* For concurrent mark and sweep (with other index updates), we define a total
* ordering of _all_ references GC walks:
*
* Note that some references will have the same GC position as others - e.g.
* everything within the same btree node; in those cases we're relying on
* whatever locking exists for where those references live, i.e. the write lock
* on a btree node.
*
* That locking is also required to ensure GC doesn't pass the updater in
* between the updater adding/removing the reference and updating the GC marks;
* without that, we would at best double count sometimes.
*
* That part is important - whenever calling bch2_mark_pointers(), a lock _must_
* be held that prevents GC from passing the position the updater is at.
*
* (What about the start of gc, when we're clearing all the marks? GC clears the
* mark with the gc pos seqlock held, and bch_mark_bucket checks against the gc
* position inside its cmpxchg loop, so this works out).
*/
/* Position of (the start of) a gc phase: */
static inline struct gc_pos gc_phase(enum gc_phase phase)
{
return (struct gc_pos) { .phase = phase, };
}
static inline struct gc_pos gc_pos_btree(enum btree_id btree, unsigned level,
struct bpos pos)
{
return (struct gc_pos) {
.phase = GC_PHASE_btree,
.btree = btree,
.level = level,
.pos = pos,
};
}
static inline int gc_btree_order(enum btree_id btree)
{
if (btree == BTREE_ID_alloc)
return -2;
if (btree == BTREE_ID_stripes)
return -1;
return btree;
}
static inline int gc_pos_cmp(struct gc_pos l, struct gc_pos r)
{
return cmp_int(l.phase, r.phase) ?:
cmp_int(gc_btree_order(l.btree),
gc_btree_order(r.btree)) ?:
cmp_int(l.level, r.level) ?:
bpos_cmp(l.pos, r.pos);
}
static inline bool gc_visited(struct bch_fs *c, struct gc_pos pos)
{
unsigned seq;
bool ret;
do {
seq = read_seqcount_begin(&c->gc_pos_lock);
ret = gc_pos_cmp(pos, c->gc_pos) <= 0;
} while (read_seqcount_retry(&c->gc_pos_lock, seq));
return ret;
}
void bch2_gc_pos_to_text(struct printbuf *, struct gc_pos *);
int bch2_gc_gens(struct bch_fs *);
void bch2_gc_gens_async(struct bch_fs *);
void bch2_fs_btree_gc_init_early(struct bch_fs *);
#endif /* _BCACHEFS_BTREE_GC_H */
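
The ordering that `gc_pos_cmp()` and `gc_visited()` rely on is a lexicographic comparison: phase first, then a remapped btree order, then level, then position. The sketch below reproduces that shape in standalone C under simplifying assumptions: a single integer `offset` replaces `struct bpos`, the btree ids are hypothetical values, and the seqcount retry loop is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical btree ids; the real values come from enum btree_id. */
enum { SK_BTREE_extents, SK_BTREE_inodes, SK_BTREE_alloc, SK_BTREE_stripes };

/* Simplified position: one integer offset stands in for struct bpos. */
struct pos_sketch {
	int			phase;
	int			btree;
	unsigned		level;
	unsigned long long	offset;
};

static int cmp_ll(long long l, long long r)
{
	return (l > r) - (l < r);
}

static int cmp_ull(unsigned long long l, unsigned long long r)
{
	return (l > r) - (l < r);
}

/* Two btrees are forced to the front of the order, as in gc_btree_order(). */
static int sk_btree_order(int btree)
{
	if (btree == SK_BTREE_alloc)
		return -2;
	if (btree == SK_BTREE_stripes)
		return -1;
	return btree;
}

/* Lexicographic compare: phase, btree order, level, offset. */
static int pos_sketch_cmp(struct pos_sketch l, struct pos_sketch r)
{
	int c = cmp_ll(l.phase, r.phase);

	if (!c)
		c = cmp_ll(sk_btree_order(l.btree), sk_btree_order(r.btree));
	if (!c)
		c = cmp_ull(l.level, r.level);
	if (!c)
		c = cmp_ull(l.offset, r.offset);
	return c;
}

/* A reference counts as visited once the GC position is at or past it. */
static bool visited_sketch(struct pos_sketch ref, struct pos_sketch gc)
{
	return pos_sketch_cmp(ref, gc) <= 0;
}

void gc_order_example(void)
{
	struct pos_sketch gc = { 1, SK_BTREE_inodes, 0, 100 };

	assert(visited_sketch((struct pos_sketch) { 1, SK_BTREE_alloc, 0, 999 }, gc));
	assert(visited_sketch((struct pos_sketch) { 1, SK_BTREE_inodes, 0, 100 }, gc));
	assert(!visited_sketch((struct pos_sketch) { 1, SK_BTREE_inodes, 0, 101 }, gc));
	assert(!visited_sketch((struct pos_sketch) { 1, SK_BTREE_inodes, 1, 0 }, gc));
}
```

The explicit `if (!c)` chain is a portable spelling of the GNU `?:` chain used in the header.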

@ -1,34 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_GC_TYPES_H
#define _BCACHEFS_BTREE_GC_TYPES_H
#include <linux/generic-radix-tree.h>
#define GC_PHASES() \
x(not_running) \
x(start) \
x(sb) \
x(btree)
enum gc_phase {
#define x(n) GC_PHASE_##n,
GC_PHASES()
#undef x
};
struct gc_pos {
enum gc_phase phase:8;
enum btree_id btree:8;
u16 level;
struct bpos pos;
};
struct reflink_gc {
u64 offset;
u32 size;
u32 refcount;
};
typedef GENRADIX(struct reflink_gc) reflink_gc_table;
#endif /* _BCACHEFS_BTREE_GC_TYPES_H */
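
`GC_PHASES()` is an x-macro: the list is written once and expanded with different definitions of `x()`. A small standalone sketch of the pattern, using the same four entries under a hypothetical `SKETCH_` prefix. The string table is an addition for illustration; this header only generates the enum.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical copy of the list; the header's own macro is GC_PHASES(). */
#define SKETCH_PHASES()		\
	x(not_running)		\
	x(start)		\
	x(sb)			\
	x(btree)

/* First expansion: one enumerator per entry. */
enum sketch_phase {
#define x(n)	SKETCH_PHASE_##n,
	SKETCH_PHASES()
#undef x
	SKETCH_PHASE_NR,
};

/* Second expansion (illustrative): a matching table of names. */
static const char * const sketch_phase_names[] = {
#define x(n)	#n,
	SKETCH_PHASES()
#undef x
};

void xmacro_example(void)
{
	assert(SKETCH_PHASE_not_running == 0);
	assert(SKETCH_PHASE_btree == 3);
	assert(SKETCH_PHASE_NR == 4);
	assert(!strcmp(sketch_phase_names[SKETCH_PHASE_sb], "sb"));
}
```

Because both expansions come from one list, adding an entry keeps the enum and the table in step.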

File diff suppressed because it is too large

@ -1,239 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_IO_H
#define _BCACHEFS_BTREE_IO_H
#include "bkey_methods.h"
#include "bset.h"
#include "btree_locking.h"
#include "checksum.h"
#include "extents.h"
#include "io_write_types.h"
struct bch_fs;
struct btree_write;
struct btree;
struct btree_iter;
struct btree_node_read_all;
static inline void set_btree_node_dirty_acct(struct bch_fs *c, struct btree *b)
{
if (!test_and_set_bit(BTREE_NODE_dirty, &b->flags))
atomic_long_inc(&c->btree_cache.nr_dirty);
}
static inline void clear_btree_node_dirty_acct(struct bch_fs *c, struct btree *b)
{
if (test_and_clear_bit(BTREE_NODE_dirty, &b->flags))
atomic_long_dec(&c->btree_cache.nr_dirty);
}
static inline unsigned btree_ptr_sectors_written(struct bkey_s_c k)
{
return k.k->type == KEY_TYPE_btree_ptr_v2
? le16_to_cpu(bkey_s_c_to_btree_ptr_v2(k).v->sectors_written)
: 0;
}
struct btree_read_bio {
struct bch_fs *c;
struct btree *b;
struct btree_node_read_all *ra;
u64 start_time;
unsigned have_ioref:1;
unsigned idx:7;
#ifdef CONFIG_BCACHEFS_ASYNC_OBJECT_LISTS
unsigned list_idx;
#endif
struct extent_ptr_decoded pick;
struct work_struct work;
struct bio bio;
};
struct btree_write_bio {
struct work_struct work;
__BKEY_PADDED(key, BKEY_BTREE_PTR_VAL_U64s_MAX);
void *data;
unsigned data_bytes;
unsigned sector_offset;
u64 start_time;
#ifdef CONFIG_BCACHEFS_ASYNC_OBJECT_LISTS
unsigned list_idx;
#endif
struct bch_write_bio wbio;
};
void bch2_btree_node_io_unlock(struct btree *);
void bch2_btree_node_io_lock(struct btree *);
void __bch2_btree_node_wait_on_read(struct btree *);
void __bch2_btree_node_wait_on_write(struct btree *);
void bch2_btree_node_wait_on_read(struct btree *);
void bch2_btree_node_wait_on_write(struct btree *);
enum compact_mode {
COMPACT_LAZY,
COMPACT_ALL,
};
bool bch2_compact_whiteouts(struct bch_fs *, struct btree *,
enum compact_mode);
static inline bool should_compact_bset_lazy(struct btree *b,
struct bset_tree *t)
{
unsigned total_u64s = bset_u64s(t);
unsigned dead_u64s = bset_dead_u64s(b, t);
return dead_u64s > 64 && dead_u64s * 3 > total_u64s;
}
static inline bool bch2_maybe_compact_whiteouts(struct bch_fs *c, struct btree *b)
{
for_each_bset(b, t)
if (should_compact_bset_lazy(b, t))
return bch2_compact_whiteouts(c, b, COMPACT_LAZY);
return false;
}
static inline struct nonce btree_nonce(struct bset *i, unsigned offset)
{
return (struct nonce) {{
[0] = cpu_to_le32(offset),
[1] = ((__le32 *) &i->seq)[0],
[2] = ((__le32 *) &i->seq)[1],
[3] = ((__le32 *) &i->journal_seq)[0]^BCH_NONCE_BTREE,
}};
}
static inline int bset_encrypt(struct bch_fs *c, struct bset *i, unsigned offset)
{
struct nonce nonce = btree_nonce(i, offset);
int ret;
if (!offset) {
struct btree_node *bn = container_of(i, struct btree_node, keys);
unsigned bytes = (void *) &bn->keys - (void *) &bn->flags;
ret = bch2_encrypt(c, BSET_CSUM_TYPE(i), nonce,
&bn->flags, bytes);
if (ret)
return ret;
nonce = nonce_add(nonce, round_up(bytes, CHACHA_BLOCK_SIZE));
}
return bch2_encrypt(c, BSET_CSUM_TYPE(i), nonce, i->_data,
vstruct_end(i) - (void *) i->_data);
}
void bch2_btree_sort_into(struct bch_fs *, struct btree *, struct btree *);
void bch2_btree_node_drop_keys_outside_node(struct btree *);
void bch2_btree_build_aux_trees(struct btree *);
void bch2_btree_init_next(struct btree_trans *, struct btree *);
int bch2_btree_node_read_done(struct bch_fs *, struct bch_dev *,
struct btree *,
struct bch_io_failures *,
struct printbuf *);
void bch2_btree_node_read(struct btree_trans *, struct btree *, bool);
int bch2_btree_root_read(struct bch_fs *, enum btree_id,
const struct bkey_i *, unsigned);
void bch2_btree_read_bio_to_text(struct printbuf *, struct btree_read_bio *);
int bch2_btree_node_scrub(struct btree_trans *, enum btree_id, unsigned,
struct bkey_s_c, unsigned);
bool bch2_btree_post_write_cleanup(struct bch_fs *, struct btree *);
enum btree_write_flags {
__BTREE_WRITE_ONLY_IF_NEED = BTREE_WRITE_TYPE_BITS,
__BTREE_WRITE_ALREADY_STARTED,
};
#define BTREE_WRITE_ONLY_IF_NEED BIT(__BTREE_WRITE_ONLY_IF_NEED)
#define BTREE_WRITE_ALREADY_STARTED BIT(__BTREE_WRITE_ALREADY_STARTED)
void __bch2_btree_node_write(struct bch_fs *, struct btree *, unsigned);
void bch2_btree_node_write(struct bch_fs *, struct btree *,
enum six_lock_type, unsigned);
void bch2_btree_node_write_trans(struct btree_trans *, struct btree *,
enum six_lock_type, unsigned);
static inline void btree_node_write_if_need(struct btree_trans *trans, struct btree *b,
enum six_lock_type lock_held)
{
bch2_btree_node_write_trans(trans, b, lock_held, BTREE_WRITE_ONLY_IF_NEED);
}
bool bch2_btree_flush_all_reads(struct bch_fs *);
bool bch2_btree_flush_all_writes(struct bch_fs *);
static inline void compat_bformat(unsigned level, enum btree_id btree_id,
unsigned version, unsigned big_endian,
int write, struct bkey_format *f)
{
if (version < bcachefs_metadata_version_inode_btree_change &&
btree_id == BTREE_ID_inodes) {
swap(f->bits_per_field[BKEY_FIELD_INODE],
f->bits_per_field[BKEY_FIELD_OFFSET]);
swap(f->field_offset[BKEY_FIELD_INODE],
f->field_offset[BKEY_FIELD_OFFSET]);
}
if (version < bcachefs_metadata_version_snapshot &&
(level || btree_type_has_snapshots(btree_id))) {
u64 max_packed =
~(~0ULL << f->bits_per_field[BKEY_FIELD_SNAPSHOT]);
f->field_offset[BKEY_FIELD_SNAPSHOT] = write
? 0
: cpu_to_le64(U32_MAX - max_packed);
}
}
static inline void compat_bpos(unsigned level, enum btree_id btree_id,
unsigned version, unsigned big_endian,
int write, struct bpos *p)
{
if (big_endian != CPU_BIG_ENDIAN)
bch2_bpos_swab(p);
if (version < bcachefs_metadata_version_inode_btree_change &&
btree_id == BTREE_ID_inodes)
swap(p->inode, p->offset);
}
static inline void compat_btree_node(unsigned level, enum btree_id btree_id,
unsigned version, unsigned big_endian,
int write,
struct btree_node *bn)
{
if (version < bcachefs_metadata_version_inode_btree_change &&
btree_id_is_extents(btree_id) &&
!bpos_eq(bn->min_key, POS_MIN) &&
write)
bn->min_key = bpos_nosnap_predecessor(bn->min_key);
if (version < bcachefs_metadata_version_snapshot &&
write)
bn->max_key.snapshot = 0;
compat_bpos(level, btree_id, version, big_endian, write, &bn->min_key);
compat_bpos(level, btree_id, version, big_endian, write, &bn->max_key);
if (version < bcachefs_metadata_version_snapshot &&
!write)
bn->max_key.snapshot = U32_MAX;
if (version < bcachefs_metadata_version_inode_btree_change &&
btree_id_is_extents(btree_id) &&
!bpos_eq(bn->min_key, POS_MIN) &&
!write)
bn->min_key = bpos_nosnap_successor(bn->min_key);
}
void bch2_btree_write_stats_to_text(struct printbuf *, struct bch_fs *);
#endif /* _BCACHEFS_BTREE_IO_H */
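
The lazy compaction test in `should_compact_bset_lazy()` combines an absolute floor with a ratio: more than 64 dead u64s, and dead u64s making up more than a third of the bset. A minimal standalone sketch, assuming the two counts are passed in directly instead of being read from a bset; the function name is illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Same condition as should_compact_bset_lazy(), with the counts as arguments. */
static bool should_compact_lazy_sketch(unsigned total_u64s, unsigned dead_u64s)
{
	return dead_u64s > 64 && dead_u64s * 3 > total_u64s;
}

void compact_example(void)
{
	assert(!should_compact_lazy_sketch(100, 64));	/* 64 is not above the floor */
	assert(should_compact_lazy_sketch(150, 65));	/* 195 > 150 */
	assert(!should_compact_lazy_sketch(300, 90));	/* 270 is not > 300 */
}
```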

File diff suppressed because it is too large

File diff suppressed because it is too large

@ -1,830 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
#include "bcachefs.h"
#include "bkey_buf.h"
#include "bset.h"
#include "btree_cache.h"
#include "btree_journal_iter.h"
#include "journal_io.h"
#include <linux/sort.h>
/*
* For managing keys we read from the journal: until journal replay has
* finished, normal btree lookups need to be able to find and return keys from
* the journal where they overwrite what's in the btree, so we have a special
* iterator and operations for the regular btree iter code to use:
*/
/* Convert an array position (which must not lie inside the gap) to a logical index: */
static inline size_t pos_to_idx(struct journal_keys *keys, size_t pos)
{
size_t gap_size = keys->size - keys->nr;
BUG_ON(pos >= keys->gap && pos < keys->gap + gap_size);
if (pos >= keys->gap)
pos -= gap_size;
return pos;
}
/* Convert a logical index to an array position, skipping over the gap: */
static inline size_t idx_to_pos(struct journal_keys *keys, size_t idx)
{
size_t gap_size = keys->size - keys->nr;
if (idx >= keys->gap)
idx += gap_size;
return idx;
}
static inline struct journal_key *idx_to_key(struct journal_keys *keys, size_t idx)
{
return keys->data + idx_to_pos(keys, idx);
}
static size_t __bch2_journal_key_search(struct journal_keys *keys,
enum btree_id id, unsigned level,
struct bpos pos)
{
size_t l = 0, r = keys->nr, m;
while (l < r) {
m = l + ((r - l) >> 1);
if (__journal_key_cmp(id, level, pos, idx_to_key(keys, m)) > 0)
l = m + 1;
else
r = m;
}
BUG_ON(l < keys->nr &&
__journal_key_cmp(id, level, pos, idx_to_key(keys, l)) > 0);
BUG_ON(l &&
__journal_key_cmp(id, level, pos, idx_to_key(keys, l - 1)) <= 0);
return l;
}
static size_t bch2_journal_key_search(struct journal_keys *keys,
enum btree_id id, unsigned level,
struct bpos pos)
{
return idx_to_pos(keys, __bch2_journal_key_search(keys, id, level, pos));
}
/* Returns first non-overwritten key >= search key: */
struct bkey_i *bch2_journal_keys_peek_max(struct bch_fs *c, enum btree_id btree_id,
unsigned level, struct bpos pos,
struct bpos end_pos, size_t *idx)
{
struct journal_keys *keys = &c->journal_keys;
unsigned iters = 0;
struct journal_key *k;
BUG_ON(*idx > keys->nr);
search:
if (!*idx)
*idx = __bch2_journal_key_search(keys, btree_id, level, pos);
while (*idx &&
__journal_key_cmp(btree_id, level, end_pos, idx_to_key(keys, *idx - 1)) <= 0) {
--(*idx);
iters++;
if (iters == 10) {
*idx = 0;
goto search;
}
}
struct bkey_i *ret = NULL;
rcu_read_lock(); /* for overwritten_ranges */
while ((k = *idx < keys->nr ? idx_to_key(keys, *idx) : NULL)) {
if (__journal_key_cmp(btree_id, level, end_pos, k) < 0)
break;
if (k->overwritten) {
if (k->overwritten_range)
*idx = rcu_dereference(k->overwritten_range)->end;
else
*idx += 1;
continue;
}
if (__journal_key_cmp(btree_id, level, pos, k) <= 0) {
ret = k->k;
break;
}
(*idx)++;
iters++;
if (iters == 10) {
*idx = 0;
rcu_read_unlock();
goto search;
}
}
rcu_read_unlock();
return ret;
}
struct bkey_i *bch2_journal_keys_peek_prev_min(struct bch_fs *c, enum btree_id btree_id,
unsigned level, struct bpos pos,
struct bpos end_pos, size_t *idx)
{
struct journal_keys *keys = &c->journal_keys;
unsigned iters = 0;
struct journal_key *k;
BUG_ON(*idx > keys->nr);
if (!keys->nr)
return NULL;
search:
if (!*idx)
*idx = __bch2_journal_key_search(keys, btree_id, level, pos);
while (*idx < keys->nr &&
__journal_key_cmp(btree_id, level, end_pos, idx_to_key(keys, *idx)) >= 0) {
(*idx)++;
iters++;
if (iters == 10) {
*idx = 0;
goto search;
}
}
if (*idx == keys->nr)
--(*idx);
struct bkey_i *ret = NULL;
rcu_read_lock(); /* for overwritten_ranges */
while (true) {
k = idx_to_key(keys, *idx);
if (__journal_key_cmp(btree_id, level, end_pos, k) > 0)
break;
if (k->overwritten) {
if (k->overwritten_range)
*idx = rcu_dereference(k->overwritten_range)->start;
if (!*idx)
break;
--(*idx);
continue;
}
if (__journal_key_cmp(btree_id, level, pos, k) >= 0) {
ret = k->k;
break;
}
if (!*idx)
break;
--(*idx);
iters++;
if (iters == 10) {
*idx = 0;
rcu_read_unlock();
goto search;
}
}
rcu_read_unlock();
return ret;
}
struct bkey_i *bch2_journal_keys_peek_slot(struct bch_fs *c, enum btree_id btree_id,
unsigned level, struct bpos pos)
{
size_t idx = 0;
return bch2_journal_keys_peek_max(c, btree_id, level, pos, pos, &idx);
}
static void journal_iter_verify(struct journal_iter *iter)
{
#ifdef CONFIG_BCACHEFS_DEBUG
struct journal_keys *keys = iter->keys;
size_t gap_size = keys->size - keys->nr;
BUG_ON(iter->idx >= keys->gap &&
iter->idx < keys->gap + gap_size);
if (iter->idx < keys->size) {
struct journal_key *k = keys->data + iter->idx;
int cmp = __journal_key_btree_cmp(iter->btree_id, iter->level, k);
BUG_ON(cmp > 0);
}
#endif
}
static void journal_iters_fix(struct bch_fs *c)
{
struct journal_keys *keys = &c->journal_keys;
/* The key we just inserted is immediately before the gap: */
size_t gap_end = keys->gap + (keys->size - keys->nr);
struct journal_key *new_key = &keys->data[keys->gap - 1];
struct journal_iter *iter;
/*
* If an iterator points one after the key we just inserted, decrement
* the iterator so it points at the key we just inserted - if the
* decrement was unnecessary, bch2_btree_and_journal_iter_peek() will
* handle that:
*/
list_for_each_entry(iter, &c->journal_iters, list) {
journal_iter_verify(iter);
if (iter->idx == gap_end &&
new_key->btree_id == iter->btree_id &&
new_key->level == iter->level)
iter->idx = keys->gap - 1;
journal_iter_verify(iter);
}
}
static void journal_iters_move_gap(struct bch_fs *c, size_t old_gap, size_t new_gap)
{
struct journal_keys *keys = &c->journal_keys;
struct journal_iter *iter;
size_t gap_size = keys->size - keys->nr;
list_for_each_entry(iter, &c->journal_iters, list) {
if (iter->idx > old_gap)
iter->idx -= gap_size;
if (iter->idx >= new_gap)
iter->idx += gap_size;
}
}
int bch2_journal_key_insert_take(struct bch_fs *c, enum btree_id id,
unsigned level, struct bkey_i *k)
{
struct journal_key n = {
.btree_id = id,
.level = level,
.k = k,
.allocated = true,
/*
* Ensure these keys are done last by journal replay, to unblock
* journal reclaim:
*/
.journal_seq = U64_MAX,
};
struct journal_keys *keys = &c->journal_keys;
size_t idx = bch2_journal_key_search(keys, id, level, k->k.p);
BUG_ON(test_bit(BCH_FS_rw, &c->flags));
if (idx < keys->size &&
journal_key_cmp(&n, &keys->data[idx]) == 0) {
if (keys->data[idx].allocated)
kfree(keys->data[idx].k);
keys->data[idx] = n;
return 0;
}
if (idx > keys->gap)
idx -= keys->size - keys->nr;
size_t old_gap = keys->gap;
if (keys->nr == keys->size) {
journal_iters_move_gap(c, old_gap, keys->size);
old_gap = keys->size;
struct journal_keys new_keys = {
.nr = keys->nr,
.size = max_t(size_t, keys->size, 8) * 2,
};
new_keys.data = bch2_kvmalloc(new_keys.size * sizeof(new_keys.data[0]), GFP_KERNEL);
if (!new_keys.data) {
bch_err(c, "%s: error allocating new key array (size %zu)",
__func__, new_keys.size);
return bch_err_throw(c, ENOMEM_journal_key_insert);
}
/* Since @keys was full, there was no gap: */
memcpy(new_keys.data, keys->data, sizeof(keys->data[0]) * keys->nr);
kvfree(keys->data);
keys->data = new_keys.data;
keys->nr = new_keys.nr;
keys->size = new_keys.size;
/* And now the gap is at the end: */
keys->gap = keys->nr;
}
journal_iters_move_gap(c, old_gap, idx);
move_gap(keys, idx);
keys->nr++;
keys->data[keys->gap++] = n;
journal_iters_fix(c);
return 0;
}
/*
* Can only be used from the recovery thread while we're still RO - can't be
* used once we've got RW, as journal_keys is at that point used by multiple
* threads:
*/
int bch2_journal_key_insert(struct bch_fs *c, enum btree_id id,
unsigned level, struct bkey_i *k)
{
struct bkey_i *n;
int ret;
n = kmalloc(bkey_bytes(&k->k), GFP_KERNEL);
if (!n)
return bch_err_throw(c, ENOMEM_journal_key_insert);
bkey_copy(n, k);
ret = bch2_journal_key_insert_take(c, id, level, n);
if (ret)
kfree(n);
return ret;
}
int bch2_journal_key_delete(struct bch_fs *c, enum btree_id id,
unsigned level, struct bpos pos)
{
struct bkey_i whiteout;
bkey_init(&whiteout.k);
whiteout.k.p = pos;
return bch2_journal_key_insert(c, id, level, &whiteout);
}
bool bch2_key_deleted_in_journal(struct btree_trans *trans, enum btree_id btree,
unsigned level, struct bpos pos)
{
struct journal_keys *keys = &trans->c->journal_keys;
size_t idx = bch2_journal_key_search(keys, btree, level, pos);
if (!trans->journal_replay_not_finished)
return false;
return (idx < keys->size &&
keys->data[idx].btree_id == btree &&
keys->data[idx].level == level &&
bpos_eq(keys->data[idx].k->k.p, pos) &&
bkey_deleted(&keys->data[idx].k->k));
}
static void __bch2_journal_key_overwritten(struct journal_keys *keys, size_t pos)
{
struct journal_key *k = keys->data + pos;
size_t idx = pos_to_idx(keys, pos);
k->overwritten = true;
struct journal_key *prev = idx > 0 ? keys->data + idx_to_pos(keys, idx - 1) : NULL;
struct journal_key *next = idx + 1 < keys->nr ? keys->data + idx_to_pos(keys, idx + 1) : NULL;
bool prev_overwritten = prev && prev->overwritten;
bool next_overwritten = next && next->overwritten;
struct journal_key_range_overwritten *prev_range =
prev_overwritten ? prev->overwritten_range : NULL;
struct journal_key_range_overwritten *next_range =
next_overwritten ? next->overwritten_range : NULL;
BUG_ON(prev_range && prev_range->end != idx);
BUG_ON(next_range && next_range->start != idx + 1);
if (prev_range && next_range) {
prev_range->end = next_range->end;
keys->data[pos].overwritten_range = prev_range;
for (size_t i = next_range->start; i < next_range->end; i++) {
struct journal_key *ip = keys->data + idx_to_pos(keys, i);
BUG_ON(ip->overwritten_range != next_range);
ip->overwritten_range = prev_range;
}
kfree_rcu_mightsleep(next_range);
} else if (prev_range) {
prev_range->end++;
k->overwritten_range = prev_range;
if (next_overwritten) {
prev_range->end++;
next->overwritten_range = prev_range;
}
} else if (next_range) {
next_range->start--;
k->overwritten_range = next_range;
if (prev_overwritten) {
next_range->start--;
prev->overwritten_range = next_range;
}
} else if (prev_overwritten || next_overwritten) {
struct journal_key_range_overwritten *r = kmalloc(sizeof(*r), GFP_KERNEL);
if (!r)
return;
r->start = idx - (size_t) prev_overwritten;
r->end = idx + 1 + (size_t) next_overwritten;
rcu_assign_pointer(k->overwritten_range, r);
if (prev_overwritten)
prev->overwritten_range = r;
if (next_overwritten)
next->overwritten_range = r;
}
}
void bch2_journal_key_overwritten(struct bch_fs *c, enum btree_id btree,
unsigned level, struct bpos pos)
{
struct journal_keys *keys = &c->journal_keys;
size_t idx = bch2_journal_key_search(keys, btree, level, pos);
if (idx < keys->size &&
keys->data[idx].btree_id == btree &&
keys->data[idx].level == level &&
bpos_eq(keys->data[idx].k->k.p, pos) &&
!keys->data[idx].overwritten) {
mutex_lock(&keys->overwrite_lock);
__bch2_journal_key_overwritten(keys, idx);
mutex_unlock(&keys->overwrite_lock);
}
}
static void bch2_journal_iter_advance(struct journal_iter *iter)
{
if (iter->idx < iter->keys->size) {
iter->idx++;
if (iter->idx == iter->keys->gap)
iter->idx += iter->keys->size - iter->keys->nr;
}
}
static struct bkey_s_c bch2_journal_iter_peek(struct journal_iter *iter)
{
journal_iter_verify(iter);
guard(rcu)();
while (iter->idx < iter->keys->size) {
struct journal_key *k = iter->keys->data + iter->idx;
int cmp = __journal_key_btree_cmp(iter->btree_id, iter->level, k);
if (cmp < 0)
break;
BUG_ON(cmp);
if (!k->overwritten)
return bkey_i_to_s_c(k->k);
if (k->overwritten_range)
iter->idx = idx_to_pos(iter->keys, rcu_dereference(k->overwritten_range)->end);
else
bch2_journal_iter_advance(iter);
}
return bkey_s_c_null;
}
static void bch2_journal_iter_exit(struct journal_iter *iter)
{
list_del(&iter->list);
}
static void bch2_journal_iter_init(struct bch_fs *c,
struct journal_iter *iter,
enum btree_id id, unsigned level,
struct bpos pos)
{
iter->btree_id = id;
iter->level = level;
iter->keys = &c->journal_keys;
iter->idx = bch2_journal_key_search(&c->journal_keys, id, level, pos);
journal_iter_verify(iter);
}
static struct bkey_s_c bch2_journal_iter_peek_btree(struct btree_and_journal_iter *iter)
{
return bch2_btree_node_iter_peek_unpack(&iter->node_iter,
iter->b, &iter->unpacked);
}
static void bch2_journal_iter_advance_btree(struct btree_and_journal_iter *iter)
{
bch2_btree_node_iter_advance(&iter->node_iter, iter->b);
}
void bch2_btree_and_journal_iter_advance(struct btree_and_journal_iter *iter)
{
if (bpos_eq(iter->pos, SPOS_MAX))
iter->at_end = true;
else
iter->pos = bpos_successor(iter->pos);
}
static void btree_and_journal_iter_prefetch(struct btree_and_journal_iter *_iter)
{
struct btree_and_journal_iter iter = *_iter;
struct bch_fs *c = iter.trans->c;
unsigned level = iter.journal.level;
struct bkey_buf tmp;
unsigned nr = test_bit(BCH_FS_started, &c->flags)
? (level > 1 ? 0 : 2)
: (level > 1 ? 1 : 16);
iter.prefetch = false;
iter.fail_if_too_many_whiteouts = true;
bch2_bkey_buf_init(&tmp);
while (nr--) {
bch2_btree_and_journal_iter_advance(&iter);
struct bkey_s_c k = bch2_btree_and_journal_iter_peek(&iter);
if (!k.k)
break;
bch2_bkey_buf_reassemble(&tmp, c, k);
bch2_btree_node_prefetch(iter.trans, NULL, tmp.k, iter.journal.btree_id, level - 1);
}
bch2_bkey_buf_exit(&tmp, c);
}
struct bkey_s_c bch2_btree_and_journal_iter_peek(struct btree_and_journal_iter *iter)
{
struct bkey_s_c btree_k, journal_k = bkey_s_c_null, ret;
size_t iters = 0;
if (iter->prefetch && iter->journal.level)
btree_and_journal_iter_prefetch(iter);
again:
if (iter->at_end)
return bkey_s_c_null;
iters++;
if (iters > 20 && iter->fail_if_too_many_whiteouts)
return bkey_s_c_null;
while ((btree_k = bch2_journal_iter_peek_btree(iter)).k &&
bpos_lt(btree_k.k->p, iter->pos))
bch2_journal_iter_advance_btree(iter);
if (iter->trans->journal_replay_not_finished)
while ((journal_k = bch2_journal_iter_peek(&iter->journal)).k &&
bpos_lt(journal_k.k->p, iter->pos))
bch2_journal_iter_advance(&iter->journal);
ret = journal_k.k &&
(!btree_k.k || bpos_le(journal_k.k->p, btree_k.k->p))
? journal_k
: btree_k;
if (ret.k && iter->b && bpos_gt(ret.k->p, iter->b->data->max_key))
ret = bkey_s_c_null;
if (ret.k) {
iter->pos = ret.k->p;
if (bkey_deleted(ret.k)) {
bch2_btree_and_journal_iter_advance(iter);
goto again;
}
} else {
iter->pos = SPOS_MAX;
iter->at_end = true;
}
return ret;
}
void bch2_btree_and_journal_iter_exit(struct btree_and_journal_iter *iter)
{
bch2_journal_iter_exit(&iter->journal);
}
void __bch2_btree_and_journal_iter_init_node_iter(struct btree_trans *trans,
struct btree_and_journal_iter *iter,
struct btree *b,
struct btree_node_iter node_iter,
struct bpos pos)
{
memset(iter, 0, sizeof(*iter));
iter->trans = trans;
iter->b = b;
iter->node_iter = node_iter;
iter->pos = b->data->min_key;
iter->at_end = false;
INIT_LIST_HEAD(&iter->journal.list);
if (trans->journal_replay_not_finished) {
bch2_journal_iter_init(trans->c, &iter->journal, b->c.btree_id, b->c.level, pos);
if (!test_bit(BCH_FS_may_go_rw, &trans->c->flags))
list_add(&iter->journal.list, &trans->c->journal_iters);
}
}
/*
* This version is used by btree_gc before the filesystem has gone RW and
* multithreaded, so it uses the journal_iters list:
*/
void bch2_btree_and_journal_iter_init_node_iter(struct btree_trans *trans,
struct btree_and_journal_iter *iter,
struct btree *b)
{
struct btree_node_iter node_iter;
bch2_btree_node_iter_init_from_start(&node_iter, b);
__bch2_btree_and_journal_iter_init_node_iter(trans, iter, b, node_iter, b->data->min_key);
}
/* sort and dedup all keys in the journal: */
/*
* When keys compare equal, oldest compares first:
*/
static int journal_sort_key_cmp(const void *_l, const void *_r)
{
const struct journal_key *l = _l;
const struct journal_key *r = _r;
int rewind = l->rewind && r->rewind ? -1 : 1;
return journal_key_cmp(l, r) ?:
((cmp_int(l->journal_seq, r->journal_seq) ?:
cmp_int(l->journal_offset, r->journal_offset)) * rewind);
}
void bch2_journal_keys_put(struct bch_fs *c)
{
struct journal_keys *keys = &c->journal_keys;
BUG_ON(atomic_read(&keys->ref) <= 0);
if (!atomic_dec_and_test(&keys->ref))
return;
move_gap(keys, keys->nr);
darray_for_each(*keys, i) {
if (i->overwritten_range &&
(i == &darray_last(*keys) ||
i->overwritten_range != i[1].overwritten_range))
kfree(i->overwritten_range);
if (i->allocated)
kfree(i->k);
}
kvfree(keys->data);
keys->data = NULL;
keys->nr = keys->gap = keys->size = 0;
struct journal_replay **i;
struct genradix_iter iter;
genradix_for_each(&c->journal_entries, iter, i)
kvfree(*i);
genradix_free(&c->journal_entries);
}
static void __journal_keys_sort(struct journal_keys *keys)
{
sort_nonatomic(keys->data, keys->nr, sizeof(keys->data[0]),
journal_sort_key_cmp, NULL);
cond_resched();
struct journal_key *dst = keys->data;
darray_for_each(*keys, src) {
/*
* We don't accumulate accounting keys here because we have to
* compare each individual accounting key against the version in
* the btree during replay:
*/
if (src->k->k.type != KEY_TYPE_accounting &&
src + 1 < &darray_top(*keys) &&
!journal_key_cmp(src, src + 1))
continue;
*dst++ = *src;
}
keys->nr = dst - keys->data;
}
int bch2_journal_keys_sort(struct bch_fs *c)
{
struct genradix_iter iter;
struct journal_replay *i, **_i;
struct journal_keys *keys = &c->journal_keys;
size_t nr_read = 0;
u64 rewind_seq = c->opts.journal_rewind ?: U64_MAX;
genradix_for_each(&c->journal_entries, iter, _i) {
i = *_i;
if (journal_replay_ignore(i))
continue;
cond_resched();
vstruct_for_each(&i->j, entry) {
bool rewind = !entry->level &&
!btree_id_is_alloc(entry->btree_id) &&
le64_to_cpu(i->j.seq) >= rewind_seq;
if (entry->type != (rewind
? BCH_JSET_ENTRY_overwrite
: BCH_JSET_ENTRY_btree_keys))
continue;
if (!rewind && le64_to_cpu(i->j.seq) < c->journal_replay_seq_start)
continue;
jset_entry_for_each_key(entry, k) {
struct journal_key n = (struct journal_key) {
.btree_id = entry->btree_id,
.level = entry->level,
.rewind = rewind,
.k = k,
.journal_seq = le64_to_cpu(i->j.seq),
.journal_offset = k->_data - i->j._data,
};
if (darray_push(keys, n)) {
__journal_keys_sort(keys);
if (keys->nr * 8 > keys->size * 7) {
bch_err(c, "Too many journal keys for slowpath; have %zu compacted, buf size %zu, processed %zu keys at seq %llu",
keys->nr, keys->size, nr_read, le64_to_cpu(i->j.seq));
return bch_err_throw(c, ENOMEM_journal_keys_sort);
}
BUG_ON(darray_push(keys, n));
}
nr_read++;
}
}
}
__journal_keys_sort(keys);
keys->gap = keys->nr;
bch_verbose(c, "Journal keys: %zu read, %zu after sorting and compacting", nr_read, keys->nr);
return 0;
}
void bch2_shoot_down_journal_keys(struct bch_fs *c, enum btree_id btree,
unsigned level_min, unsigned level_max,
struct bpos start, struct bpos end)
{
struct journal_keys *keys = &c->journal_keys;
size_t dst = 0;
move_gap(keys, keys->nr);
darray_for_each(*keys, i)
if (!(i->btree_id == btree &&
i->level >= level_min &&
i->level <= level_max &&
bpos_ge(i->k->k.p, start) &&
bpos_le(i->k->k.p, end)))
keys->data[dst++] = *i;
keys->nr = keys->gap = dst;
}
void bch2_journal_keys_dump(struct bch_fs *c)
{
struct journal_keys *keys = &c->journal_keys;
struct printbuf buf = PRINTBUF;
pr_info("%zu keys:", keys->nr);
move_gap(keys, keys->nr);
darray_for_each(*keys, i) {
printbuf_reset(&buf);
prt_printf(&buf, "btree=");
bch2_btree_id_to_text(&buf, i->btree_id);
prt_printf(&buf, " l=%u ", i->level);
bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(i->k));
pr_err("%s", buf.buf);
}
printbuf_exit(&buf);
}
void bch2_fs_journal_keys_init(struct bch_fs *c)
{
struct journal_keys *keys = &c->journal_keys;
atomic_set(&keys->ref, 1);
keys->initial_ref_held = true;
mutex_init(&keys->overwrite_lock);
}
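
`__bch2_journal_key_search()` in this file is a lower-bound binary search: it returns the index of the first key that is not less than the search key, and checks both neighbours afterwards with `BUG_ON()`. A standalone sketch of the same loop over a sorted `int` array, with `assert()` standing in for `BUG_ON()`; the function name and element type are assumptions made for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first element >= key, or nr if there is none. */
static size_t lower_bound_sketch(const int *v, size_t nr, int key)
{
	size_t l = 0, r = nr;

	while (l < r) {
		size_t m = l + ((r - l) >> 1);

		if (key > v[m])
			l = m + 1;
		else
			r = m;
	}

	/* The element at l is not less than key ... */
	assert(!(l < nr && key > v[l]));
	/* ... and the element before l is strictly less than key. */
	assert(!(l && key <= v[l - 1]));
	return l;
}

void lower_bound_example(void)
{
	static const int v[] = { 1, 3, 3, 7 };

	assert(lower_bound_sketch(v, 4, 0) == 0);
	assert(lower_bound_sketch(v, 4, 3) == 1);	/* first of the two 3s */
	assert(lower_bound_sketch(v, 4, 4) == 3);
	assert(lower_bound_sketch(v, 4, 8) == 4);	/* past the end */
}
```

Computing the midpoint as `l + ((r - l) >> 1)` avoids overflow in `l + r`.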

@ -1,102 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_JOURNAL_ITER_H
#define _BCACHEFS_BTREE_JOURNAL_ITER_H
#include "bkey.h"
struct journal_iter {
struct list_head list;
enum btree_id btree_id;
unsigned level;
size_t idx;
struct journal_keys *keys;
};
/*
* Iterate over keys in the btree, with keys from the journal overlaid on top:
*/
struct btree_and_journal_iter {
struct btree_trans *trans;
struct btree *b;
struct btree_node_iter node_iter;
struct bkey unpacked;
struct journal_iter journal;
struct bpos pos;
bool at_end;
bool prefetch;
bool fail_if_too_many_whiteouts;
};
static inline int __journal_key_btree_cmp(enum btree_id l_btree_id,
unsigned l_level,
const struct journal_key *r)
{
return -cmp_int(l_level, r->level) ?:
cmp_int(l_btree_id, r->btree_id);
}
static inline int __journal_key_cmp(enum btree_id l_btree_id,
unsigned l_level,
struct bpos l_pos,
const struct journal_key *r)
{
return __journal_key_btree_cmp(l_btree_id, l_level, r) ?:
bpos_cmp(l_pos, r->k->k.p);
}
static inline int journal_key_cmp(const struct journal_key *l, const struct journal_key *r)
{
return __journal_key_cmp(l->btree_id, l->level, l->k->k.p, r);
}
struct bkey_i *bch2_journal_keys_peek_max(struct bch_fs *, enum btree_id,
unsigned, struct bpos, struct bpos, size_t *);
struct bkey_i *bch2_journal_keys_peek_prev_min(struct bch_fs *, enum btree_id,
unsigned, struct bpos, struct bpos, size_t *);
struct bkey_i *bch2_journal_keys_peek_slot(struct bch_fs *, enum btree_id,
unsigned, struct bpos);
int bch2_btree_and_journal_iter_prefetch(struct btree_trans *, struct btree_path *,
struct btree_and_journal_iter *);
int bch2_journal_key_insert_take(struct bch_fs *, enum btree_id,
unsigned, struct bkey_i *);
int bch2_journal_key_insert(struct bch_fs *, enum btree_id,
unsigned, struct bkey_i *);
int bch2_journal_key_delete(struct bch_fs *, enum btree_id,
unsigned, struct bpos);
bool bch2_key_deleted_in_journal(struct btree_trans *, enum btree_id, unsigned, struct bpos);
void bch2_journal_key_overwritten(struct bch_fs *, enum btree_id, unsigned, struct bpos);
void bch2_btree_and_journal_iter_advance(struct btree_and_journal_iter *);
struct bkey_s_c bch2_btree_and_journal_iter_peek(struct btree_and_journal_iter *);
void bch2_btree_and_journal_iter_exit(struct btree_and_journal_iter *);
void __bch2_btree_and_journal_iter_init_node_iter(struct btree_trans *,
struct btree_and_journal_iter *, struct btree *,
struct btree_node_iter, struct bpos);
void bch2_btree_and_journal_iter_init_node_iter(struct btree_trans *,
struct btree_and_journal_iter *, struct btree *);
void bch2_journal_keys_put(struct bch_fs *);
static inline void bch2_journal_keys_put_initial(struct bch_fs *c)
{
if (c->journal_keys.initial_ref_held)
bch2_journal_keys_put(c);
c->journal_keys.initial_ref_held = false;
}
int bch2_journal_keys_sort(struct bch_fs *);
void bch2_shoot_down_journal_keys(struct bch_fs *, enum btree_id,
unsigned, unsigned,
struct bpos, struct bpos);
void bch2_journal_keys_dump(struct bch_fs *);
void bch2_fs_journal_keys_init(struct bch_fs *);
#endif /* _BCACHEFS_BTREE_JOURNAL_ITER_H */
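
The comparison helpers in this header order journal keys by level (negated, so higher levels sort first), then btree id, then position. A standalone sketch of that ordering, assuming a simplified key with an integer `pos` in place of `struct bpos`; the struct and function names are hypothetical:

```c
#include <assert.h>

/* Simplified key: an integer position stands in for struct bpos. */
struct key_sketch {
	unsigned		btree_id;
	unsigned		level;
	unsigned long long	pos;
};

static int cmp_ull(unsigned long long l, unsigned long long r)
{
	return (l > r) - (l < r);
}

/* Level descending (note the negation), then btree id, then position. */
static int key_sketch_cmp(const struct key_sketch *l, const struct key_sketch *r)
{
	int c = -cmp_ull(l->level, r->level);

	if (!c)
		c = cmp_ull(l->btree_id, r->btree_id);
	if (!c)
		c = cmp_ull(l->pos, r->pos);
	return c;
}

void key_cmp_example(void)
{
	struct key_sketch interior = { .btree_id = 2, .level = 1, .pos = 50 };
	struct key_sketch leaf_a   = { .btree_id = 0, .level = 0, .pos = 10 };
	struct key_sketch leaf_b   = { .btree_id = 1, .level = 0, .pos = 5 };

	/* The level 1 key sorts before any level 0 key. */
	assert(key_sketch_cmp(&interior, &leaf_a) < 0);
	/* At equal level, the lower btree id sorts first. */
	assert(key_sketch_cmp(&leaf_a, &leaf_b) < 0);
	assert(key_sketch_cmp(&leaf_b, &leaf_b) == 0);
}
```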

@ -1,37 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_JOURNAL_ITER_TYPES_H
#define _BCACHEFS_BTREE_JOURNAL_ITER_TYPES_H
struct journal_key_range_overwritten {
size_t start, end;
};
struct journal_key {
u64 journal_seq;
u32 journal_offset;
enum btree_id btree_id:8;
unsigned level:8;
bool allocated:1;
bool overwritten:1;
bool rewind:1;
struct journal_key_range_overwritten __rcu *
overwritten_range;
struct bkey_i *k;
};
struct journal_keys {
/* must match layout in darray_types.h */
size_t nr, size;
struct journal_key *data;
/*
* Gap buffer: instead of all the empty space in the array being at the
* end of the buffer - from @nr to @size - the empty space is at @gap.
* This means that sequential insertions are O(n) instead of O(n^2).
*/
size_t gap;
atomic_t ref;
bool initial_ref_held;
struct mutex overwrite_lock;
};
#endif /* _BCACHEFS_BTREE_JOURNAL_ITER_TYPES_H */
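
The gap-buffer comment in struct journal_keys above can be illustrated with a small userspace sketch. This is not the bcachefs implementation: the names gap_buf, gb_move_gap, gb_insert and gb_get are hypothetical, the element type is a plain int rather than struct journal_key, and the capacity is fixed. It only shows how keeping the free space at @gap makes an insert next to the previous insert free of any element moves.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace gap buffer over ints; not the bcachefs code. */
struct gap_buf {
	int	*data;
	size_t	nr;	/* live elements */
	size_t	size;	/* capacity */
	size_t	gap;	/* logical index where the free space begins */
};

static size_t gb_gap_len(const struct gap_buf *b)
{
	return b->size - b->nr;
}

/* Move the free space so that it starts at logical index @pos. */
static void gb_move_gap(struct gap_buf *b, size_t pos)
{
	size_t len = gb_gap_len(b);

	if (pos < b->gap)
		/* shift [pos, gap) up to the far side of the gap */
		memmove(b->data + pos + len, b->data + pos,
			(b->gap - pos) * sizeof(b->data[0]));
	else if (pos > b->gap)
		/* shift the elements just past the gap down into it */
		memmove(b->data + b->gap, b->data + b->gap + len,
			(pos - b->gap) * sizeof(b->data[0]));
	b->gap = pos;
}

/* Insert @v at logical index @pos; returns 0, or -1 if full or out of range. */
static int gb_insert(struct gap_buf *b, size_t pos, int v)
{
	if (b->nr == b->size || pos > b->nr)
		return -1;
	gb_move_gap(b, pos);
	b->data[b->gap++] = v;
	b->nr++;
	return 0;
}

/* Read logical index @idx, skipping over the gap. */
static int gb_get(const struct gap_buf *b, size_t idx)
{
	assert(idx < b->nr);
	return b->data[idx < b->gap ? idx : idx + gb_gap_len(b)];
}
```

When successive inserts land at adjacent positions, gb_move_gap() finds the gap already in place and copies nothing, which is the case the original comment describes. An insert far from the gap still pays for one memmove of the elements in between.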

@ -1,880 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
#include "bcachefs.h"
#include "btree_cache.h"
#include "btree_iter.h"
#include "btree_key_cache.h"
#include "btree_locking.h"
#include "btree_update.h"
#include "errcode.h"
#include "error.h"
#include "journal.h"
#include "journal_reclaim.h"
#include "trace.h"
#include <linux/sched/mm.h>
static inline bool btree_uses_pcpu_readers(enum btree_id id)
{
return id == BTREE_ID_subvolumes;
}
static struct kmem_cache *bch2_key_cache;
static int bch2_btree_key_cache_cmp_fn(struct rhashtable_compare_arg *arg,
const void *obj)
{
const struct bkey_cached *ck = obj;
const struct bkey_cached_key *key = arg->key;
return ck->key.btree_id != key->btree_id ||
!bpos_eq(ck->key.pos, key->pos);
}
static const struct rhashtable_params bch2_btree_key_cache_params = {
.head_offset = offsetof(struct bkey_cached, hash),
.key_offset = offsetof(struct bkey_cached, key),
.key_len = sizeof(struct bkey_cached_key),
.obj_cmpfn = bch2_btree_key_cache_cmp_fn,
.automatic_shrinking = true,
};
static inline void btree_path_cached_set(struct btree_trans *trans, struct btree_path *path,
struct bkey_cached *ck,
enum btree_node_locked_type lock_held)
{
path->l[0].lock_seq = six_lock_seq(&ck->c.lock);
path->l[0].b = (void *) ck;
mark_btree_node_locked(trans, path, 0, lock_held);
}
__flatten
inline struct bkey_cached *
bch2_btree_key_cache_find(struct bch_fs *c, enum btree_id btree_id, struct bpos pos)
{
struct bkey_cached_key key = {
.btree_id = btree_id,
.pos = pos,
};
return rhashtable_lookup_fast(&c->btree_key_cache.table, &key,
bch2_btree_key_cache_params);
}
static bool bkey_cached_lock_for_evict(struct bkey_cached *ck)
{
if (!six_trylock_intent(&ck->c.lock))
return false;
if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
six_unlock_intent(&ck->c.lock);
return false;
}
if (!six_trylock_write(&ck->c.lock)) {
six_unlock_intent(&ck->c.lock);
return false;
}
return true;
}
static bool bkey_cached_evict(struct btree_key_cache *c,
struct bkey_cached *ck)
{
bool ret = !rhashtable_remove_fast(&c->table, &ck->hash,
bch2_btree_key_cache_params);
if (ret) {
memset(&ck->key, ~0, sizeof(ck->key));
atomic_long_dec(&c->nr_keys);
}
return ret;
}
static void __bkey_cached_free(struct rcu_pending *pending, struct rcu_head *rcu)
{
struct bch_fs *c = container_of(pending->srcu, struct bch_fs, btree_trans_barrier);
struct bkey_cached *ck = container_of(rcu, struct bkey_cached, rcu);
this_cpu_dec(*c->btree_key_cache.nr_pending);
kmem_cache_free(bch2_key_cache, ck);
}
static inline void bkey_cached_free_noassert(struct btree_key_cache *bc,
struct bkey_cached *ck)
{
kfree(ck->k);
ck->k = NULL;
ck->u64s = 0;
six_unlock_write(&ck->c.lock);
six_unlock_intent(&ck->c.lock);
bool pcpu_readers = ck->c.lock.readers != NULL;
rcu_pending_enqueue(&bc->pending[pcpu_readers], &ck->rcu);
this_cpu_inc(*bc->nr_pending);
}
static void bkey_cached_free(struct btree_trans *trans,
struct btree_key_cache *bc,
struct bkey_cached *ck)
{
/*
* we'll hit strange issues in the SRCU code if we aren't holding an
* SRCU read lock...
*/
EBUG_ON(!trans->srcu_held);
bkey_cached_free_noassert(bc, ck);
}
static struct bkey_cached *__bkey_cached_alloc(unsigned key_u64s, gfp_t gfp)
{
gfp |= __GFP_ACCOUNT|__GFP_RECLAIMABLE;
struct bkey_cached *ck = kmem_cache_zalloc(bch2_key_cache, gfp);
if (unlikely(!ck))
return NULL;
ck->k = kmalloc(key_u64s * sizeof(u64), gfp);
if (unlikely(!ck->k)) {
kmem_cache_free(bch2_key_cache, ck);
return NULL;
}
ck->u64s = key_u64s;
return ck;
}
static struct bkey_cached *
bkey_cached_alloc(struct btree_trans *trans, struct btree_path *path, unsigned key_u64s)
{
struct bch_fs *c = trans->c;
struct btree_key_cache *bc = &c->btree_key_cache;
bool pcpu_readers = btree_uses_pcpu_readers(path->btree_id);
int ret;
struct bkey_cached *ck = container_of_or_null(
rcu_pending_dequeue(&bc->pending[pcpu_readers]),
struct bkey_cached, rcu);
if (ck)
goto lock;
ck = allocate_dropping_locks(trans, ret,
__bkey_cached_alloc(key_u64s, _gfp));
if (ret) {
if (ck)
kfree(ck->k);
kmem_cache_free(bch2_key_cache, ck);
return ERR_PTR(ret);
}
if (ck) {
bch2_btree_lock_init(&ck->c, pcpu_readers ? SIX_LOCK_INIT_PCPU : 0, GFP_KERNEL);
ck->c.cached = true;
goto lock;
}
ck = container_of_or_null(rcu_pending_dequeue_from_all(&bc->pending[pcpu_readers]),
struct bkey_cached, rcu);
if (ck)
goto lock;
lock:
six_lock_intent(&ck->c.lock, NULL, NULL);
six_lock_write(&ck->c.lock, NULL, NULL);
return ck;
}
static struct bkey_cached *
bkey_cached_reuse(struct btree_key_cache *c)
{
guard(rcu)();
struct bucket_table *tbl = rht_dereference_rcu(c->table.tbl, &c->table);
struct rhash_head *pos;
struct bkey_cached *ck;
for (unsigned i = 0; i < tbl->size; i++)
rht_for_each_entry_rcu(ck, pos, tbl, i, hash) {
if (!test_bit(BKEY_CACHED_DIRTY, &ck->flags) &&
bkey_cached_lock_for_evict(ck)) {
if (bkey_cached_evict(c, ck))
return ck;
six_unlock_write(&ck->c.lock);
six_unlock_intent(&ck->c.lock);
}
}
return NULL;
}
static int btree_key_cache_create(struct btree_trans *trans,
struct btree_path *path,
struct btree_path *ck_path,
struct bkey_s_c k)
{
struct bch_fs *c = trans->c;
struct btree_key_cache *bc = &c->btree_key_cache;
/*
* bch2_varint_decode can read past the end of the buffer by at
* most 7 bytes (it won't be used):
*/
unsigned key_u64s = k.k->u64s + 1;
/*
* Allocate some extra space so that the transaction commit path is less
* likely to have to reallocate, since that requires a transaction
* restart:
*/
key_u64s = min(256U, (key_u64s * 3) / 2);
key_u64s = roundup_pow_of_two(key_u64s);
struct bkey_cached *ck = bkey_cached_alloc(trans, ck_path, key_u64s);
int ret = PTR_ERR_OR_ZERO(ck);
if (ret)
return ret;
if (unlikely(!ck)) {
ck = bkey_cached_reuse(bc);
if (unlikely(!ck)) {
bch_err(c, "error allocating memory for key cache item, btree %s",
bch2_btree_id_str(ck_path->btree_id));
return bch_err_throw(c, ENOMEM_btree_key_cache_create);
}
}
ck->c.level = 0;
ck->c.btree_id = ck_path->btree_id;
ck->key.btree_id = ck_path->btree_id;
ck->key.pos = ck_path->pos;
ck->flags = 1U << BKEY_CACHED_ACCESSED;
if (unlikely(key_u64s > ck->u64s)) {
mark_btree_node_locked_noreset(ck_path, 0, BTREE_NODE_UNLOCKED);
struct bkey_i *new_k = allocate_dropping_locks(trans, ret,
kmalloc(key_u64s * sizeof(u64), _gfp));
if (unlikely(!new_k)) {
bch_err(trans->c, "error allocating memory for key cache key, btree %s u64s %u",
bch2_btree_id_str(ck->key.btree_id), key_u64s);
ret = bch_err_throw(c, ENOMEM_btree_key_cache_fill);
} else if (ret) {
kfree(new_k);
goto err;
}
kfree(ck->k);
ck->k = new_k;
ck->u64s = key_u64s;
}
bkey_reassemble(ck->k, k);
ret = bch2_btree_node_lock_write(trans, path, &path_l(path)->b->c);
if (unlikely(ret))
goto err;
ret = rhashtable_lookup_insert_fast(&bc->table, &ck->hash, bch2_btree_key_cache_params);
bch2_btree_node_unlock_write(trans, path, path_l(path)->b);
if (unlikely(ret)) /* raced with another fill? */
goto err;
atomic_long_inc(&bc->nr_keys);
six_unlock_write(&ck->c.lock);
enum six_lock_type lock_want = __btree_lock_want(ck_path, 0);
if (lock_want == SIX_LOCK_read)
six_lock_downgrade(&ck->c.lock);
btree_path_cached_set(trans, ck_path, ck, (enum btree_node_locked_type) lock_want);
ck_path->uptodate = BTREE_ITER_UPTODATE;
return 0;
err:
bkey_cached_free(trans, bc, ck);
mark_btree_node_locked_noreset(ck_path, 0, BTREE_NODE_UNLOCKED);
return ret;
}
static noinline_for_stack void do_trace_key_cache_fill(struct btree_trans *trans,
struct btree_path *ck_path,
struct bkey_s_c k)
{
struct printbuf buf = PRINTBUF;
bch2_bpos_to_text(&buf, ck_path->pos);
prt_char(&buf, ' ');
bch2_bkey_val_to_text(&buf, trans->c, k);
trace_key_cache_fill(trans, buf.buf);
printbuf_exit(&buf);
}
static noinline int btree_key_cache_fill(struct btree_trans *trans,
btree_path_idx_t ck_path_idx,
unsigned flags)
{
struct btree_path *ck_path = trans->paths + ck_path_idx;
if (flags & BTREE_ITER_cached_nofill) {
ck_path->l[0].b = NULL;
return 0;
}
struct bch_fs *c = trans->c;
struct btree_iter iter;
struct bkey_s_c k;
int ret;
bch2_trans_iter_init(trans, &iter, ck_path->btree_id, ck_path->pos,
BTREE_ITER_intent|
BTREE_ITER_key_cache_fill|
BTREE_ITER_cached_nofill);
iter.flags &= ~BTREE_ITER_with_journal;
k = bch2_btree_iter_peek_slot(trans, &iter);
ret = bkey_err(k);
if (ret)
goto err;
/* Recheck after btree lookup, before allocating: */
ck_path = trans->paths + ck_path_idx;
ret = bch2_btree_key_cache_find(c, ck_path->btree_id, ck_path->pos) ? -EEXIST : 0;
if (unlikely(ret))
goto out;
ret = btree_key_cache_create(trans, btree_iter_path(trans, &iter), ck_path, k);
if (ret)
goto err;
if (trace_key_cache_fill_enabled())
do_trace_key_cache_fill(trans, ck_path, k);
out:
/* We're not likely to need this iterator again: */
bch2_set_btree_iter_dontneed(trans, &iter);
err:
bch2_trans_iter_exit(trans, &iter);
return ret;
}
static inline int btree_path_traverse_cached_fast(struct btree_trans *trans,
btree_path_idx_t path_idx)
{
struct bch_fs *c = trans->c;
struct bkey_cached *ck;
struct btree_path *path = trans->paths + path_idx;
retry:
ck = bch2_btree_key_cache_find(c, path->btree_id, path->pos);
if (!ck)
return -ENOENT;
enum six_lock_type lock_want = __btree_lock_want(path, 0);
int ret = btree_node_lock(trans, path, (void *) ck, 0, lock_want, _THIS_IP_);
if (ret)
return ret;
if (ck->key.btree_id != path->btree_id ||
!bpos_eq(ck->key.pos, path->pos)) {
six_unlock_type(&ck->c.lock, lock_want);
goto retry;
}
if (!test_bit(BKEY_CACHED_ACCESSED, &ck->flags))
set_bit(BKEY_CACHED_ACCESSED, &ck->flags);
btree_path_cached_set(trans, path, ck, (enum btree_node_locked_type) lock_want);
path->uptodate = BTREE_ITER_UPTODATE;
return 0;
}
int bch2_btree_path_traverse_cached(struct btree_trans *trans,
btree_path_idx_t path_idx,
unsigned flags)
{
EBUG_ON(trans->paths[path_idx].level);
int ret;
do {
ret = btree_path_traverse_cached_fast(trans, path_idx);
if (unlikely(ret == -ENOENT))
ret = btree_key_cache_fill(trans, path_idx, flags);
} while (ret == -EEXIST);
struct btree_path *path = trans->paths + path_idx;
if (unlikely(ret)) {
path->uptodate = BTREE_ITER_NEED_TRAVERSE;
if (!bch2_err_matches(ret, BCH_ERR_transaction_restart)) {
btree_node_unlock(trans, path, 0);
path->l[0].b = ERR_PTR(ret);
}
} else {
BUG_ON(path->uptodate);
BUG_ON(!path->nodes_locked);
}
return ret;
}
static int btree_key_cache_flush_pos(struct btree_trans *trans,
struct bkey_cached_key key,
u64 journal_seq,
unsigned commit_flags,
bool evict)
{
struct bch_fs *c = trans->c;
struct journal *j = &c->journal;
struct btree_iter c_iter, b_iter;
struct bkey_cached *ck = NULL;
int ret;
bch2_trans_iter_init(trans, &b_iter, key.btree_id, key.pos,
BTREE_ITER_slots|
BTREE_ITER_intent|
BTREE_ITER_all_snapshots);
bch2_trans_iter_init(trans, &c_iter, key.btree_id, key.pos,
BTREE_ITER_cached|
BTREE_ITER_intent);
b_iter.flags &= ~BTREE_ITER_with_key_cache;
ret = bch2_btree_iter_traverse(trans, &c_iter);
if (ret)
goto out;
ck = (void *) btree_iter_path(trans, &c_iter)->l[0].b;
if (!ck)
goto out;
if (!test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
if (evict)
goto evict;
goto out;
}
if (journal_seq && ck->journal.seq != journal_seq)
goto out;
trans->journal_res.seq = ck->journal.seq;
/*
* If we're at the end of the journal, we really want to free up space
* in the journal right away - we don't want to pin that old journal
* sequence number with a new btree node write, we want to re-journal
* the update
*/
if (ck->journal.seq == journal_last_seq(j))
commit_flags |= BCH_WATERMARK_reclaim;
if (ck->journal.seq != journal_last_seq(j) ||
!test_bit(JOURNAL_space_low, &c->journal.flags))
commit_flags |= BCH_TRANS_COMMIT_no_journal_res;
struct bkey_s_c btree_k = bch2_btree_iter_peek_slot(trans, &b_iter);
ret = bkey_err(btree_k);
if (ret)
goto err;
/* Check that we're not violating cache coherency rules: */
BUG_ON(bkey_deleted(btree_k.k));
ret = bch2_trans_update(trans, &b_iter, ck->k,
BTREE_UPDATE_key_cache_reclaim|
BTREE_UPDATE_internal_snapshot_node|
BTREE_TRIGGER_norun) ?:
bch2_trans_commit(trans, NULL, NULL,
BCH_TRANS_COMMIT_no_check_rw|
BCH_TRANS_COMMIT_no_enospc|
commit_flags);
err:
bch2_fs_fatal_err_on(ret &&
!bch2_err_matches(ret, BCH_ERR_transaction_restart) &&
!bch2_err_matches(ret, BCH_ERR_journal_reclaim_would_deadlock) &&
!bch2_journal_error(j), c,
"flushing key cache: %s", bch2_err_str(ret));
if (ret)
goto out;
bch2_journal_pin_drop(j, &ck->journal);
struct btree_path *path = btree_iter_path(trans, &c_iter);
BUG_ON(!btree_node_locked(path, 0));
if (!evict) {
if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
clear_bit(BKEY_CACHED_DIRTY, &ck->flags);
atomic_long_dec(&c->btree_key_cache.nr_dirty);
}
} else {
struct btree_path *path2;
unsigned i;
evict:
trans_for_each_path(trans, path2, i)
if (path2 != path)
__bch2_btree_path_unlock(trans, path2);
bch2_btree_node_lock_write_nofail(trans, path, &ck->c);
if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
clear_bit(BKEY_CACHED_DIRTY, &ck->flags);
atomic_long_dec(&c->btree_key_cache.nr_dirty);
}
mark_btree_node_locked_noreset(path, 0, BTREE_NODE_UNLOCKED);
if (bkey_cached_evict(&c->btree_key_cache, ck)) {
bkey_cached_free(trans, &c->btree_key_cache, ck);
} else {
six_unlock_write(&ck->c.lock);
six_unlock_intent(&ck->c.lock);
}
}
out:
bch2_trans_iter_exit(trans, &b_iter);
bch2_trans_iter_exit(trans, &c_iter);
return ret;
}
int bch2_btree_key_cache_journal_flush(struct journal *j,
struct journal_entry_pin *pin, u64 seq)
{
struct bch_fs *c = container_of(j, struct bch_fs, journal);
struct bkey_cached *ck =
container_of(pin, struct bkey_cached, journal);
struct bkey_cached_key key;
struct btree_trans *trans = bch2_trans_get(c);
int srcu_idx = srcu_read_lock(&c->btree_trans_barrier);
int ret = 0;
btree_node_lock_nopath_nofail(trans, &ck->c, SIX_LOCK_read);
key = ck->key;
if (ck->journal.seq != seq ||
!test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
six_unlock_read(&ck->c.lock);
goto unlock;
}
if (ck->seq != seq) {
bch2_journal_pin_update(&c->journal, ck->seq, &ck->journal,
bch2_btree_key_cache_journal_flush);
six_unlock_read(&ck->c.lock);
goto unlock;
}
six_unlock_read(&ck->c.lock);
ret = lockrestart_do(trans,
btree_key_cache_flush_pos(trans, key, seq,
BCH_TRANS_COMMIT_journal_reclaim, false));
unlock:
srcu_read_unlock(&c->btree_trans_barrier, srcu_idx);
bch2_trans_put(trans);
return ret;
}
bool bch2_btree_insert_key_cached(struct btree_trans *trans,
unsigned flags,
struct btree_insert_entry *insert_entry)
{
struct bch_fs *c = trans->c;
struct bkey_cached *ck = (void *) (trans->paths + insert_entry->path)->l[0].b;
struct bkey_i *insert = insert_entry->k;
bool kick_reclaim = false;
BUG_ON(insert->k.u64s > ck->u64s);
bkey_copy(ck->k, insert);
if (!test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
EBUG_ON(test_bit(BCH_FS_clean_shutdown, &c->flags));
set_bit(BKEY_CACHED_DIRTY, &ck->flags);
atomic_long_inc(&c->btree_key_cache.nr_dirty);
if (bch2_nr_btree_keys_need_flush(c))
kick_reclaim = true;
}
/*
* To minimize lock contention, we only add the journal pin here and
* defer pin updates to the flush callback via ->seq. Be careful not to
* update ->seq on nojournal commits because we don't want to update the
* pin to a seq that doesn't include journal updates on disk. Otherwise
* we risk losing the update after a crash.
*
* The only exception is if the pin is not active in the first place. We
* have to add the pin because journal reclaim drives key cache
* flushing. The flush callback will not proceed unless ->seq matches
* the latest pin, so make sure it starts with a consistent value.
*/
if (!(insert_entry->flags & BTREE_UPDATE_nojournal) ||
!journal_pin_active(&ck->journal)) {
ck->seq = trans->journal_res.seq;
}
bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
&ck->journal, bch2_btree_key_cache_journal_flush);
if (kick_reclaim)
journal_reclaim_kick(&c->journal);
return true;
}
void bch2_btree_key_cache_drop(struct btree_trans *trans,
struct btree_path *path)
{
struct bch_fs *c = trans->c;
struct btree_key_cache *bc = &c->btree_key_cache;
struct bkey_cached *ck = (void *) path->l[0].b;
/*
* We just did an update to the btree, bypassing the key cache: the key
* cache key is now stale and must be dropped, even if dirty:
*/
if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
clear_bit(BKEY_CACHED_DIRTY, &ck->flags);
atomic_long_dec(&c->btree_key_cache.nr_dirty);
bch2_journal_pin_drop(&c->journal, &ck->journal);
}
bkey_cached_evict(bc, ck);
bkey_cached_free(trans, bc, ck);
mark_btree_node_locked(trans, path, 0, BTREE_NODE_UNLOCKED);
struct btree_path *path2;
unsigned i;
trans_for_each_path(trans, path2, i)
if (path2->l[0].b == (void *) ck) {
/*
* It's safe to clear should_be_locked here because
* we're evicting from the key cache, and we still have
* the underlying btree locked: filling into the key
* cache would require taking a write lock on the btree
* node
*/
path2->should_be_locked = false;
__bch2_btree_path_unlock(trans, path2);
path2->l[0].b = ERR_PTR(-BCH_ERR_no_btree_node_drop);
btree_path_set_dirty(trans, path2, BTREE_ITER_NEED_TRAVERSE);
}
bch2_trans_verify_locks(trans);
}
static unsigned long bch2_btree_key_cache_scan(struct shrinker *shrink,
struct shrink_control *sc)
{
struct bch_fs *c = shrink->private_data;
struct btree_key_cache *bc = &c->btree_key_cache;
struct bucket_table *tbl;
struct bkey_cached *ck;
size_t scanned = 0, freed = 0, nr = sc->nr_to_scan;
unsigned iter, start;
int srcu_idx;
srcu_idx = srcu_read_lock(&c->btree_trans_barrier);
rcu_read_lock();
tbl = rht_dereference_rcu(bc->table.tbl, &bc->table);
/*
* Scanning is expensive while a rehash is in progress, since most
* elements will already be on the new hashtable
*
* A rehash could still start while we're scanning - that's ok, we'll
* still see most elements.
*/
if (unlikely(tbl->nest)) {
rcu_read_unlock();
srcu_read_unlock(&c->btree_trans_barrier, srcu_idx);
return SHRINK_STOP;
}
iter = bc->shrink_iter;
if (iter >= tbl->size)
iter = 0;
start = iter;
do {
struct rhash_head *pos, *next;
pos = rht_ptr_rcu(&tbl->buckets[iter]);
while (!rht_is_a_nulls(pos)) {
next = rht_dereference_bucket_rcu(pos->next, tbl, iter);
ck = container_of(pos, struct bkey_cached, hash);
if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
bc->skipped_dirty++;
} else if (test_bit(BKEY_CACHED_ACCESSED, &ck->flags)) {
clear_bit(BKEY_CACHED_ACCESSED, &ck->flags);
bc->skipped_accessed++;
} else if (!bkey_cached_lock_for_evict(ck)) {
bc->skipped_lock_fail++;
} else if (bkey_cached_evict(bc, ck)) {
bkey_cached_free_noassert(bc, ck);
bc->freed++;
freed++;
} else {
six_unlock_write(&ck->c.lock);
six_unlock_intent(&ck->c.lock);
}
scanned++;
if (scanned >= nr)
goto out;
pos = next;
}
iter++;
if (iter >= tbl->size)
iter = 0;
} while (scanned < nr && iter != start);
out:
bc->shrink_iter = iter;
rcu_read_unlock();
srcu_read_unlock(&c->btree_trans_barrier, srcu_idx);
return freed;
}
static unsigned long bch2_btree_key_cache_count(struct shrinker *shrink,
struct shrink_control *sc)
{
struct bch_fs *c = shrink->private_data;
struct btree_key_cache *bc = &c->btree_key_cache;
long nr = atomic_long_read(&bc->nr_keys) -
atomic_long_read(&bc->nr_dirty);
/*
* Avoid hammering our shrinker too much if it's nearly empty - the
* shrinker code doesn't take into account how big our cache is, if it's
* mostly empty but the system is under memory pressure it causes nasty
* lock contention:
*/
nr -= 128;
return max(0L, nr);
}
void bch2_fs_btree_key_cache_exit(struct btree_key_cache *bc)
{
struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
struct bucket_table *tbl;
struct bkey_cached *ck;
struct rhash_head *pos;
LIST_HEAD(items);
unsigned i;
shrinker_free(bc->shrink);
/*
* The loop is needed to guard against racing with rehash:
*/
while (atomic_long_read(&bc->nr_keys)) {
rcu_read_lock();
tbl = rht_dereference_rcu(bc->table.tbl, &bc->table);
if (tbl) {
if (tbl->nest) {
/* wait for in progress rehash */
rcu_read_unlock();
mutex_lock(&bc->table.mutex);
mutex_unlock(&bc->table.mutex);
continue;
}
for (i = 0; i < tbl->size; i++)
while (pos = rht_ptr_rcu(&tbl->buckets[i]), !rht_is_a_nulls(pos)) {
ck = container_of(pos, struct bkey_cached, hash);
BUG_ON(!bkey_cached_evict(bc, ck));
kfree(ck->k);
kmem_cache_free(bch2_key_cache, ck);
}
}
rcu_read_unlock();
}
if (atomic_long_read(&bc->nr_dirty) &&
!bch2_journal_error(&c->journal) &&
test_bit(BCH_FS_was_rw, &c->flags))
panic("btree key cache shutdown error: nr_dirty nonzero (%li)\n",
atomic_long_read(&bc->nr_dirty));
if (atomic_long_read(&bc->nr_keys))
panic("btree key cache shutdown error: nr_keys nonzero (%li)\n",
atomic_long_read(&bc->nr_keys));
if (bc->table_init_done)
rhashtable_destroy(&bc->table);
rcu_pending_exit(&bc->pending[0]);
rcu_pending_exit(&bc->pending[1]);
free_percpu(bc->nr_pending);
}
void bch2_fs_btree_key_cache_init_early(struct btree_key_cache *c)
{
}
int bch2_fs_btree_key_cache_init(struct btree_key_cache *bc)
{
struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
struct shrinker *shrink;
bc->nr_pending = alloc_percpu(size_t);
if (!bc->nr_pending)
return bch_err_throw(c, ENOMEM_fs_btree_cache_init);
if (rcu_pending_init(&bc->pending[0], &c->btree_trans_barrier, __bkey_cached_free) ||
rcu_pending_init(&bc->pending[1], &c->btree_trans_barrier, __bkey_cached_free))
return bch_err_throw(c, ENOMEM_fs_btree_cache_init);
if (rhashtable_init(&bc->table, &bch2_btree_key_cache_params))
return bch_err_throw(c, ENOMEM_fs_btree_cache_init);
bc->table_init_done = true;
shrink = shrinker_alloc(0, "%s-btree_key_cache", c->name);
if (!shrink)
return bch_err_throw(c, ENOMEM_fs_btree_cache_init);
bc->shrink = shrink;
shrink->count_objects = bch2_btree_key_cache_count;
shrink->scan_objects = bch2_btree_key_cache_scan;
shrink->batch = 1 << 14;
shrink->seeks = 0;
shrink->private_data = c;
shrinker_register(shrink);
return 0;
}
void bch2_btree_key_cache_to_text(struct printbuf *out, struct btree_key_cache *bc)
{
printbuf_tabstop_push(out, 24);
printbuf_tabstop_push(out, 12);
prt_printf(out, "keys:\t%lu\r\n", atomic_long_read(&bc->nr_keys));
prt_printf(out, "dirty:\t%lu\r\n", atomic_long_read(&bc->nr_dirty));
prt_printf(out, "table size:\t%u\r\n", bc->table.tbl->size);
prt_newline(out);
prt_printf(out, "shrinker:\n");
prt_printf(out, "requested_to_free:\t%lu\r\n", bc->requested_to_free);
prt_printf(out, "freed:\t%lu\r\n", bc->freed);
prt_printf(out, "skipped_dirty:\t%lu\r\n", bc->skipped_dirty);
prt_printf(out, "skipped_accessed:\t%lu\r\n", bc->skipped_accessed);
prt_printf(out, "skipped_lock_fail:\t%lu\r\n", bc->skipped_lock_fail);
prt_newline(out);
prt_printf(out, "pending:\t%zu\r\n", per_cpu_sum(bc->nr_pending));
}
void bch2_btree_key_cache_exit(void)
{
kmem_cache_destroy(bch2_key_cache);
}
int __init bch2_btree_key_cache_init(void)
{
bch2_key_cache = KMEM_CACHE(bkey_cached, SLAB_RECLAIM_ACCOUNT);
if (!bch2_key_cache)
return -ENOMEM;
return 0;
}
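
The buffer sizing done inline in btree_key_cache_create() above (one extra u64 to cover the varint decode overrun, 50% headroom capped at 256 u64s, then rounding up to a power of two) can be checked in isolation. The following is a minimal userspace sketch, not kernel code: key_cache_alloc_u64s() is a hypothetical name for the inline arithmetic, and round_up_pow2() is a hypothetical stand-in for the kernel's roundup_pow_of_two().

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's roundup_pow_of_two(); expects n >= 1. */
static unsigned round_up_pow2(unsigned n)
{
	unsigned p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Hypothetical helper mirroring the arithmetic in btree_key_cache_create():
 * @key_u64s is the size of the key being cached, in u64s.
 */
static unsigned key_cache_alloc_u64s(unsigned key_u64s)
{
	/* one spare u64 so a varint decode may read slightly past the key */
	unsigned n = key_u64s + 1;

	/* 50% headroom so later updates are less likely to reallocate */
	unsigned padded = (n * 3) / 2;

	/* cap the headroom at 256 u64s, then round up to a power of two */
	n = padded < 256U ? padded : 256U;
	return round_up_pow2(n);
}
```

For a 5-u64 key this yields 16 u64s, while anything that would pad past 256 is clamped to exactly 256.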

@ -1,59 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_KEY_CACHE_H
#define _BCACHEFS_BTREE_KEY_CACHE_H
static inline size_t bch2_nr_btree_keys_need_flush(struct bch_fs *c)
{
size_t nr_dirty = atomic_long_read(&c->btree_key_cache.nr_dirty);
size_t nr_keys = atomic_long_read(&c->btree_key_cache.nr_keys);
size_t max_dirty = 1024 + nr_keys / 2;
return max_t(ssize_t, 0, nr_dirty - max_dirty);
}
static inline ssize_t __bch2_btree_key_cache_must_wait(struct bch_fs *c)
{
size_t nr_dirty = atomic_long_read(&c->btree_key_cache.nr_dirty);
size_t nr_keys = atomic_long_read(&c->btree_key_cache.nr_keys);
size_t max_dirty = 4096 + (nr_keys * 3) / 4;
return nr_dirty - max_dirty;
}
static inline bool bch2_btree_key_cache_must_wait(struct bch_fs *c)
{
return __bch2_btree_key_cache_must_wait(c) > 0;
}
static inline bool bch2_btree_key_cache_wait_done(struct bch_fs *c)
{
size_t nr_dirty = atomic_long_read(&c->btree_key_cache.nr_dirty);
size_t nr_keys = atomic_long_read(&c->btree_key_cache.nr_keys);
size_t max_dirty = 2048 + (nr_keys * 5) / 8;
return nr_dirty <= max_dirty;
}
int bch2_btree_key_cache_journal_flush(struct journal *,
struct journal_entry_pin *, u64);
struct bkey_cached *
bch2_btree_key_cache_find(struct bch_fs *, enum btree_id, struct bpos);
int bch2_btree_path_traverse_cached(struct btree_trans *, btree_path_idx_t, unsigned);
bool bch2_btree_insert_key_cached(struct btree_trans *, unsigned,
struct btree_insert_entry *);
void bch2_btree_key_cache_drop(struct btree_trans *,
struct btree_path *);
void bch2_fs_btree_key_cache_exit(struct btree_key_cache *);
void bch2_fs_btree_key_cache_init_early(struct btree_key_cache *);
int bch2_fs_btree_key_cache_init(struct btree_key_cache *);
void bch2_btree_key_cache_to_text(struct printbuf *, struct btree_key_cache *);
void bch2_btree_key_cache_exit(void);
int __init bch2_btree_key_cache_init(void);
#endif /* _BCACHEFS_BTREE_KEY_CACHE_H */
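
The three inline helpers in this header define staggered dirty-key thresholds: reclaim is kicked above 1024 + nr_keys/2, callers must wait above 4096 + 3/4 of nr_keys, and waiting ends once the count falls to 2048 + 5/8 of nr_keys or below. The sketch below restates that arithmetic in userspace so the gap between the last two thresholds is easy to see. It assumes plain size_t arguments instead of the atomic counters in struct btree_key_cache, and the function names are hypothetical shorthand for the bch2_* helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Dirty keys in excess of the soft limit; nonzero means reclaim should run. */
static size_t keys_need_flush(size_t nr_dirty, size_t nr_keys)
{
	size_t max_dirty = 1024 + nr_keys / 2;

	return nr_dirty > max_dirty ? nr_dirty - max_dirty : 0;
}

/* Hard limit: above this, the caller has to wait for flushing. */
static bool must_wait(size_t nr_dirty, size_t nr_keys)
{
	return nr_dirty > 4096 + (nr_keys * 3) / 4;
}

/* Release point: lower than the hard limit, so waiters do not flap. */
static bool wait_done(size_t nr_dirty, size_t nr_keys)
{
	return nr_dirty <= 2048 + (nr_keys * 5) / 8;
}
```

With 100000 cached keys, reclaim starts above 51024 dirty keys, waiting starts above 79096, and waiting ends at 64548 or fewer. Between 64549 and 79096 neither must_wait() nor wait_done() is true, which gives the hysteresis band.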

@ -1,34 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_KEY_CACHE_TYPES_H
#define _BCACHEFS_BTREE_KEY_CACHE_TYPES_H
#include "rcu_pending.h"
struct btree_key_cache {
struct rhashtable table;
bool table_init_done;
struct shrinker *shrink;
unsigned shrink_iter;
/* 0: non pcpu reader locks, 1: pcpu reader locks */
struct rcu_pending pending[2];
size_t __percpu *nr_pending;
atomic_long_t nr_keys;
atomic_long_t nr_dirty;
/* shrinker stats */
unsigned long requested_to_free;
unsigned long freed;
unsigned long skipped_dirty;
unsigned long skipped_accessed;
unsigned long skipped_lock_fail;
};
struct bkey_cached_key {
u32 btree_id;
struct bpos pos;
} __packed __aligned(4);
#endif /* _BCACHEFS_BTREE_KEY_CACHE_TYPES_H */
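
The shrink_iter cursor and the skipped_* counters in struct btree_key_cache are used by bch2_btree_key_cache_scan(), which gives recently accessed entries a second chance before evicting them. The sketch below is a simplified userspace model of that policy, not the kernel code. It assumes a flat array with at most one entry per slot instead of an rhashtable, leaves out all locking and the skipped_lock_fail case, and uses hypothetical names (struct entry, struct scan_stats, scan).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified cache entry. */
struct entry {
	bool	present;
	bool	dirty;
	bool	accessed;
};

struct scan_stats {
	size_t	freed;
	size_t	skipped_dirty;
	size_t	skipped_accessed;
};

/*
 * Scan up to @nr entries starting at *@iter, wrapping around at @size.
 * Dirty entries are skipped; accessed entries lose their accessed bit and
 * survive this pass; everything else is evicted. Returns the number freed
 * and stores the resume position in *@iter.
 */
static size_t scan(struct entry *tbl, size_t size, size_t *iter, size_t nr,
		   struct scan_stats *st)
{
	size_t scanned = 0, freed = 0;

	if (!size)
		return 0;

	size_t i = *iter < size ? *iter : 0;
	size_t start = i;

	do {
		struct entry *e = &tbl[i];

		if (e->present) {
			if (e->dirty) {
				st->skipped_dirty++;
			} else if (e->accessed) {
				e->accessed = false;
				st->skipped_accessed++;
			} else {
				e->present = false;
				st->freed++;
				freed++;
			}
			scanned++;
		}
		i = (i + 1) % size;
	} while (scanned < nr && i != start);

	*iter = i;
	return freed;
}
```

An entry that is clean and accessed survives the first pass with its accessed bit cleared, and is evicted on the next pass unless it is touched again in between. Dirty entries are never evicted by the scan, matching the order of the checks in the original function.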

@ -1,936 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
#include "bcachefs.h"
#include "btree_cache.h"
#include "btree_locking.h"
#include "btree_types.h"
static struct lock_class_key bch2_btree_node_lock_key;
void bch2_btree_lock_init(struct btree_bkey_cached_common *b,
enum six_lock_init_flags flags,
gfp_t gfp)
{
__six_lock_init(&b->lock, "b->c.lock", &bch2_btree_node_lock_key, flags, gfp);
lockdep_set_notrack_class(&b->lock);
}
/* Btree node locking: */
struct six_lock_count bch2_btree_node_lock_counts(struct btree_trans *trans,
struct btree_path *skip,
struct btree_bkey_cached_common *b,
unsigned level)
{
struct btree_path *path;
struct six_lock_count ret;
unsigned i;
memset(&ret, 0, sizeof(ret));
if (IS_ERR_OR_NULL(b))
return ret;
trans_for_each_path(trans, path, i)
if (path != skip && &path->l[level].b->c == b) {
int t = btree_node_locked_type(path, level);
if (t != BTREE_NODE_UNLOCKED)
ret.n[t]++;
}
return ret;
}
/* unlock */
void bch2_btree_node_unlock_write(struct btree_trans *trans,
struct btree_path *path, struct btree *b)
{
bch2_btree_node_unlock_write_inlined(trans, path, b);
}
/* lock */
/*
* @trans wants to lock @b with type @type
*/
struct trans_waiting_for_lock {
struct btree_trans *trans;
struct btree_bkey_cached_common *node_want;
enum six_lock_type lock_want;
/* for iterating over held locks: */
u8 path_idx;
u8 level;
u64 lock_start_time;
};
struct lock_graph {
struct trans_waiting_for_lock g[8];
unsigned nr;
};
static noinline void print_cycle(struct printbuf *out, struct lock_graph *g)
{
struct trans_waiting_for_lock *i;
prt_printf(out, "Found lock cycle (%u entries):\n", g->nr);
for (i = g->g; i < g->g + g->nr; i++) {
struct task_struct *task = READ_ONCE(i->trans->locking_wait.task);
if (!task)
continue;
bch2_btree_trans_to_text(out, i->trans);
bch2_prt_task_backtrace(out, task, i == g->g ? 5 : 1, GFP_NOWAIT);
}
}
static noinline void print_chain(struct printbuf *out, struct lock_graph *g)
{
struct trans_waiting_for_lock *i;
for (i = g->g; i != g->g + g->nr; i++) {
struct task_struct *task = READ_ONCE(i->trans->locking_wait.task);
if (i != g->g)
prt_str(out, "<- ");
prt_printf(out, "%u ", task ? task->pid : 0);
}
prt_newline(out);
}
static void lock_graph_up(struct lock_graph *g)
{
closure_put(&g->g[--g->nr].trans->ref);
}
static noinline void lock_graph_pop_all(struct lock_graph *g)
{
while (g->nr)
lock_graph_up(g);
}
static noinline void lock_graph_pop_from(struct lock_graph *g, struct trans_waiting_for_lock *i)
{
while (g->g + g->nr > i)
lock_graph_up(g);
}
static void __lock_graph_down(struct lock_graph *g, struct btree_trans *trans)
{
g->g[g->nr++] = (struct trans_waiting_for_lock) {
.trans = trans,
.node_want = trans->locking,
.lock_want = trans->locking_wait.lock_want,
};
}
static void lock_graph_down(struct lock_graph *g, struct btree_trans *trans)
{
closure_get(&trans->ref);
__lock_graph_down(g, trans);
}
static bool lock_graph_remove_non_waiters(struct lock_graph *g,
struct trans_waiting_for_lock *from)
{
struct trans_waiting_for_lock *i;
if (from->trans->locking != from->node_want) {
lock_graph_pop_from(g, from);
return true;
}
for (i = from + 1; i < g->g + g->nr; i++)
if (i->trans->locking != i->node_want ||
i->trans->locking_wait.start_time != i[-1].lock_start_time) {
lock_graph_pop_from(g, i);
return true;
}
return false;
}
static void trace_would_deadlock(struct lock_graph *g, struct btree_trans *trans)
{
struct bch_fs *c = trans->c;
count_event(c, trans_restart_would_deadlock);
if (trace_trans_restart_would_deadlock_enabled()) {
struct printbuf buf = PRINTBUF;
buf.atomic++;
print_cycle(&buf, g);
trace_trans_restart_would_deadlock(trans, buf.buf);
printbuf_exit(&buf);
}
}
static int abort_lock(struct lock_graph *g, struct trans_waiting_for_lock *i)
{
if (i == g->g) {
trace_would_deadlock(g, i->trans);
return btree_trans_restart_foreign_task(i->trans,
BCH_ERR_transaction_restart_would_deadlock,
_THIS_IP_);
} else {
i->trans->lock_must_abort = true;
wake_up_process(i->trans->locking_wait.task);
return 0;
}
}
static int btree_trans_abort_preference(struct btree_trans *trans)
{
if (trans->lock_may_not_fail)
return 0;
if (trans->locking_wait.lock_want == SIX_LOCK_write)
return 1;
if (!trans->in_traverse_all)
return 2;
return 3;
}
static noinline __noreturn void break_cycle_fail(struct lock_graph *g)
{
struct printbuf buf = PRINTBUF;
buf.atomic++;
prt_printf(&buf, bch2_fmt(g->g->trans->c, "cycle of nofail locks"));
for (struct trans_waiting_for_lock *i = g->g; i < g->g + g->nr; i++) {
struct btree_trans *trans = i->trans;
bch2_btree_trans_to_text(&buf, trans);
prt_printf(&buf, "backtrace:\n");
printbuf_indent_add(&buf, 2);
bch2_prt_task_backtrace(&buf, trans->locking_wait.task, 2, GFP_NOWAIT);
printbuf_indent_sub(&buf, 2);
prt_newline(&buf);
}
bch2_print_str(g->g->trans->c, KERN_ERR, buf.buf);
printbuf_exit(&buf);
BUG();
}
static noinline int break_cycle(struct lock_graph *g, struct printbuf *cycle,
struct trans_waiting_for_lock *from)
{
struct trans_waiting_for_lock *i, *abort = NULL;
unsigned best = 0, pref;
int ret;
if (lock_graph_remove_non_waiters(g, from))
return 0;
/* Only checking, for debugfs: */
if (cycle) {
print_cycle(cycle, g);
ret = -1;
goto out;
}
for (i = from; i < g->g + g->nr; i++) {
pref = btree_trans_abort_preference(i->trans);
if (pref > best) {
abort = i;
best = pref;
}
}
if (unlikely(!best))
break_cycle_fail(g);
ret = abort_lock(g, abort);
out:
if (ret)
lock_graph_pop_all(g);
else
lock_graph_pop_from(g, abort);
return ret;
}
static int lock_graph_descend(struct lock_graph *g, struct btree_trans *trans,
struct printbuf *cycle)
{
struct btree_trans *orig_trans = g->g->trans;
for (struct trans_waiting_for_lock *i = g->g; i < g->g + g->nr; i++)
if (i->trans == trans) {
closure_put(&trans->ref);
return break_cycle(g, cycle, i);
}
if (unlikely(g->nr == ARRAY_SIZE(g->g))) {
closure_put(&trans->ref);
if (orig_trans->lock_may_not_fail)
return 0;
lock_graph_pop_all(g);
if (cycle)
return 0;
trace_and_count(trans->c, trans_restart_would_deadlock_recursion_limit, trans, _RET_IP_);
return btree_trans_restart(orig_trans, BCH_ERR_transaction_restart_deadlock_recursion_limit);
}
__lock_graph_down(g, trans);
return 0;
}
static bool lock_type_conflicts(enum six_lock_type t1, enum six_lock_type t2)
{
return t1 + t2 > 1;
}
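/*
 * Illustration for lock_type_conflicts() above (not part of the original
 * source). Assuming the six lock types are numbered read = 0, intent = 1,
 * write = 2 (the first two are asserted by the BUILD_BUG_ON()s in
 * btree_locking.h; write = 2 is an assumption here), t1 + t2 > 1 gives:
 *
 *   held    wanted   sum  conflicts?
 *   read    read     0    no
 *   read    intent   1    no
 *   intent  intent   2    yes
 *   read    write    2    yes
 *   intent  write    3    yes
 *   write   write    4    yes
 *
 * The test is symmetric, so swapping held and wanted gives the same answer.
 */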
int bch2_check_for_deadlock(struct btree_trans *trans, struct printbuf *cycle)
{
struct lock_graph g;
struct trans_waiting_for_lock *top;
struct btree_bkey_cached_common *b;
btree_path_idx_t path_idx;
int ret = 0;
g.nr = 0;
if (trans->lock_must_abort && !trans->lock_may_not_fail) {
if (cycle)
return -1;
trace_would_deadlock(&g, trans);
return btree_trans_restart(trans, BCH_ERR_transaction_restart_would_deadlock);
}
lock_graph_down(&g, trans);
/* trans->paths is rcu protected vs. freeing */
guard(rcu)();
if (cycle)
cycle->atomic++;
next:
if (!g.nr)
goto out;
top = &g.g[g.nr - 1];
struct btree_path *paths = rcu_dereference(top->trans->paths);
if (!paths)
goto up;
unsigned long *paths_allocated = trans_paths_allocated(paths);
trans_for_each_path_idx_from(paths_allocated, *trans_paths_nr(paths),
path_idx, top->path_idx) {
struct btree_path *path = paths + path_idx;
if (!path->nodes_locked)
continue;
if (path_idx != top->path_idx) {
top->path_idx = path_idx;
top->level = 0;
top->lock_start_time = 0;
}
for (;
top->level < BTREE_MAX_DEPTH;
top->level++, top->lock_start_time = 0) {
int lock_held = btree_node_locked_type(path, top->level);
if (lock_held == BTREE_NODE_UNLOCKED)
continue;
b = &READ_ONCE(path->l[top->level].b)->c;
if (IS_ERR_OR_NULL(b)) {
/*
* If we get here, it means we raced with the
* other thread updating its btree_path
* structures - which means it can't be blocked
* waiting on a lock:
*/
if (!lock_graph_remove_non_waiters(&g, g.g)) {
/*
* If lock_graph_remove_non_waiters()
* didn't do anything, it must be
* because we're being called by debugfs
* checking for lock cycles, which
* invokes us on btree_transactions that
* aren't actually waiting on anything.
* Just bail out:
*/
lock_graph_pop_all(&g);
}
goto next;
}
if (list_empty_careful(&b->lock.wait_list))
continue;
raw_spin_lock(&b->lock.wait_lock);
list_for_each_entry(trans, &b->lock.wait_list, locking_wait.list) {
BUG_ON(b != trans->locking);
if (top->lock_start_time &&
time_after_eq64(top->lock_start_time, trans->locking_wait.start_time))
continue;
top->lock_start_time = trans->locking_wait.start_time;
/* Don't check for self deadlock: */
if (trans == top->trans ||
!lock_type_conflicts(lock_held, trans->locking_wait.lock_want))
continue;
closure_get(&trans->ref);
raw_spin_unlock(&b->lock.wait_lock);
ret = lock_graph_descend(&g, trans, cycle);
if (ret)
goto out;
goto next;
}
raw_spin_unlock(&b->lock.wait_lock);
}
}
up:
if (g.nr > 1 && cycle)
print_chain(cycle, &g);
lock_graph_up(&g);
goto next;
out:
if (cycle)
--cycle->atomic;
return ret;
}
int bch2_six_check_for_deadlock(struct six_lock *lock, void *p)
{
struct btree_trans *trans = p;
return bch2_check_for_deadlock(trans, NULL);
}
int __bch2_btree_node_lock_write(struct btree_trans *trans, struct btree_path *path,
struct btree_bkey_cached_common *b,
bool lock_may_not_fail)
{
int readers = bch2_btree_node_lock_counts(trans, NULL, b, b->level).n[SIX_LOCK_read];
int ret;
/*
* Must drop our read locks before calling six_lock_write() -
* six_unlock() won't do wakeups until the reader count
* goes to 0, and it's safe because we have the node intent
* locked:
*/
six_lock_readers_add(&b->lock, -readers);
ret = __btree_node_lock_nopath(trans, b, SIX_LOCK_write,
lock_may_not_fail, _RET_IP_);
six_lock_readers_add(&b->lock, readers);
if (ret)
mark_btree_node_locked_noreset(path, b->level, BTREE_NODE_INTENT_LOCKED);
return ret;
}
void bch2_btree_node_lock_write_nofail(struct btree_trans *trans,
struct btree_path *path,
struct btree_bkey_cached_common *b)
{
int ret = __btree_node_lock_write(trans, path, b, true);
BUG_ON(ret);
}
/* relock */
static int btree_path_get_locks(struct btree_trans *trans,
struct btree_path *path,
bool upgrade,
struct get_locks_fail *f,
int restart_err)
{
unsigned l = path->level;
do {
if (!btree_path_node(path, l))
break;
if (!(upgrade
? bch2_btree_node_upgrade(trans, path, l)
: bch2_btree_node_relock(trans, path, l)))
goto err;
l++;
} while (l < path->locks_want);
if (path->uptodate == BTREE_ITER_NEED_RELOCK)
path->uptodate = BTREE_ITER_UPTODATE;
return path->uptodate < BTREE_ITER_NEED_RELOCK ? 0 : -1;
err:
if (f) {
f->l = l;
f->b = path->l[l].b;
}
/*
* Do transaction restart before unlocking, so we don't pop
* should_be_locked asserts
*/
if (restart_err) {
btree_trans_restart(trans, restart_err);
} else if (path->should_be_locked && !trans->restarted) {
if (upgrade)
path->locks_want = l;
return -1;
}
__bch2_btree_path_unlock(trans, path);
btree_path_set_dirty(trans, path, BTREE_ITER_NEED_TRAVERSE);
/*
* When we fail to get a lock, we have to ensure that any child nodes
* can't be relocked so bch2_btree_path_traverse has to walk back up to
* the node that we failed to relock:
*/
do {
path->l[l].b = upgrade
? ERR_PTR(-BCH_ERR_no_btree_node_upgrade)
: ERR_PTR(-BCH_ERR_no_btree_node_relock);
} while (l--);
return -restart_err ?: -1;
}
bool __bch2_btree_node_relock(struct btree_trans *trans,
struct btree_path *path, unsigned level,
bool trace)
{
struct btree *b = btree_path_node(path, level);
int want = __btree_lock_want(path, level);
if (race_fault())
goto fail;
if (six_relock_type(&b->c.lock, want, path->l[level].lock_seq) ||
(btree_node_lock_seq_matches(path, b, level) &&
btree_node_lock_increment(trans, &b->c, level, want))) {
mark_btree_node_locked(trans, path, level, want);
return true;
}
fail:
if (trace && !trans->notrace_relock_fail)
trace_and_count(trans->c, btree_path_relock_fail, trans, _RET_IP_, path, level);
return false;
}
/* upgrade */
bool bch2_btree_node_upgrade(struct btree_trans *trans,
struct btree_path *path, unsigned level)
{
struct btree *b = path->l[level].b;
if (!is_btree_node(path, level))
return false;
switch (btree_lock_want(path, level)) {
case BTREE_NODE_UNLOCKED:
BUG_ON(btree_node_locked(path, level));
return true;
case BTREE_NODE_READ_LOCKED:
BUG_ON(btree_node_intent_locked(path, level));
return bch2_btree_node_relock(trans, path, level);
case BTREE_NODE_INTENT_LOCKED:
break;
case BTREE_NODE_WRITE_LOCKED:
BUG();
}
if (btree_node_intent_locked(path, level))
return true;
if (race_fault())
return false;
if (btree_node_locked(path, level)
? six_lock_tryupgrade(&b->c.lock)
: six_relock_type(&b->c.lock, SIX_LOCK_intent, path->l[level].lock_seq))
goto success;
if (btree_node_lock_seq_matches(path, b, level) &&
btree_node_lock_increment(trans, &b->c, level, BTREE_NODE_INTENT_LOCKED)) {
btree_node_unlock(trans, path, level);
goto success;
}
trace_and_count(trans->c, btree_path_upgrade_fail, trans, _RET_IP_, path, level);
return false;
success:
mark_btree_node_locked_noreset(path, level, BTREE_NODE_INTENT_LOCKED);
return true;
}
/* Btree path locking: */
/*
* Only for btree_cache.c - only relocks intent locks
*/
int bch2_btree_path_relock_intent(struct btree_trans *trans,
struct btree_path *path)
{
unsigned l;
for (l = path->level;
l < path->locks_want && btree_path_node(path, l);
l++) {
if (!bch2_btree_node_relock(trans, path, l)) {
__bch2_btree_path_unlock(trans, path);
btree_path_set_dirty(trans, path, BTREE_ITER_NEED_TRAVERSE);
trace_and_count(trans->c, trans_restart_relock_path_intent, trans, _RET_IP_, path);
return btree_trans_restart(trans, BCH_ERR_transaction_restart_relock_path_intent);
}
}
return 0;
}
__flatten
bool bch2_btree_path_relock_norestart(struct btree_trans *trans, struct btree_path *path)
{
bool ret = !btree_path_get_locks(trans, path, false, NULL, 0);
bch2_trans_verify_locks(trans);
return ret;
}
int __bch2_btree_path_relock(struct btree_trans *trans,
struct btree_path *path, unsigned long trace_ip)
{
if (!bch2_btree_path_relock_norestart(trans, path)) {
trace_and_count(trans->c, trans_restart_relock_path, trans, trace_ip, path);
return btree_trans_restart(trans, BCH_ERR_transaction_restart_relock_path);
}
return 0;
}
bool __bch2_btree_path_upgrade_norestart(struct btree_trans *trans,
struct btree_path *path,
unsigned new_locks_want)
{
path->locks_want = new_locks_want;
/*
* If we need it locked, we can't touch it. Otherwise, we can return
* success - bch2_path_get() will use this path, and it'll just be
* retraversed:
*/
bool ret = !btree_path_get_locks(trans, path, true, NULL, 0) ||
!path->should_be_locked;
bch2_btree_path_verify_locks(trans, path);
return ret;
}
int __bch2_btree_path_upgrade(struct btree_trans *trans,
struct btree_path *path,
unsigned new_locks_want)
{
unsigned old_locks = path->nodes_locked;
unsigned old_locks_want = path->locks_want;
path->locks_want = max_t(unsigned, path->locks_want, new_locks_want);
struct get_locks_fail f = {};
int ret = btree_path_get_locks(trans, path, true, &f,
BCH_ERR_transaction_restart_upgrade);
if (!ret)
goto out;
/*
* XXX: this is ugly - we'd prefer to not be mucking with other
* iterators in the btree_trans here.
*
* On failure to upgrade the iterator, setting iter->locks_want and
* calling get_locks() is sufficient to make bch2_btree_path_traverse()
* get the locks we want on transaction restart.
*
* But if this iterator was a clone, on transaction restart what we did
* to this iterator isn't going to be preserved.
*
* Possibly we could add an iterator field for the parent iterator when
* an iterator is a copy - for now, we'll just upgrade any other
* iterators with the same btree id.
*
* The code below used to be needed to ensure ancestor nodes get locked
* before interior nodes - now that's handled by
* bch2_btree_path_traverse_all().
*/
if (!path->cached && !trans->in_traverse_all) {
struct btree_path *linked;
unsigned i;
trans_for_each_path(trans, linked, i)
if (linked != path &&
linked->cached == path->cached &&
linked->btree_id == path->btree_id &&
linked->locks_want < new_locks_want) {
linked->locks_want = new_locks_want;
btree_path_get_locks(trans, linked, true, NULL, 0);
}
}
count_event(trans->c, trans_restart_upgrade);
if (trace_trans_restart_upgrade_enabled()) {
struct printbuf buf = PRINTBUF;
prt_printf(&buf, "%s %pS\n", trans->fn, (void *) _RET_IP_);
prt_printf(&buf, "btree %s pos\n", bch2_btree_id_str(path->btree_id));
bch2_bpos_to_text(&buf, path->pos);
prt_printf(&buf, "locks want %u -> %u level %u\n",
old_locks_want, new_locks_want, f.l);
prt_printf(&buf, "nodes_locked %x -> %x\n",
old_locks, path->nodes_locked);
prt_printf(&buf, "node %s ", IS_ERR(f.b) ? bch2_err_str(PTR_ERR(f.b)) :
!f.b ? "(null)" : "(node)");
prt_printf(&buf, "path seq %u node seq %u\n",
IS_ERR_OR_NULL(f.b) ? 0 : f.b->c.lock.seq,
path->l[f.l].lock_seq);
trace_trans_restart_upgrade(trans->c, buf.buf);
printbuf_exit(&buf);
}
out:
bch2_trans_verify_locks(trans);
return ret;
}
void __bch2_btree_path_downgrade(struct btree_trans *trans,
struct btree_path *path,
unsigned new_locks_want)
{
unsigned l, old_locks_want = path->locks_want;
if (trans->restarted)
return;
EBUG_ON(path->locks_want < new_locks_want);
path->locks_want = new_locks_want;
while (path->nodes_locked &&
(l = btree_path_highest_level_locked(path)) >= path->locks_want) {
if (l > path->level) {
btree_node_unlock(trans, path, l);
} else {
if (btree_node_intent_locked(path, l)) {
six_lock_downgrade(&path->l[l].b->c.lock);
mark_btree_node_locked_noreset(path, l, BTREE_NODE_READ_LOCKED);
}
break;
}
}
bch2_btree_path_verify_locks(trans, path);
trace_path_downgrade(trans, _RET_IP_, path, old_locks_want);
}
/* Btree transaction locking: */
void bch2_trans_downgrade(struct btree_trans *trans)
{
struct btree_path *path;
unsigned i;
if (trans->restarted)
return;
trans_for_each_path(trans, path, i)
if (path->ref)
bch2_btree_path_downgrade(trans, path);
}
static inline void __bch2_trans_unlock(struct btree_trans *trans)
{
struct btree_path *path;
unsigned i;
trans_for_each_path(trans, path, i)
__bch2_btree_path_unlock(trans, path);
}
static noinline __cold void bch2_trans_relock_fail(struct btree_trans *trans, struct btree_path *path,
struct get_locks_fail *f, bool trace, ulong ip)
{
if (!trace)
goto out;
if (trace_trans_restart_relock_enabled()) {
struct printbuf buf = PRINTBUF;
bch2_bpos_to_text(&buf, path->pos);
prt_printf(&buf, " %s l=%u seq=%u node seq=",
bch2_btree_id_str(path->btree_id),
f->l, path->l[f->l].lock_seq);
if (IS_ERR_OR_NULL(f->b)) {
prt_str(&buf, bch2_err_str(PTR_ERR(f->b)));
} else {
prt_printf(&buf, "%u", f->b->c.lock.seq);
struct six_lock_count c =
bch2_btree_node_lock_counts(trans, NULL, &f->b->c, f->l);
prt_printf(&buf, " self locked %u.%u.%u", c.n[0], c.n[1], c.n[2]);
c = six_lock_counts(&f->b->c.lock);
prt_printf(&buf, " total locked %u.%u.%u", c.n[0], c.n[1], c.n[2]);
}
trace_trans_restart_relock(trans, ip, buf.buf);
printbuf_exit(&buf);
}
count_event(trans->c, trans_restart_relock);
out:
__bch2_trans_unlock(trans);
bch2_trans_verify_locks(trans);
}
static inline int __bch2_trans_relock(struct btree_trans *trans, bool trace, ulong ip)
{
bch2_trans_verify_locks(trans);
if (unlikely(trans->restarted))
return -((int) trans->restarted);
if (unlikely(trans->locked))
goto out;
struct btree_path *path;
unsigned i;
trans_for_each_path(trans, path, i) {
struct get_locks_fail f;
int ret;
if (path->should_be_locked &&
(ret = btree_path_get_locks(trans, path, false, &f,
BCH_ERR_transaction_restart_relock))) {
bch2_trans_relock_fail(trans, path, &f, trace, ip);
return ret;
}
}
trans_set_locked(trans, true);
out:
bch2_trans_verify_locks(trans);
return 0;
}
int bch2_trans_relock(struct btree_trans *trans)
{
return __bch2_trans_relock(trans, true, _RET_IP_);
}
int bch2_trans_relock_notrace(struct btree_trans *trans)
{
return __bch2_trans_relock(trans, false, _RET_IP_);
}
void bch2_trans_unlock(struct btree_trans *trans)
{
trans_set_unlocked(trans);
__bch2_trans_unlock(trans);
}
void bch2_trans_unlock_long(struct btree_trans *trans)
{
bch2_trans_unlock(trans);
bch2_trans_srcu_unlock(trans);
}
void bch2_trans_unlock_write(struct btree_trans *trans)
{
struct btree_path *path;
unsigned i;
trans_for_each_path(trans, path, i)
for (unsigned l = 0; l < BTREE_MAX_DEPTH; l++)
if (btree_node_write_locked(path, l))
bch2_btree_node_unlock_write(trans, path, path->l[l].b);
}
int __bch2_trans_mutex_lock(struct btree_trans *trans,
struct mutex *lock)
{
int ret = drop_locks_do(trans, (mutex_lock(lock), 0));
if (ret)
mutex_unlock(lock);
return ret;
}
/* Debug */
void __bch2_btree_path_verify_locks(struct btree_trans *trans, struct btree_path *path)
{
if (!path->nodes_locked && btree_path_node(path, path->level)) {
/*
* A path may be uptodate and yet have nothing locked if and only if
* there is no node at path->level, which generally means we were
* iterating over all nodes and got to the end of the btree
*/
BUG_ON(path->uptodate == BTREE_ITER_UPTODATE);
BUG_ON(path->should_be_locked && trans->locked && !trans->restarted);
}
if (!path->nodes_locked)
return;
for (unsigned l = 0; l < BTREE_MAX_DEPTH; l++) {
int want = btree_lock_want(path, l);
int have = btree_node_locked_type_nowrite(path, l);
BUG_ON(!is_btree_node(path, l) && have != BTREE_NODE_UNLOCKED);
BUG_ON(is_btree_node(path, l) && want != have);
BUG_ON(btree_node_locked(path, l) &&
path->l[l].lock_seq != six_lock_seq(&path->l[l].b->c.lock));
}
}
static bool bch2_trans_locked(struct btree_trans *trans)
{
struct btree_path *path;
unsigned i;
trans_for_each_path(trans, path, i)
if (path->nodes_locked)
return true;
return false;
}
void __bch2_trans_verify_locks(struct btree_trans *trans)
{
if (!trans->locked) {
BUG_ON(bch2_trans_locked(trans));
return;
}
struct btree_path *path;
unsigned i;
trans_for_each_path(trans, path, i)
__bch2_btree_path_verify_locks(trans, path);
}

@ -1,466 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_LOCKING_H
#define _BCACHEFS_BTREE_LOCKING_H
/*
* Only for internal btree use:
*
* The btree iterator tracks what locks it wants to take, and what locks it
* currently has - here we have wrappers for locking/unlocking btree nodes and
* updating the iterator state
*/
#include "btree_iter.h"
#include "six.h"
void bch2_btree_lock_init(struct btree_bkey_cached_common *, enum six_lock_init_flags, gfp_t gfp);
void bch2_trans_unlock_write(struct btree_trans *);
static inline bool is_btree_node(struct btree_path *path, unsigned l)
{
return l < BTREE_MAX_DEPTH && !IS_ERR_OR_NULL(path->l[l].b);
}
static inline struct btree_transaction_stats *btree_trans_stats(struct btree_trans *trans)
{
return trans->fn_idx < ARRAY_SIZE(trans->c->btree_transaction_stats)
? &trans->c->btree_transaction_stats[trans->fn_idx]
: NULL;
}
/* matches six lock types */
enum btree_node_locked_type {
BTREE_NODE_UNLOCKED = -1,
BTREE_NODE_READ_LOCKED = SIX_LOCK_read,
BTREE_NODE_INTENT_LOCKED = SIX_LOCK_intent,
BTREE_NODE_WRITE_LOCKED = SIX_LOCK_write,
};
static inline int btree_node_locked_type(struct btree_path *path,
unsigned level)
{
return BTREE_NODE_UNLOCKED + ((path->nodes_locked >> (level << 1)) & 3);
}
static inline int btree_node_locked_type_nowrite(struct btree_path *path,
unsigned level)
{
int have = btree_node_locked_type(path, level);
return have == BTREE_NODE_WRITE_LOCKED
? BTREE_NODE_INTENT_LOCKED
: have;
}
static inline bool btree_node_write_locked(struct btree_path *path, unsigned l)
{
return btree_node_locked_type(path, l) == BTREE_NODE_WRITE_LOCKED;
}
static inline bool btree_node_intent_locked(struct btree_path *path, unsigned l)
{
return btree_node_locked_type(path, l) == BTREE_NODE_INTENT_LOCKED;
}
static inline bool btree_node_read_locked(struct btree_path *path, unsigned l)
{
return btree_node_locked_type(path, l) == BTREE_NODE_READ_LOCKED;
}
static inline bool btree_node_locked(struct btree_path *path, unsigned level)
{
return btree_node_locked_type(path, level) != BTREE_NODE_UNLOCKED;
}
static inline void mark_btree_node_locked_noreset(struct btree_path *path,
unsigned level,
enum btree_node_locked_type type)
{
/* relying on this to avoid a branch */
BUILD_BUG_ON(SIX_LOCK_read != 0);
BUILD_BUG_ON(SIX_LOCK_intent != 1);
path->nodes_locked &= ~(3U << (level << 1));
path->nodes_locked |= (type + 1) << (level << 1);
}
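/*
 * Worked example of the nodes_locked encoding (editorial illustration, not
 * part of the original source). Each level owns two bits holding type + 1,
 * so a field value of 0 means unlocked:
 *
 *   level 0 intent locked: (1 + 1) << 0 = 0x2
 *   level 1 read locked:   (0 + 1) << 2 = 0x4
 *   nodes_locked                        = 0x6
 *
 * btree_node_locked_type() then decodes:
 *
 *   level 0: -1 + ((0x6 >> 0) & 3) =  1 (BTREE_NODE_INTENT_LOCKED)
 *   level 1: -1 + ((0x6 >> 2) & 3) =  0 (BTREE_NODE_READ_LOCKED)
 *   level 2: -1 + ((0x6 >> 4) & 3) = -1 (BTREE_NODE_UNLOCKED)
 */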
static inline void mark_btree_node_locked(struct btree_trans *trans,
struct btree_path *path,
unsigned level,
enum btree_node_locked_type type)
{
mark_btree_node_locked_noreset(path, level, (enum btree_node_locked_type) type);
#ifdef CONFIG_BCACHEFS_LOCK_TIME_STATS
path->l[level].lock_taken_time = local_clock();
#endif
}
static inline enum six_lock_type __btree_lock_want(struct btree_path *path, int level)
{
return level < path->locks_want
? SIX_LOCK_intent
: SIX_LOCK_read;
}
static inline enum btree_node_locked_type
btree_lock_want(struct btree_path *path, int level)
{
if (level < path->level)
return BTREE_NODE_UNLOCKED;
if (level < path->locks_want)
return BTREE_NODE_INTENT_LOCKED;
if (level == path->level)
return BTREE_NODE_READ_LOCKED;
return BTREE_NODE_UNLOCKED;
}
static void btree_trans_lock_hold_time_update(struct btree_trans *trans,
struct btree_path *path, unsigned level)
{
#ifdef CONFIG_BCACHEFS_LOCK_TIME_STATS
__bch2_time_stats_update(&btree_trans_stats(trans)->lock_hold_times,
path->l[level].lock_taken_time,
local_clock());
#endif
}
/* unlock: */
void bch2_btree_node_unlock_write(struct btree_trans *,
struct btree_path *, struct btree *);
static inline void btree_node_unlock(struct btree_trans *trans,
struct btree_path *path, unsigned level)
{
int lock_type = btree_node_locked_type(path, level);
EBUG_ON(level >= BTREE_MAX_DEPTH);
if (lock_type != BTREE_NODE_UNLOCKED) {
if (unlikely(lock_type == BTREE_NODE_WRITE_LOCKED)) {
bch2_btree_node_unlock_write(trans, path, path->l[level].b);
lock_type = BTREE_NODE_INTENT_LOCKED;
}
six_unlock_type(&path->l[level].b->c.lock, lock_type);
btree_trans_lock_hold_time_update(trans, path, level);
mark_btree_node_locked_noreset(path, level, BTREE_NODE_UNLOCKED);
}
}
static inline int btree_path_lowest_level_locked(struct btree_path *path)
{
return __ffs(path->nodes_locked) >> 1;
}
static inline int btree_path_highest_level_locked(struct btree_path *path)
{
return __fls(path->nodes_locked) >> 1;
}
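/*
 * Illustration (not part of the original source): nodes_locked uses two bits
 * per level, so a bit index shifted right by one is a level. With
 * nodes_locked == 0x6 (level 0 intent locked, level 1 read locked):
 *
 *   __ffs(0x6) == 1, 1 >> 1 == 0: lowest locked level is 0
 *   __fls(0x6) == 2, 2 >> 1 == 1: highest locked level is 1
 *
 * __ffs() and __fls() are undefined for 0, which is why the callers shown
 * here test that nodes_locked is nonzero before using these helpers.
 */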
static inline void __bch2_btree_path_unlock(struct btree_trans *trans,
struct btree_path *path)
{
btree_path_set_dirty(trans, path, BTREE_ITER_NEED_RELOCK);
while (path->nodes_locked)
btree_node_unlock(trans, path, btree_path_lowest_level_locked(path));
}
/*
* Updates the saved lock sequence number, so that bch2_btree_node_relock() will
* succeed:
*/
static inline void
__bch2_btree_node_unlock_write(struct btree_trans *trans, struct btree *b)
{
if (!b->c.lock.write_lock_recurse) {
struct btree_path *linked;
unsigned i;
trans_for_each_path_with_node(trans, b, linked, i)
linked->l[b->c.level].lock_seq++;
}
six_unlock_write(&b->c.lock);
}
static inline void
bch2_btree_node_unlock_write_inlined(struct btree_trans *trans, struct btree_path *path,
struct btree *b)
{
EBUG_ON(path->l[b->c.level].b != b);
EBUG_ON(path->l[b->c.level].lock_seq != six_lock_seq(&b->c.lock));
EBUG_ON(btree_node_locked_type(path, b->c.level) != SIX_LOCK_write);
mark_btree_node_locked_noreset(path, b->c.level, BTREE_NODE_INTENT_LOCKED);
__bch2_btree_node_unlock_write(trans, b);
}
int bch2_six_check_for_deadlock(struct six_lock *lock, void *p);
/* lock: */
static inline void trans_set_locked(struct btree_trans *trans, bool try)
{
if (!trans->locked) {
lock_acquire_exclusive(&trans->dep_map, 0, try, NULL, _THIS_IP_);
trans->locked = true;
trans->last_unlock_ip = 0;
trans->pf_memalloc_nofs = (current->flags & PF_MEMALLOC_NOFS) != 0;
current->flags |= PF_MEMALLOC_NOFS;
}
}
static inline void trans_set_unlocked(struct btree_trans *trans)
{
if (trans->locked) {
lock_release(&trans->dep_map, _THIS_IP_);
trans->locked = false;
trans->last_unlock_ip = _RET_IP_;
if (!trans->pf_memalloc_nofs)
current->flags &= ~PF_MEMALLOC_NOFS;
}
}
static inline int __btree_node_lock_nopath(struct btree_trans *trans,
struct btree_bkey_cached_common *b,
enum six_lock_type type,
bool lock_may_not_fail,
unsigned long ip)
{
trans->lock_may_not_fail = lock_may_not_fail;
trans->lock_must_abort = false;
trans->locking = b;
int ret = six_lock_ip_waiter(&b->lock, type, &trans->locking_wait,
bch2_six_check_for_deadlock, trans, ip);
WRITE_ONCE(trans->locking, NULL);
WRITE_ONCE(trans->locking_wait.start_time, 0);
if (!ret)
trace_btree_path_lock(trans, _THIS_IP_, b);
return ret;
}
static inline int __must_check
btree_node_lock_nopath(struct btree_trans *trans,
struct btree_bkey_cached_common *b,
enum six_lock_type type,
unsigned long ip)
{
return __btree_node_lock_nopath(trans, b, type, false, ip);
}
static inline void btree_node_lock_nopath_nofail(struct btree_trans *trans,
struct btree_bkey_cached_common *b,
enum six_lock_type type)
{
int ret = __btree_node_lock_nopath(trans, b, type, true, _THIS_IP_);
BUG_ON(ret);
}
/*
* Lock a btree node if we already have it locked on one of our linked
* iterators:
*/
static inline bool btree_node_lock_increment(struct btree_trans *trans,
struct btree_bkey_cached_common *b,
unsigned level,
enum btree_node_locked_type want)
{
struct btree_path *path;
unsigned i;
trans_for_each_path(trans, path, i)
if (&path->l[level].b->c == b &&
btree_node_locked_type(path, level) >= want) {
six_lock_increment(&b->lock, (enum six_lock_type) want);
return true;
}
return false;
}
static inline int btree_node_lock(struct btree_trans *trans,
struct btree_path *path,
struct btree_bkey_cached_common *b,
unsigned level,
enum six_lock_type type,
unsigned long ip)
{
int ret = 0;
EBUG_ON(level >= BTREE_MAX_DEPTH);
bch2_trans_verify_not_unlocked_or_in_restart(trans);
if (likely(six_trylock_type(&b->lock, type)) ||
btree_node_lock_increment(trans, b, level, (enum btree_node_locked_type) type) ||
!(ret = btree_node_lock_nopath(trans, b, type, btree_path_ip_allocated(path)))) {
#ifdef CONFIG_BCACHEFS_LOCK_TIME_STATS
path->l[b->level].lock_taken_time = local_clock();
#endif
}
return ret;
}
int __bch2_btree_node_lock_write(struct btree_trans *, struct btree_path *,
struct btree_bkey_cached_common *b, bool);
static inline int __btree_node_lock_write(struct btree_trans *trans,
struct btree_path *path,
struct btree_bkey_cached_common *b,
bool lock_may_not_fail)
{
EBUG_ON(&path->l[b->level].b->c != b);
EBUG_ON(path->l[b->level].lock_seq != six_lock_seq(&b->lock));
EBUG_ON(!btree_node_intent_locked(path, b->level));
/*
* six locks are unfair, and read locks block while a thread wants a
* write lock: thus, we need to tell the cycle detector we have a write
* lock _before_ taking the lock:
*/
mark_btree_node_locked_noreset(path, b->level, BTREE_NODE_WRITE_LOCKED);
return likely(six_trylock_write(&b->lock))
? 0
: __bch2_btree_node_lock_write(trans, path, b, lock_may_not_fail);
}
static inline int __must_check
bch2_btree_node_lock_write(struct btree_trans *trans,
struct btree_path *path,
struct btree_bkey_cached_common *b)
{
return __btree_node_lock_write(trans, path, b, false);
}
void bch2_btree_node_lock_write_nofail(struct btree_trans *,
struct btree_path *,
struct btree_bkey_cached_common *);
/* relock: */
bool bch2_btree_path_relock_norestart(struct btree_trans *, struct btree_path *);
int __bch2_btree_path_relock(struct btree_trans *,
struct btree_path *, unsigned long);
static inline int bch2_btree_path_relock(struct btree_trans *trans,
struct btree_path *path, unsigned long trace_ip)
{
return btree_node_locked(path, path->level)
? 0
: __bch2_btree_path_relock(trans, path, trace_ip);
}
bool __bch2_btree_node_relock(struct btree_trans *, struct btree_path *, unsigned, bool trace);
static inline bool bch2_btree_node_relock(struct btree_trans *trans,
struct btree_path *path, unsigned level)
{
EBUG_ON(btree_node_locked(path, level) &&
!btree_node_write_locked(path, level) &&
btree_node_locked_type(path, level) != __btree_lock_want(path, level));
return likely(btree_node_locked(path, level)) ||
(!IS_ERR_OR_NULL(path->l[level].b) &&
__bch2_btree_node_relock(trans, path, level, true));
}
static inline bool bch2_btree_node_relock_notrace(struct btree_trans *trans,
struct btree_path *path, unsigned level)
{
EBUG_ON(btree_node_locked(path, level) &&
btree_node_locked_type_nowrite(path, level) !=
__btree_lock_want(path, level));
return likely(btree_node_locked(path, level)) ||
(!IS_ERR_OR_NULL(path->l[level].b) &&
__bch2_btree_node_relock(trans, path, level, false));
}
/* upgrade */
bool __bch2_btree_path_upgrade_norestart(struct btree_trans *, struct btree_path *, unsigned);
static inline bool bch2_btree_path_upgrade_norestart(struct btree_trans *trans,
struct btree_path *path,
unsigned new_locks_want)
{
return new_locks_want > path->locks_want
? __bch2_btree_path_upgrade_norestart(trans, path, new_locks_want)
: true;
}
int __bch2_btree_path_upgrade(struct btree_trans *,
struct btree_path *, unsigned);
static inline int bch2_btree_path_upgrade(struct btree_trans *trans,
struct btree_path *path,
unsigned new_locks_want)
{
new_locks_want = min(new_locks_want, BTREE_MAX_DEPTH);
return likely(path->locks_want >= new_locks_want && path->nodes_locked)
? 0
: __bch2_btree_path_upgrade(trans, path, new_locks_want);
}
/* misc: */
static inline void btree_path_set_should_be_locked(struct btree_trans *trans, struct btree_path *path)
{
EBUG_ON(!btree_node_locked(path, path->level));
EBUG_ON(path->uptodate);
if (!path->should_be_locked) {
path->should_be_locked = true;
trace_btree_path_should_be_locked(trans, path);
}
}
static inline void __btree_path_set_level_up(struct btree_trans *trans,
struct btree_path *path,
unsigned l)
{
btree_node_unlock(trans, path, l);
path->l[l].b = ERR_PTR(-BCH_ERR_no_btree_node_up);
}
static inline void btree_path_set_level_up(struct btree_trans *trans,
struct btree_path *path)
{
__btree_path_set_level_up(trans, path, path->level++);
btree_path_set_dirty(trans, path, BTREE_ITER_NEED_TRAVERSE);
}
/* debug */
struct six_lock_count bch2_btree_node_lock_counts(struct btree_trans *,
struct btree_path *,
struct btree_bkey_cached_common *b,
unsigned);
int bch2_check_for_deadlock(struct btree_trans *, struct printbuf *);
void __bch2_btree_path_verify_locks(struct btree_trans *, struct btree_path *);
void __bch2_trans_verify_locks(struct btree_trans *);
static inline void bch2_btree_path_verify_locks(struct btree_trans *trans,
struct btree_path *path)
{
if (static_branch_unlikely(&bch2_debug_check_btree_locking))
__bch2_btree_path_verify_locks(trans, path);
}
static inline void bch2_trans_verify_locks(struct btree_trans *trans)
{
if (static_branch_unlikely(&bch2_debug_check_btree_locking))
__bch2_trans_verify_locks(trans);
}
#endif /* _BCACHEFS_BTREE_LOCKING_H */

@ -1,611 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
#include "bcachefs.h"
#include "btree_cache.h"
#include "btree_io.h"
#include "btree_journal_iter.h"
#include "btree_node_scan.h"
#include "btree_update_interior.h"
#include "buckets.h"
#include "error.h"
#include "journal_io.h"
#include "recovery_passes.h"
#include <linux/kthread.h>
#include <linux/min_heap.h>
#include <linux/sched/sysctl.h>
#include <linux/sort.h>
struct find_btree_nodes_worker {
struct closure *cl;
struct find_btree_nodes *f;
struct bch_dev *ca;
};
static void found_btree_node_to_text(struct printbuf *out, struct bch_fs *c, const struct found_btree_node *n)
{
bch2_btree_id_level_to_text(out, n->btree_id, n->level);
prt_printf(out, " seq=%u journal_seq=%llu cookie=%llx ",
n->seq, n->journal_seq, n->cookie);
bch2_bpos_to_text(out, n->min_key);
prt_str(out, "-");
bch2_bpos_to_text(out, n->max_key);
if (n->range_updated)
prt_str(out, " range updated");
for (unsigned i = 0; i < n->nr_ptrs; i++) {
prt_char(out, ' ');
bch2_extent_ptr_to_text(out, c, n->ptrs + i);
}
}
static void found_btree_nodes_to_text(struct printbuf *out, struct bch_fs *c, found_btree_nodes nodes)
{
printbuf_indent_add(out, 2);
darray_for_each(nodes, i) {
found_btree_node_to_text(out, c, i);
prt_newline(out);
}
printbuf_indent_sub(out, 2);
}
static void found_btree_node_to_key(struct bkey_i *k, const struct found_btree_node *f)
{
struct bkey_i_btree_ptr_v2 *bp = bkey_btree_ptr_v2_init(k);
set_bkey_val_u64s(&bp->k, sizeof(struct bch_btree_ptr_v2) / sizeof(u64) + f->nr_ptrs);
bp->k.p = f->max_key;
bp->v.seq = cpu_to_le64(f->cookie);
bp->v.sectors_written = 0;
bp->v.flags = 0;
bp->v.sectors_written = cpu_to_le16(f->sectors_written);
bp->v.min_key = f->min_key;
SET_BTREE_PTR_RANGE_UPDATED(&bp->v, f->range_updated);
memcpy(bp->v.start, f->ptrs, sizeof(struct bch_extent_ptr) * f->nr_ptrs);
}
static inline u64 bkey_journal_seq(struct bkey_s_c k)
{
switch (k.k->type) {
case KEY_TYPE_inode_v3:
return le64_to_cpu(bkey_s_c_to_inode_v3(k).v->bi_journal_seq);
default:
return 0;
}
}
static int found_btree_node_cmp_cookie(const void *_l, const void *_r)
{
const struct found_btree_node *l = _l;
const struct found_btree_node *r = _r;
return cmp_int(l->btree_id, r->btree_id) ?:
cmp_int(l->level, r->level) ?:
cmp_int(l->cookie, r->cookie);
}
/*
* Given two found btree nodes, if their sequence numbers are equal, take the
* one that's readable:
*/
static int found_btree_node_cmp_time(const struct found_btree_node *l,
const struct found_btree_node *r)
{
return cmp_int(l->seq, r->seq) ?:
cmp_int(l->journal_seq, r->journal_seq);
}
static int found_btree_node_cmp_pos(const void *_l, const void *_r)
{
const struct found_btree_node *l = _l;
const struct found_btree_node *r = _r;
return cmp_int(l->btree_id, r->btree_id) ?:
-cmp_int(l->level, r->level) ?:
bpos_cmp(l->min_key, r->min_key) ?:
-found_btree_node_cmp_time(l, r);
}
static inline bool found_btree_node_cmp_pos_less(const void *l, const void *r, void *arg)
{
return found_btree_node_cmp_pos(l, r) < 0;
}
static inline void found_btree_node_swap(void *_l, void *_r, void *arg)
{
struct found_btree_node *l = _l;
struct found_btree_node *r = _r;
swap(*l, *r);
}
static const struct min_heap_callbacks found_btree_node_heap_cbs = {
.less = found_btree_node_cmp_pos_less,
.swp = found_btree_node_swap,
};
static void try_read_btree_node(struct find_btree_nodes *f, struct bch_dev *ca,
struct btree *b, struct bio *bio, u64 offset)
{
struct bch_fs *c = container_of(f, struct bch_fs, found_btree_nodes);
struct btree_node *bn = b->data;
bio_reset(bio, ca->disk_sb.bdev, REQ_OP_READ);
bio->bi_iter.bi_sector = offset;
bch2_bio_map(bio, b->data, c->opts.block_size);
u64 submit_time = local_clock();
submit_bio_wait(bio);
bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read, submit_time, !bio->bi_status);
if (bio->bi_status) {
bch_err_dev_ratelimited(ca,
"IO error in try_read_btree_node() at %llu: %s",
offset, bch2_blk_status_to_str(bio->bi_status));
return;
}
if (le64_to_cpu(bn->magic) != bset_magic(c))
return;
if (bch2_csum_type_is_encryption(BSET_CSUM_TYPE(&bn->keys))) {
if (!c->chacha20_key_set)
return;
struct nonce nonce = btree_nonce(&bn->keys, 0);
unsigned bytes = (void *) &bn->keys - (void *) &bn->flags;
bch2_encrypt(c, BSET_CSUM_TYPE(&bn->keys), nonce, &bn->flags, bytes);
}
if (btree_id_is_alloc(BTREE_NODE_ID(bn)))
return;
if (BTREE_NODE_LEVEL(bn) >= BTREE_MAX_DEPTH)
return;
if (BTREE_NODE_ID(bn) >= BTREE_ID_NR_MAX)
return;
rcu_read_lock();
struct found_btree_node n = {
.btree_id = BTREE_NODE_ID(bn),
.level = BTREE_NODE_LEVEL(bn),
.seq = BTREE_NODE_SEQ(bn),
.cookie = le64_to_cpu(bn->keys.seq),
.min_key = bn->min_key,
.max_key = bn->max_key,
.nr_ptrs = 1,
.ptrs[0].type = 1 << BCH_EXTENT_ENTRY_ptr,
.ptrs[0].offset = offset,
.ptrs[0].dev = ca->dev_idx,
.ptrs[0].gen = bucket_gen_get(ca, sector_to_bucket(ca, offset)),
};
rcu_read_unlock();
bio_reset(bio, ca->disk_sb.bdev, REQ_OP_READ);
bio->bi_iter.bi_sector = offset;
bch2_bio_map(bio, b->data, c->opts.btree_node_size);
submit_time = local_clock();
submit_bio_wait(bio);
bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read, submit_time, !bio->bi_status);
found_btree_node_to_key(&b->key, &n);
CLASS(printbuf, buf)();
if (!bch2_btree_node_read_done(c, ca, b, NULL, &buf)) {
/* read_done will swap out b->data for another buffer */
bn = b->data;
/*
* Grab journal_seq here because we want the max journal_seq of
* any bset; read_done sorts down to a single set and picks the
* max journal_seq
*/
n.journal_seq = le64_to_cpu(bn->keys.journal_seq);
n.sectors_written = b->written;
mutex_lock(&f->lock);
if (BSET_BIG_ENDIAN(&bn->keys) != CPU_BIG_ENDIAN) {
bch_err(c, "try_read_btree_node() can't handle endian conversion");
f->ret = -EINVAL;
goto unlock;
}
if (darray_push(&f->nodes, n))
f->ret = -ENOMEM;
unlock:
mutex_unlock(&f->lock);
}
}
static int read_btree_nodes_worker(void *p)
{
struct find_btree_nodes_worker *w = p;
struct bch_fs *c = container_of(w->f, struct bch_fs, found_btree_nodes);
struct bch_dev *ca = w->ca;
unsigned long last_print = jiffies;
struct btree *b = NULL;
struct bio *bio = NULL;
b = __bch2_btree_node_mem_alloc(c);
if (!b) {
bch_err(c, "read_btree_nodes_worker: error allocating buf");
w->f->ret = -ENOMEM;
goto err;
}
bio = bio_alloc(NULL, buf_pages(b->data, c->opts.btree_node_size), 0, GFP_KERNEL);
if (!bio) {
bch_err(c, "read_btree_nodes_worker: error allocating bio");
w->f->ret = -ENOMEM;
goto err;
}
for (u64 bucket = ca->mi.first_bucket; bucket < ca->mi.nbuckets; bucket++)
for (unsigned bucket_offset = 0;
bucket_offset + btree_sectors(c) <= ca->mi.bucket_size;
bucket_offset += btree_sectors(c)) {
if (time_after(jiffies, last_print + HZ * 30)) {
u64 cur_sector = bucket * ca->mi.bucket_size + bucket_offset;
u64 end_sector = ca->mi.nbuckets * ca->mi.bucket_size;
bch_info(ca, "%s: %2u%% done", __func__,
(unsigned) div64_u64(cur_sector * 100, end_sector));
last_print = jiffies;
}
u64 sector = bucket * ca->mi.bucket_size + bucket_offset;
if (c->sb.version_upgrade_complete >= bcachefs_metadata_version_mi_btree_bitmap &&
!bch2_dev_btree_bitmap_marked_sectors(ca, sector, btree_sectors(c)))
continue;
try_read_btree_node(w->f, ca, b, bio, sector);
}
err:
if (b)
__btree_node_data_free(b);
kfree(b);
bio_put(bio);
enumerated_ref_put(&ca->io_ref[READ], BCH_DEV_READ_REF_btree_node_scan);
closure_put(w->cl);
kfree(w);
return 0;
}
static int read_btree_nodes(struct find_btree_nodes *f)
{
struct bch_fs *c = container_of(f, struct bch_fs, found_btree_nodes);
struct closure cl;
int ret = 0;
closure_init_stack(&cl);
for_each_online_member(c, ca, BCH_DEV_READ_REF_btree_node_scan) {
if (!(ca->mi.data_allowed & BIT(BCH_DATA_btree)))
continue;
struct find_btree_nodes_worker *w = kmalloc(sizeof(*w), GFP_KERNEL);
if (!w) {
enumerated_ref_put(&ca->io_ref[READ], BCH_DEV_READ_REF_btree_node_scan);
ret = -ENOMEM;
goto err;
}
w->cl = &cl;
w->f = f;
w->ca = ca;
struct task_struct *t = kthread_create(read_btree_nodes_worker, w, "read_btree_nodes/%s", ca->name);
ret = PTR_ERR_OR_ZERO(t);
if (ret) {
enumerated_ref_put(&ca->io_ref[READ], BCH_DEV_READ_REF_btree_node_scan);
kfree(w);
bch_err_msg(c, ret, "starting kthread");
break;
}
closure_get(&cl);
enumerated_ref_get(&ca->io_ref[READ], BCH_DEV_READ_REF_btree_node_scan);
wake_up_process(t);
}
err:
while (closure_sync_timeout(&cl, sysctl_hung_task_timeout_secs * HZ / 2))
;
return f->ret ?: ret;
}
static bool nodes_overlap(const struct found_btree_node *l,
const struct found_btree_node *r)
{
return (l->btree_id == r->btree_id &&
l->level == r->level &&
bpos_gt(l->max_key, r->min_key));
}
static int handle_overwrites(struct bch_fs *c,
struct found_btree_node *l,
found_btree_nodes *nodes_heap)
{
struct found_btree_node *r;
while ((r = min_heap_peek(nodes_heap)) &&
nodes_overlap(l, r)) {
int cmp = found_btree_node_cmp_time(l, r);
if (cmp > 0) {
if (bpos_cmp(l->max_key, r->max_key) >= 0)
min_heap_pop(nodes_heap, &found_btree_node_heap_cbs, NULL);
else {
r->range_updated = true;
r->min_key = bpos_successor(l->max_key);
r->range_updated = true;
min_heap_sift_down(nodes_heap, 0, &found_btree_node_heap_cbs, NULL);
}
} else if (cmp < 0) {
BUG_ON(bpos_eq(l->min_key, r->min_key));
l->max_key = bpos_predecessor(r->min_key);
l->range_updated = true;
} else if (r->level) {
min_heap_pop(nodes_heap, &found_btree_node_heap_cbs, NULL);
} else {
if (bpos_cmp(l->max_key, r->max_key) >= 0)
min_heap_pop(nodes_heap, &found_btree_node_heap_cbs, NULL);
else {
r->range_updated = true;
r->min_key = bpos_successor(l->max_key);
r->range_updated = true;
min_heap_sift_down(nodes_heap, 0, &found_btree_node_heap_cbs, NULL);
}
}
cond_resched();
}
return 0;
}
int bch2_scan_for_btree_nodes(struct bch_fs *c)
{
struct find_btree_nodes *f = &c->found_btree_nodes;
struct printbuf buf = PRINTBUF;
found_btree_nodes nodes_heap = {};
size_t dst;
int ret = 0;
if (f->nodes.nr)
return 0;
mutex_init(&f->lock);
ret = read_btree_nodes(f);
if (ret)
return ret;
if (!f->nodes.nr) {
bch_err(c, "%s: no btree nodes found", __func__);
ret = -EINVAL;
goto err;
}
if (0 && c->opts.verbose) {
printbuf_reset(&buf);
prt_printf(&buf, "%s: nodes found:\n", __func__);
found_btree_nodes_to_text(&buf, c, f->nodes);
bch2_print_str(c, KERN_INFO, buf.buf);
}
sort_nonatomic(f->nodes.data, f->nodes.nr, sizeof(f->nodes.data[0]), found_btree_node_cmp_cookie, NULL);
dst = 0;
darray_for_each(f->nodes, i) {
struct found_btree_node *prev = dst ? f->nodes.data + dst - 1 : NULL;
if (prev &&
prev->cookie == i->cookie) {
if (prev->nr_ptrs == ARRAY_SIZE(prev->ptrs)) {
bch_err(c, "%s: found too many replicas for btree node", __func__);
ret = -EINVAL;
goto err;
}
prev->ptrs[prev->nr_ptrs++] = i->ptrs[0];
} else {
f->nodes.data[dst++] = *i;
}
}
f->nodes.nr = dst;
sort_nonatomic(f->nodes.data, f->nodes.nr, sizeof(f->nodes.data[0]), found_btree_node_cmp_pos, NULL);
if (0 && c->opts.verbose) {
printbuf_reset(&buf);
prt_printf(&buf, "%s: nodes after merging replicas:\n", __func__);
found_btree_nodes_to_text(&buf, c, f->nodes);
bch2_print_str(c, KERN_INFO, buf.buf);
}
swap(nodes_heap, f->nodes);
{
/* darray must have same layout as a heap */
min_heap_char real_heap;
BUILD_BUG_ON(sizeof(nodes_heap.nr) != sizeof(real_heap.nr));
BUILD_BUG_ON(sizeof(nodes_heap.size) != sizeof(real_heap.size));
BUILD_BUG_ON(offsetof(found_btree_nodes, nr) != offsetof(min_heap_char, nr));
BUILD_BUG_ON(offsetof(found_btree_nodes, size) != offsetof(min_heap_char, size));
}
min_heapify_all(&nodes_heap, &found_btree_node_heap_cbs, NULL);
if (nodes_heap.nr) {
ret = darray_push(&f->nodes, *min_heap_peek(&nodes_heap));
if (ret)
goto err;
min_heap_pop(&nodes_heap, &found_btree_node_heap_cbs, NULL);
}
while (true) {
ret = handle_overwrites(c, &darray_last(f->nodes), &nodes_heap);
if (ret)
goto err;
if (!nodes_heap.nr)
break;
ret = darray_push(&f->nodes, *min_heap_peek(&nodes_heap));
if (ret)
goto err;
min_heap_pop(&nodes_heap, &found_btree_node_heap_cbs, NULL);
}
for (struct found_btree_node *n = f->nodes.data; n < &darray_last(f->nodes); n++)
BUG_ON(nodes_overlap(n, n + 1));
if (0 && c->opts.verbose) {
printbuf_reset(&buf);
prt_printf(&buf, "%s: nodes found after overwrites:\n", __func__);
found_btree_nodes_to_text(&buf, c, f->nodes);
bch2_print_str(c, KERN_INFO, buf.buf);
} else {
bch_info(c, "btree node scan found %zu nodes after overwrites", f->nodes.nr);
}
eytzinger0_sort(f->nodes.data, f->nodes.nr, sizeof(f->nodes.data[0]), found_btree_node_cmp_pos, NULL);
err:
darray_exit(&nodes_heap);
printbuf_exit(&buf);
return ret;
}
static int found_btree_node_range_start_cmp(const void *_l, const void *_r)
{
const struct found_btree_node *l = _l;
const struct found_btree_node *r = _r;
return cmp_int(l->btree_id, r->btree_id) ?:
-cmp_int(l->level, r->level) ?:
bpos_cmp(l->max_key, r->min_key);
}
#define for_each_found_btree_node_in_range(_f, _search, _idx) \
for (size_t _idx = eytzinger0_find_gt((_f)->nodes.data, (_f)->nodes.nr, \
sizeof((_f)->nodes.data[0]), \
found_btree_node_range_start_cmp, &_search); \
_idx < (_f)->nodes.nr && \
(_f)->nodes.data[_idx].btree_id == _search.btree_id && \
(_f)->nodes.data[_idx].level == _search.level && \
bpos_lt((_f)->nodes.data[_idx].min_key, _search.max_key); \
_idx = eytzinger0_next(_idx, (_f)->nodes.nr))
bool bch2_btree_node_is_stale(struct bch_fs *c, struct btree *b)
{
struct find_btree_nodes *f = &c->found_btree_nodes;
struct found_btree_node search = {
.btree_id = b->c.btree_id,
.level = b->c.level,
.min_key = b->data->min_key,
.max_key = b->key.k.p,
};
for_each_found_btree_node_in_range(f, search, idx)
if (f->nodes.data[idx].seq > BTREE_NODE_SEQ(b->data))
return true;
return false;
}
int bch2_btree_has_scanned_nodes(struct bch_fs *c, enum btree_id btree)
{
int ret = bch2_run_print_explicit_recovery_pass(c, BCH_RECOVERY_PASS_scan_for_btree_nodes);
if (ret)
return ret;
struct found_btree_node search = {
.btree_id = btree,
.level = 0,
.min_key = POS_MIN,
.max_key = SPOS_MAX,
};
for_each_found_btree_node_in_range(&c->found_btree_nodes, search, idx)
return true;
return false;
}
int bch2_get_scanned_nodes(struct bch_fs *c, enum btree_id btree,
unsigned level, struct bpos node_min, struct bpos node_max)
{
if (btree_id_is_alloc(btree))
return 0;
struct find_btree_nodes *f = &c->found_btree_nodes;
int ret = bch2_run_print_explicit_recovery_pass(c, BCH_RECOVERY_PASS_scan_for_btree_nodes);
if (ret)
return ret;
if (c->opts.verbose) {
struct printbuf buf = PRINTBUF;
prt_str(&buf, "recovery ");
bch2_btree_id_level_to_text(&buf, btree, level);
prt_str(&buf, " ");
bch2_bpos_to_text(&buf, node_min);
prt_str(&buf, " - ");
bch2_bpos_to_text(&buf, node_max);
bch_info(c, "%s(): %s", __func__, buf.buf);
printbuf_exit(&buf);
}
struct found_btree_node search = {
.btree_id = btree,
.level = level,
.min_key = node_min,
.max_key = node_max,
};
for_each_found_btree_node_in_range(f, search, idx) {
struct found_btree_node n = f->nodes.data[idx];
n.range_updated |= bpos_lt(n.min_key, node_min);
n.min_key = bpos_max(n.min_key, node_min);
n.range_updated |= bpos_gt(n.max_key, node_max);
n.max_key = bpos_min(n.max_key, node_max);
struct { __BKEY_PADDED(k, BKEY_BTREE_PTR_VAL_U64s_MAX); } tmp;
found_btree_node_to_key(&tmp.k, &n);
if (c->opts.verbose) {
struct printbuf buf = PRINTBUF;
bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&tmp.k));
bch_verbose(c, "%s(): recovering %s", __func__, buf.buf);
printbuf_exit(&buf);
}
BUG_ON(bch2_bkey_validate(c, bkey_i_to_s_c(&tmp.k),
(struct bkey_validate_context) {
.from = BKEY_VALIDATE_btree_node,
.level = level + 1,
.btree = btree,
}));
ret = bch2_journal_key_insert(c, btree, level + 1, &tmp.k);
if (ret)
return ret;
}
return 0;
}
void bch2_find_btree_nodes_exit(struct find_btree_nodes *f)
{
darray_exit(&f->nodes);
}

@ -1,11 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_NODE_SCAN_H
#define _BCACHEFS_BTREE_NODE_SCAN_H
int bch2_scan_for_btree_nodes(struct bch_fs *);
bool bch2_btree_node_is_stale(struct bch_fs *, struct btree *);
int bch2_btree_has_scanned_nodes(struct bch_fs *, enum btree_id);
int bch2_get_scanned_nodes(struct bch_fs *, enum btree_id, unsigned, struct bpos, struct bpos);
void bch2_find_btree_nodes_exit(struct find_btree_nodes *);
#endif /* _BCACHEFS_BTREE_NODE_SCAN_H */

@ -1,31 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_NODE_SCAN_TYPES_H
#define _BCACHEFS_BTREE_NODE_SCAN_TYPES_H
#include "darray.h"
struct found_btree_node {
bool range_updated:1;
u8 btree_id;
u8 level;
unsigned sectors_written;
u32 seq;
u64 journal_seq;
u64 cookie;
struct bpos min_key;
struct bpos max_key;
unsigned nr_ptrs;
struct bch_extent_ptr ptrs[BCH_REPLICAS_MAX];
};
typedef DARRAY(struct found_btree_node) found_btree_nodes;
struct find_btree_nodes {
int ret;
struct mutex lock;
found_btree_nodes nodes;
};
#endif /* _BCACHEFS_BTREE_NODE_SCAN_TYPES_H */

File diff suppressed because it is too large

@ -1,937 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_TYPES_H
#define _BCACHEFS_BTREE_TYPES_H
#include <linux/list.h>
#include <linux/rhashtable.h>
#include "bbpos_types.h"
#include "btree_key_cache_types.h"
#include "buckets_types.h"
#include "darray.h"
#include "errcode.h"
#include "journal_types.h"
#include "replicas_types.h"
#include "six.h"
struct open_bucket;
struct btree_update;
struct btree_trans;
#define MAX_BSETS 3U
struct btree_nr_keys {
/*
* Amount of live metadata (i.e. size of node after a compaction) in
* units of u64s
*/
u16 live_u64s;
u16 bset_u64s[MAX_BSETS];
/* live keys only: */
u16 packed_keys;
u16 unpacked_keys;
};
struct bset_tree {
/*
* We construct a binary tree in an array as if the array
* started at 1, so that things line up on the same cachelines
* better: see comments in bset.c at cacheline_to_bkey() for
* details
*/
/* size of the binary tree and prev array */
u16 size;
/* function of size - precalculated for to_inorder() */
u16 extra;
u16 data_offset;
u16 aux_data_offset;
u16 end_offset;
};
struct btree_write {
struct journal_entry_pin journal;
};
struct btree_alloc {
struct open_buckets ob;
__BKEY_PADDED(k, BKEY_BTREE_PTR_VAL_U64s_MAX);
};
struct btree_bkey_cached_common {
struct six_lock lock;
u8 level;
u8 btree_id;
bool cached;
};
struct btree {
struct btree_bkey_cached_common c;
struct rhash_head hash;
u64 hash_val;
unsigned long flags;
u16 written;
u8 nsets;
u8 nr_key_bits;
u16 version_ondisk;
struct bkey_format format;
struct btree_node *data;
void *aux_data;
/*
* Sets of sorted keys - the real btree node - plus a binary search tree
*
* set[0] is special; set[0]->tree, set[0]->prev and set[0]->data point
* to the memory we have allocated for this btree node. Additionally,
* set[0]->data points to the entire btree node as it exists on disk.
*/
struct bset_tree set[MAX_BSETS];
struct btree_nr_keys nr;
u16 sib_u64s[2];
u16 whiteout_u64s;
u8 byte_order;
u8 unpack_fn_len;
struct btree_write writes[2];
/* Key/pointer for this btree node */
__BKEY_PADDED(key, BKEY_BTREE_PTR_VAL_U64s_MAX);
/*
* XXX: add a delete sequence number, so when bch2_btree_node_relock()
* fails because the lock sequence number has changed - i.e. the
* contents were modified - we can still relock the node if it's still
* the one we want, without redoing the traversal
*/
/*
* For asynchronous splits/interior node updates:
* When we do a split, we allocate new child nodes and update the parent
* node to point to them: we update the parent in memory immediately,
* but then we must wait until the children have been written out before
* the update to the parent can be written - this is a list of the
* btree_updates that are blocking this node from being
* written:
*/
struct list_head write_blocked;
/*
* Also for asynchronous splits/interior node updates:
* If a btree node isn't reachable yet, we don't want to kick off
* another write - because that write also won't yet be reachable and
* marking it as completed before it's reachable would be incorrect:
*/
unsigned long will_make_reachable;
struct open_buckets ob;
/* lru list */
struct list_head list;
};
#define BCH_BTREE_CACHE_NOT_FREED_REASONS() \
x(cache_reserve) \
x(lock_intent) \
x(lock_write) \
x(dirty) \
x(read_in_flight) \
x(write_in_flight) \
x(noevict) \
x(write_blocked) \
x(will_make_reachable) \
x(access_bit)
enum bch_btree_cache_not_freed_reasons {
#define x(n) BCH_BTREE_CACHE_NOT_FREED_##n,
BCH_BTREE_CACHE_NOT_FREED_REASONS()
#undef x
BCH_BTREE_CACHE_NOT_FREED_REASONS_NR,
};
struct btree_cache_list {
unsigned idx;
struct shrinker *shrink;
struct list_head list;
size_t nr;
};
struct btree_cache {
struct rhashtable table;
bool table_init_done;
/*
* We never free a struct btree, except on shutdown - we just put it on
* the btree_cache_freed list and reuse it later. This simplifies the
* code, and it doesn't cost us much memory as the memory usage is
* dominated by buffers that hold the actual btree node data and those
* can be freed - and the number of struct btrees allocated is
* effectively bounded.
*
* btree_cache_freeable effectively is a small cache - we use it because
* high order page allocations can be rather expensive, and it's quite
* common to delete and allocate btree nodes in quick succession. It
* should never grow past ~2-3 nodes in practice.
*/
struct mutex lock;
struct list_head freeable;
struct list_head freed_pcpu;
struct list_head freed_nonpcpu;
struct btree_cache_list live[2];
size_t nr_freeable;
size_t nr_reserve;
size_t nr_by_btree[BTREE_ID_NR];
atomic_long_t nr_dirty;
/* shrinker stats */
size_t nr_freed;
u64 not_freed[BCH_BTREE_CACHE_NOT_FREED_REASONS_NR];
/*
* If we need to allocate memory for a new btree node and that
* allocation fails, we can cannibalize another node in the btree cache
* to satisfy the allocation - lock to guarantee only one thread does
* this at a time:
*/
struct task_struct *alloc_lock;
struct closure_waitlist alloc_wait;
struct bbpos pinned_nodes_start;
struct bbpos pinned_nodes_end;
/* btree id mask: 0 for leaves, 1 for interior */
u64 pinned_nodes_mask[2];
};
struct btree_node_iter {
struct btree_node_iter_set {
u16 k, end;
} data[MAX_BSETS];
};
#define BTREE_ITER_FLAGS() \
x(slots) \
x(intent) \
x(prefetch) \
x(is_extents) \
x(not_extents) \
x(cached) \
x(with_key_cache) \
x(with_updates) \
x(with_journal) \
x(snapshot_field) \
x(all_snapshots) \
x(filter_snapshots) \
x(nopreserve) \
x(cached_nofill) \
x(key_cache_fill)
#define STR_HASH_FLAGS() \
x(must_create) \
x(must_replace)
#define BTREE_UPDATE_FLAGS() \
x(internal_snapshot_node) \
x(nojournal) \
x(key_cache_reclaim)
/*
* BTREE_TRIGGER_norun - don't run triggers at all
*
* BTREE_TRIGGER_transactional - we're running transactional triggers as part of
* a transaction commit: triggers may generate new updates
*
* BTREE_TRIGGER_atomic - we're running atomic triggers during a transaction
* commit: we have our journal reservation, we're holding btree node write
* locks, and we know the transaction is going to commit (returning an error
* here is a fatal error, causing us to go emergency read-only)
*
* BTREE_TRIGGER_gc - we're in gc/fsck: running triggers to recalculate e.g. disk usage
*
* BTREE_TRIGGER_insert - @new is entering the btree
* BTREE_TRIGGER_overwrite - @old is leaving the btree
*/
#define BTREE_TRIGGER_FLAGS() \
x(norun) \
x(transactional) \
x(atomic) \
x(check_repair) \
x(gc) \
x(insert) \
x(overwrite) \
x(is_root)
enum {
#define x(n) BTREE_ITER_FLAG_BIT_##n,
BTREE_ITER_FLAGS()
STR_HASH_FLAGS()
BTREE_UPDATE_FLAGS()
BTREE_TRIGGER_FLAGS()
#undef x
};
/* iter flags must fit in a u16: */
//BUILD_BUG_ON(BTREE_ITER_FLAG_BIT_key_cache_fill > 15);
enum btree_iter_update_trigger_flags {
#define x(n) BTREE_ITER_##n = 1U << BTREE_ITER_FLAG_BIT_##n,
BTREE_ITER_FLAGS()
#undef x
#define x(n) STR_HASH_##n = 1U << BTREE_ITER_FLAG_BIT_##n,
STR_HASH_FLAGS()
#undef x
#define x(n) BTREE_UPDATE_##n = 1U << BTREE_ITER_FLAG_BIT_##n,
BTREE_UPDATE_FLAGS()
#undef x
#define x(n) BTREE_TRIGGER_##n = 1U << BTREE_ITER_FLAG_BIT_##n,
BTREE_TRIGGER_FLAGS()
#undef x
};
enum btree_path_uptodate {
BTREE_ITER_UPTODATE = 0,
BTREE_ITER_NEED_RELOCK = 1,
BTREE_ITER_NEED_TRAVERSE = 2,
};
#if defined(CONFIG_BCACHEFS_LOCK_TIME_STATS) || defined(CONFIG_BCACHEFS_DEBUG)
#define TRACK_PATH_ALLOCATED
#endif
typedef u16 btree_path_idx_t;
struct btree_path {
btree_path_idx_t sorted_idx;
u8 ref;
u8 intent_ref;
/* btree_iter_copy starts here: */
struct bpos pos;
enum btree_id btree_id:5;
bool cached:1;
bool preserve:1;
enum btree_path_uptodate uptodate:2;
/*
* When true, failing to relock this path will cause the transaction to
* restart:
*/
bool should_be_locked:1;
unsigned level:3,
locks_want:3;
u8 nodes_locked;
struct btree_path_level {
struct btree *b;
struct btree_node_iter iter;
u32 lock_seq;
#ifdef CONFIG_BCACHEFS_LOCK_TIME_STATS
u64 lock_taken_time;
#endif
} l[BTREE_MAX_DEPTH];
#ifdef TRACK_PATH_ALLOCATED
unsigned long ip_allocated;
#endif
};
static inline struct btree_path_level *path_l(struct btree_path *path)
{
return path->l + path->level;
}
static inline unsigned long btree_path_ip_allocated(struct btree_path *path)
{
#ifdef TRACK_PATH_ALLOCATED
return path->ip_allocated;
#else
return _THIS_IP_;
#endif
}
/*
* @pos - iterator's current position
* @level - current btree depth
* @locks_want - btree level below which we start taking intent locks
* @nodes_locked - bitmask indicating which nodes in @nodes are locked
* @nodes_intent_locked - bitmask indicating which locks are intent locks
*/
struct btree_iter {
btree_path_idx_t path;
btree_path_idx_t update_path;
btree_path_idx_t key_cache_path;
enum btree_id btree_id:8;
u8 min_depth;
/* btree_iter_copy starts here: */
u16 flags;
/* When we're filtering by snapshot, the snapshot ID we're looking for: */
unsigned snapshot;
struct bpos pos;
/*
* Current unpacked key - so that bch2_btree_iter_next()/
* bch2_btree_iter_next_slot() can correctly advance pos.
*/
struct bkey k;
/* BTREE_ITER_with_journal: */
size_t journal_idx;
#ifdef TRACK_PATH_ALLOCATED
unsigned long ip_allocated;
#endif
};
#define BKEY_CACHED_ACCESSED 0
#define BKEY_CACHED_DIRTY 1
struct bkey_cached {
struct btree_bkey_cached_common c;
unsigned long flags;
u16 u64s;
struct bkey_cached_key key;
struct rhash_head hash;
struct journal_entry_pin journal;
u64 seq;
struct bkey_i *k;
struct rcu_head rcu;
};
static inline struct bpos btree_node_pos(struct btree_bkey_cached_common *b)
{
return !b->cached
? container_of(b, struct btree, c)->key.k.p
: container_of(b, struct bkey_cached, c)->key.pos;
}
struct btree_insert_entry {
unsigned flags;
u8 sort_order;
u8 bkey_type;
enum btree_id btree_id:8;
u8 level:4;
bool cached:1;
bool insert_trigger_run:1;
bool overwrite_trigger_run:1;
bool key_cache_already_flushed:1;
/*
* @old_k may be a key from the journal; @old_btree_u64s always refers
* to the size of the key being overwritten in the btree:
*/
u8 old_btree_u64s;
btree_path_idx_t path;
struct bkey_i *k;
/* key being overwritten: */
struct bkey old_k;
const struct bch_val *old_v;
unsigned long ip_allocated;
};
/* Number of btree paths we preallocate, usually enough */
#define BTREE_ITER_INITIAL 64
/*
* Limit for btree_trans_too_many_iters(); this is enough that almost all code
* paths should run inside this limit, and if they don't it usually indicates a
* bug (leaking/duplicated btree paths).
*
* exception: some fsck paths
*
* bugs with excessive path usage seem to have possibly been eliminated now, so
* we might consider eliminating this (and btree_trans_too_many_iters()) at some
* point.
*/
#define BTREE_ITER_NORMAL_LIMIT 256
/* never exceed limit */
#define BTREE_ITER_MAX (1U << 10)
struct btree_trans_commit_hook;
typedef int (btree_trans_commit_hook_fn)(struct btree_trans *, struct btree_trans_commit_hook *);
struct btree_trans_commit_hook {
btree_trans_commit_hook_fn *fn;
struct btree_trans_commit_hook *next;
};
#define BTREE_TRANS_MEM_MAX (1U << 16)
#define BTREE_TRANS_MAX_LOCK_HOLD_TIME_NS 10000
struct btree_trans_paths {
unsigned long nr_paths;
struct btree_path paths[];
};
struct trans_kmalloc_trace {
unsigned long ip;
size_t bytes;
};
typedef DARRAY(struct trans_kmalloc_trace) darray_trans_kmalloc_trace;
struct btree_trans_subbuf {
u16 base;
u16 u64s;
u16 size;
};
struct btree_trans {
struct bch_fs *c;
unsigned long *paths_allocated;
struct btree_path *paths;
btree_path_idx_t *sorted;
struct btree_insert_entry *updates;
void *mem;
unsigned mem_top;
unsigned mem_bytes;
unsigned realloc_bytes_required;
#ifdef CONFIG_BCACHEFS_TRANS_KMALLOC_TRACE
darray_trans_kmalloc_trace trans_kmalloc_trace;
#endif
btree_path_idx_t nr_sorted;
btree_path_idx_t nr_paths;
btree_path_idx_t nr_paths_max;
btree_path_idx_t nr_updates;
u8 fn_idx;
u8 lock_must_abort;
bool lock_may_not_fail:1;
bool srcu_held:1;
bool locked:1;
bool pf_memalloc_nofs:1;
bool write_locked:1;
bool used_mempool:1;
bool in_traverse_all:1;
bool paths_sorted:1;
bool memory_allocation_failure:1;
bool journal_transaction_names:1;
bool journal_replay_not_finished:1;
bool notrace_relock_fail:1;
enum bch_errcode restarted:16;
u32 restart_count;
#ifdef CONFIG_BCACHEFS_INJECT_TRANSACTION_RESTARTS
u32 restart_count_this_trans;
#endif
u64 last_begin_time;
unsigned long last_begin_ip;
unsigned long last_restarted_ip;
#ifdef CONFIG_BCACHEFS_DEBUG
bch_stacktrace last_restarted_trace;
#endif
unsigned long last_unlock_ip;
unsigned long srcu_lock_time;
const char *fn;
struct btree_bkey_cached_common *locking;
struct six_lock_waiter locking_wait;
int srcu_idx;
/* update path: */
struct btree_trans_subbuf journal_entries;
struct btree_trans_subbuf accounting;
struct btree_trans_commit_hook *hooks;
struct journal_entry_pin *journal_pin;
struct journal_res journal_res;
u64 *journal_seq;
struct disk_reservation *disk_res;
struct bch_fs_usage_base fs_usage_delta;
unsigned journal_u64s;
unsigned extra_disk_res; /* XXX kill */
__BKEY_PADDED(btree_path_down, BKEY_BTREE_PTR_VAL_U64s_MAX);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
#endif
/* Entries before this are zeroed out on every bch2_trans_get() call */
struct list_head list;
struct closure ref;
unsigned long _paths_allocated[BITS_TO_LONGS(BTREE_ITER_INITIAL)];
struct btree_trans_paths trans_paths;
struct btree_path _paths[BTREE_ITER_INITIAL];
btree_path_idx_t _sorted[BTREE_ITER_INITIAL + 4];
struct btree_insert_entry _updates[BTREE_ITER_INITIAL];
};
static inline struct btree_path *btree_iter_path(struct btree_trans *trans, struct btree_iter *iter)
{
return trans->paths + iter->path;
}
static inline struct btree_path *btree_iter_key_cache_path(struct btree_trans *trans, struct btree_iter *iter)
{
return iter->key_cache_path
? trans->paths + iter->key_cache_path
: NULL;
}
#define BCH_BTREE_WRITE_TYPES() \
x(initial, 0) \
x(init_next_bset, 1) \
x(cache_reclaim, 2) \
x(journal_reclaim, 3) \
x(interior, 4)
enum btree_write_type {
#define x(t, n) BTREE_WRITE_##t,
BCH_BTREE_WRITE_TYPES()
#undef x
BTREE_WRITE_TYPE_NR,
};
#define BTREE_WRITE_TYPE_MASK (roundup_pow_of_two(BTREE_WRITE_TYPE_NR) - 1)
#define BTREE_WRITE_TYPE_BITS ilog2(roundup_pow_of_two(BTREE_WRITE_TYPE_NR))
#define BTREE_FLAGS() \
x(read_in_flight) \
x(read_error) \
x(dirty) \
x(need_write) \
x(write_blocked) \
x(will_make_reachable) \
x(noevict) \
x(write_idx) \
x(accessed) \
x(write_in_flight) \
x(write_in_flight_inner) \
x(just_written) \
x(dying) \
x(fake) \
x(need_rewrite) \
x(need_rewrite_error) \
x(need_rewrite_degraded) \
x(need_rewrite_ptr_written_zero) \
x(never_write) \
x(pinned)
enum btree_flags {
/* First bits for btree node write type */
BTREE_NODE_FLAGS_START = BTREE_WRITE_TYPE_BITS - 1,
#define x(flag) BTREE_NODE_##flag,
BTREE_FLAGS()
#undef x
};
#define x(flag) \
static inline bool btree_node_ ## flag(struct btree *b) \
{ return test_bit(BTREE_NODE_ ## flag, &b->flags); } \
\
static inline void set_btree_node_ ## flag(struct btree *b) \
{ set_bit(BTREE_NODE_ ## flag, &b->flags); } \
\
static inline void clear_btree_node_ ## flag(struct btree *b) \
{ clear_bit(BTREE_NODE_ ## flag, &b->flags); }
BTREE_FLAGS()
#undef x
#define BTREE_NODE_REWRITE_REASON() \
x(none) \
x(unknown) \
x(error) \
x(degraded) \
x(ptr_written_zero)
enum btree_node_rewrite_reason {
#define x(n) BTREE_NODE_REWRITE_##n,
BTREE_NODE_REWRITE_REASON()
#undef x
};
static inline enum btree_node_rewrite_reason btree_node_rewrite_reason(struct btree *b)
{
if (btree_node_need_rewrite_ptr_written_zero(b))
return BTREE_NODE_REWRITE_ptr_written_zero;
if (btree_node_need_rewrite_degraded(b))
return BTREE_NODE_REWRITE_degraded;
if (btree_node_need_rewrite_error(b))
return BTREE_NODE_REWRITE_error;
if (btree_node_need_rewrite(b))
return BTREE_NODE_REWRITE_unknown;
return BTREE_NODE_REWRITE_none;
}
static inline struct btree_write *btree_current_write(struct btree *b)
{
return b->writes + btree_node_write_idx(b);
}
static inline struct btree_write *btree_prev_write(struct btree *b)
{
return b->writes + (btree_node_write_idx(b) ^ 1);
}
static inline struct bset_tree *bset_tree_last(struct btree *b)
{
EBUG_ON(!b->nsets);
return b->set + b->nsets - 1;
}
static inline void *
__btree_node_offset_to_ptr(const struct btree *b, u16 offset)
{
return (void *) ((u64 *) b->data + offset);
}
static inline u16
__btree_node_ptr_to_offset(const struct btree *b, const void *p)
{
u16 ret = (u64 *) p - (u64 *) b->data;
EBUG_ON(__btree_node_offset_to_ptr(b, ret) != p);
return ret;
}
static inline struct bset *bset(const struct btree *b,
const struct bset_tree *t)
{
return __btree_node_offset_to_ptr(b, t->data_offset);
}
static inline void set_btree_bset_end(struct btree *b, struct bset_tree *t)
{
t->end_offset =
__btree_node_ptr_to_offset(b, vstruct_last(bset(b, t)));
}
static inline void set_btree_bset(struct btree *b, struct bset_tree *t,
const struct bset *i)
{
t->data_offset = __btree_node_ptr_to_offset(b, i);
set_btree_bset_end(b, t);
}
static inline struct bset *btree_bset_first(struct btree *b)
{
return bset(b, b->set);
}
static inline struct bset *btree_bset_last(struct btree *b)
{
return bset(b, bset_tree_last(b));
}
static inline u16
__btree_node_key_to_offset(const struct btree *b, const struct bkey_packed *k)
{
return __btree_node_ptr_to_offset(b, k);
}
static inline struct bkey_packed *
__btree_node_offset_to_key(const struct btree *b, u16 k)
{
return __btree_node_offset_to_ptr(b, k);
}
static inline unsigned btree_bkey_first_offset(const struct bset_tree *t)
{
return t->data_offset + offsetof(struct bset, _data) / sizeof(u64);
}
#define btree_bkey_first(_b, _t) \
({ \
EBUG_ON(bset(_b, _t)->start != \
__btree_node_offset_to_key(_b, btree_bkey_first_offset(_t)));\
\
bset(_b, _t)->start; \
})
#define btree_bkey_last(_b, _t) \
({ \
EBUG_ON(__btree_node_offset_to_key(_b, (_t)->end_offset) != \
vstruct_last(bset(_b, _t))); \
\
__btree_node_offset_to_key(_b, (_t)->end_offset); \
})
static inline unsigned bset_u64s(struct bset_tree *t)
{
return t->end_offset - t->data_offset -
sizeof(struct bset) / sizeof(u64);
}
static inline unsigned bset_dead_u64s(struct btree *b, struct bset_tree *t)
{
return bset_u64s(t) - b->nr.bset_u64s[t - b->set];
}
static inline unsigned bset_byte_offset(struct btree *b, void *i)
{
return i - (void *) b->data;
}
enum btree_node_type {
BKEY_TYPE_btree,
#define x(kwd, val, ...) BKEY_TYPE_##kwd = val + 1,
BCH_BTREE_IDS()
#undef x
BKEY_TYPE_NR
};
/* Type of a key in btree @id at level @level: */
static inline enum btree_node_type __btree_node_type(unsigned level, enum btree_id id)
{
return level ? BKEY_TYPE_btree : (unsigned) id + 1;
}
/* Type of keys @b contains: */
static inline enum btree_node_type btree_node_type(struct btree *b)
{
return __btree_node_type(b->c.level, b->c.btree_id);
}
const char *bch2_btree_node_type_str(enum btree_node_type);
#define BTREE_NODE_TYPE_HAS_TRANS_TRIGGERS \
(BIT_ULL(BKEY_TYPE_extents)| \
BIT_ULL(BKEY_TYPE_alloc)| \
BIT_ULL(BKEY_TYPE_inodes)| \
BIT_ULL(BKEY_TYPE_stripes)| \
BIT_ULL(BKEY_TYPE_reflink)| \
BIT_ULL(BKEY_TYPE_subvolumes)| \
BIT_ULL(BKEY_TYPE_btree))
#define BTREE_NODE_TYPE_HAS_ATOMIC_TRIGGERS \
(BIT_ULL(BKEY_TYPE_alloc)| \
BIT_ULL(BKEY_TYPE_inodes)| \
BIT_ULL(BKEY_TYPE_stripes)| \
BIT_ULL(BKEY_TYPE_snapshots))
#define BTREE_NODE_TYPE_HAS_TRIGGERS \
(BTREE_NODE_TYPE_HAS_TRANS_TRIGGERS| \
BTREE_NODE_TYPE_HAS_ATOMIC_TRIGGERS)
static inline bool btree_node_type_has_trans_triggers(enum btree_node_type type)
{
return BIT_ULL(type) & BTREE_NODE_TYPE_HAS_TRANS_TRIGGERS;
}
static inline bool btree_node_type_has_atomic_triggers(enum btree_node_type type)
{
return BIT_ULL(type) & BTREE_NODE_TYPE_HAS_ATOMIC_TRIGGERS;
}
static inline bool btree_node_type_has_triggers(enum btree_node_type type)
{
return BIT_ULL(type) & BTREE_NODE_TYPE_HAS_TRIGGERS;
}
static inline bool btree_id_is_extents(enum btree_id btree)
{
const u64 mask = 0
#define x(name, nr, flags, ...) |((!!((flags) & BTREE_IS_extents)) << nr)
BCH_BTREE_IDS()
#undef x
;
return BIT_ULL(btree) & mask;
}
static inline bool btree_node_type_is_extents(enum btree_node_type type)
{
return type != BKEY_TYPE_btree && btree_id_is_extents(type - 1);
}
static inline bool btree_type_has_snapshots(enum btree_id btree)
{
const u64 mask = 0
#define x(name, nr, flags, ...) |((!!((flags) & BTREE_IS_snapshots)) << nr)
BCH_BTREE_IDS()
#undef x
;
return BIT_ULL(btree) & mask;
}
static inline bool btree_type_has_snapshot_field(enum btree_id btree)
{
const u64 mask = 0
#define x(name, nr, flags, ...) |((!!((flags) & (BTREE_IS_snapshot_field|BTREE_IS_snapshots))) << nr)
BCH_BTREE_IDS()
#undef x
;
return BIT_ULL(btree) & mask;
}
static inline bool btree_type_has_ptrs(enum btree_id btree)
{
const u64 mask = 0
#define x(name, nr, flags, ...) |((!!((flags) & BTREE_IS_data)) << nr)
BCH_BTREE_IDS()
#undef x
;
return BIT_ULL(btree) & mask;
}
static inline bool btree_type_uses_write_buffer(enum btree_id btree)
{
const u64 mask = 0
#define x(name, nr, flags, ...) |((!!((flags) & BTREE_IS_write_buffer)) << nr)
BCH_BTREE_IDS()
#undef x
;
return BIT_ULL(btree) & mask;
}
static inline u8 btree_trigger_order(enum btree_id btree)
{
switch (btree) {
case BTREE_ID_alloc:
return U8_MAX;
case BTREE_ID_stripes:
return U8_MAX - 1;
default:
return btree;
}
}
struct btree_root {
struct btree *b;
/* On disk root - see async splits: */
__BKEY_PADDED(key, BKEY_BTREE_PTR_VAL_U64s_MAX);
u8 level;
u8 alive;
s16 error;
};
enum btree_gc_coalesce_fail_reason {
BTREE_GC_COALESCE_FAIL_RESERVE_GET,
BTREE_GC_COALESCE_FAIL_KEYLIST_REALLOC,
BTREE_GC_COALESCE_FAIL_FORMAT_FITS,
};
enum btree_node_sibling {
btree_prev_sib,
btree_next_sib,
};
struct get_locks_fail {
unsigned l;
struct btree *b;
};
#endif /* _BCACHEFS_BTREE_TYPES_H */

@ -1,916 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
#include "bcachefs.h"
#include "btree_update.h"
#include "btree_iter.h"
#include "btree_journal_iter.h"
#include "btree_locking.h"
#include "buckets.h"
#include "debug.h"
#include "errcode.h"
#include "error.h"
#include "extents.h"
#include "keylist.h"
#include "snapshot.h"
#include "trace.h"
#include <linux/string_helpers.h>
static inline int btree_insert_entry_cmp(const struct btree_insert_entry *l,
const struct btree_insert_entry *r)
{
return cmp_int(l->sort_order, r->sort_order) ?:
cmp_int(l->cached, r->cached) ?:
-cmp_int(l->level, r->level) ?:
bpos_cmp(l->k->k.p, r->k->k.p);
}
static int __must_check
bch2_trans_update_by_path(struct btree_trans *, btree_path_idx_t,
struct bkey_i *, enum btree_iter_update_trigger_flags,
unsigned long ip);
static noinline int extent_front_merge(struct btree_trans *trans,
struct btree_iter *iter,
struct bkey_s_c k,
struct bkey_i **insert,
enum btree_iter_update_trigger_flags flags)
{
struct bch_fs *c = trans->c;
struct bkey_i *update;
int ret;
if (unlikely(trans->journal_replay_not_finished))
return 0;
update = bch2_bkey_make_mut_noupdate(trans, k);
ret = PTR_ERR_OR_ZERO(update);
if (ret)
return ret;
if (!bch2_bkey_merge(c, bkey_i_to_s(update), bkey_i_to_s_c(*insert)))
return 0;
ret = bch2_key_has_snapshot_overwrites(trans, iter->btree_id, k.k->p) ?:
bch2_key_has_snapshot_overwrites(trans, iter->btree_id, (*insert)->k.p);
if (ret < 0)
return ret;
if (ret)
return 0;
ret = bch2_btree_delete_at(trans, iter, flags);
if (ret)
return ret;
*insert = update;
return 0;
}
static noinline int extent_back_merge(struct btree_trans *trans,
struct btree_iter *iter,
struct bkey_i *insert,
struct bkey_s_c k)
{
struct bch_fs *c = trans->c;
int ret;
if (unlikely(trans->journal_replay_not_finished))
return 0;
ret = bch2_key_has_snapshot_overwrites(trans, iter->btree_id, insert->k.p) ?:
bch2_key_has_snapshot_overwrites(trans, iter->btree_id, k.k->p);
if (ret < 0)
return ret;
if (ret)
return 0;
bch2_bkey_merge(c, bkey_i_to_s(insert), k);
return 0;
}
/*
* When deleting, check if we need to emit a whiteout (because we're overwriting
* something in an ancestor snapshot)
*/
static int need_whiteout_for_snapshot(struct btree_trans *trans,
enum btree_id btree_id, struct bpos pos)
{
struct btree_iter iter;
struct bkey_s_c k;
u32 snapshot = pos.snapshot;
int ret;
if (!bch2_snapshot_parent(trans->c, pos.snapshot))
return 0;
pos.snapshot++;
for_each_btree_key_norestart(trans, iter, btree_id, pos,
BTREE_ITER_all_snapshots|
BTREE_ITER_nopreserve, k, ret) {
if (!bkey_eq(k.k->p, pos))
break;
if (bch2_snapshot_is_ancestor(trans->c, snapshot,
k.k->p.snapshot)) {
ret = !bkey_whiteout(k.k);
break;
}
}
bch2_trans_iter_exit(trans, &iter);
return ret;
}
int __bch2_insert_snapshot_whiteouts(struct btree_trans *trans,
enum btree_id btree, struct bpos pos,
snapshot_id_list *s)
{
int ret = 0;
darray_for_each(*s, id) {
pos.snapshot = *id;
struct btree_iter iter;
struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter, btree, pos,
BTREE_ITER_not_extents|
BTREE_ITER_intent);
ret = bkey_err(k);
if (ret)
break;
if (k.k->type == KEY_TYPE_deleted) {
struct bkey_i *update = bch2_trans_kmalloc(trans, sizeof(struct bkey_i));
ret = PTR_ERR_OR_ZERO(update);
if (ret) {
bch2_trans_iter_exit(trans, &iter);
break;
}
bkey_init(&update->k);
update->k.p = pos;
update->k.type = KEY_TYPE_whiteout;
ret = bch2_trans_update(trans, &iter, update,
BTREE_UPDATE_internal_snapshot_node);
}
bch2_trans_iter_exit(trans, &iter);
if (ret)
break;
}
darray_exit(s);
return ret;
}
int bch2_trans_update_extent_overwrite(struct btree_trans *trans,
struct btree_iter *iter,
enum btree_iter_update_trigger_flags flags,
struct bkey_s_c old,
struct bkey_s_c new)
{
enum btree_id btree_id = iter->btree_id;
struct bkey_i *update;
struct bpos new_start = bkey_start_pos(new.k);
unsigned front_split = bkey_lt(bkey_start_pos(old.k), new_start);
unsigned back_split = bkey_gt(old.k->p, new.k->p);
unsigned middle_split = (front_split || back_split) &&
old.k->p.snapshot != new.k->p.snapshot;
unsigned nr_splits = front_split + back_split + middle_split;
int ret = 0, compressed_sectors;
/*
* If we're going to be splitting a compressed extent, note it
* so that __bch2_trans_commit() can increase our disk
* reservation:
*/
if (nr_splits > 1 &&
(compressed_sectors = bch2_bkey_sectors_compressed(old)))
trans->extra_disk_res += compressed_sectors * (nr_splits - 1);
if (front_split) {
update = bch2_bkey_make_mut_noupdate(trans, old);
if ((ret = PTR_ERR_OR_ZERO(update)))
return ret;
bch2_cut_back(new_start, update);
ret = bch2_insert_snapshot_whiteouts(trans, btree_id,
old.k->p, update->k.p) ?:
bch2_btree_insert_nonextent(trans, btree_id, update,
BTREE_UPDATE_internal_snapshot_node|flags);
if (ret)
return ret;
}
/* If we're overwriting in a different snapshot - middle split: */
if (middle_split) {
update = bch2_bkey_make_mut_noupdate(trans, old);
if ((ret = PTR_ERR_OR_ZERO(update)))
return ret;
bch2_cut_front(new_start, update);
bch2_cut_back(new.k->p, update);
ret = bch2_insert_snapshot_whiteouts(trans, btree_id,
old.k->p, update->k.p) ?:
bch2_btree_insert_nonextent(trans, btree_id, update,
BTREE_UPDATE_internal_snapshot_node|flags);
if (ret)
return ret;
}
if (bkey_le(old.k->p, new.k->p)) {
update = bch2_trans_kmalloc(trans, sizeof(*update));
if ((ret = PTR_ERR_OR_ZERO(update)))
return ret;
bkey_init(&update->k);
update->k.p = old.k->p;
update->k.p.snapshot = new.k->p.snapshot;
if (new.k->p.snapshot != old.k->p.snapshot) {
update->k.type = KEY_TYPE_whiteout;
} else if (btree_type_has_snapshots(btree_id)) {
ret = need_whiteout_for_snapshot(trans, btree_id, update->k.p);
if (ret < 0)
return ret;
if (ret)
update->k.type = KEY_TYPE_whiteout;
}
ret = bch2_btree_insert_nonextent(trans, btree_id, update,
BTREE_UPDATE_internal_snapshot_node|flags);
if (ret)
return ret;
}
if (back_split) {
update = bch2_bkey_make_mut_noupdate(trans, old);
if ((ret = PTR_ERR_OR_ZERO(update)))
return ret;
bch2_cut_front(new.k->p, update);
ret = bch2_trans_update_by_path(trans, iter->path, update,
BTREE_UPDATE_internal_snapshot_node|
flags, _RET_IP_);
if (ret)
return ret;
}
return 0;
}
static int bch2_trans_update_extent(struct btree_trans *trans,
struct btree_iter *orig_iter,
struct bkey_i *insert,
enum btree_iter_update_trigger_flags flags)
{
struct btree_iter iter;
struct bkey_s_c k;
enum btree_id btree_id = orig_iter->btree_id;
int ret = 0;
bch2_trans_iter_init(trans, &iter, btree_id, bkey_start_pos(&insert->k),
BTREE_ITER_intent|
BTREE_ITER_with_updates|
BTREE_ITER_not_extents);
k = bch2_btree_iter_peek_max(trans, &iter, POS(insert->k.p.inode, U64_MAX));
if ((ret = bkey_err(k)))
goto err;
if (!k.k)
goto out;
if (bkey_eq(k.k->p, bkey_start_pos(&insert->k))) {
if (bch2_bkey_maybe_mergable(k.k, &insert->k)) {
ret = extent_front_merge(trans, &iter, k, &insert, flags);
if (ret)
goto err;
}
goto next;
}
while (bkey_gt(insert->k.p, bkey_start_pos(k.k))) {
bool done = bkey_lt(insert->k.p, k.k->p);
ret = bch2_trans_update_extent_overwrite(trans, &iter, flags, k, bkey_i_to_s_c(insert));
if (ret)
goto err;
if (done)
goto out;
next:
bch2_btree_iter_advance(trans, &iter);
k = bch2_btree_iter_peek_max(trans, &iter, POS(insert->k.p.inode, U64_MAX));
if ((ret = bkey_err(k)))
goto err;
if (!k.k)
goto out;
}
if (bch2_bkey_maybe_mergable(&insert->k, k.k)) {
ret = extent_back_merge(trans, &iter, insert, k);
if (ret)
goto err;
}
out:
if (!bkey_deleted(&insert->k))
ret = bch2_btree_insert_nonextent(trans, btree_id, insert, flags);
err:
bch2_trans_iter_exit(trans, &iter);
return ret;
}
static noinline int flush_new_cached_update(struct btree_trans *trans,
struct btree_insert_entry *i,
enum btree_iter_update_trigger_flags flags,
unsigned long ip)
{
struct bkey k;
int ret;
btree_path_idx_t path_idx =
bch2_path_get(trans, i->btree_id, i->old_k.p, 1, 0,
BTREE_ITER_intent, _THIS_IP_);
ret = bch2_btree_path_traverse(trans, path_idx, 0);
if (ret)
goto out;
struct btree_path *btree_path = trans->paths + path_idx;
/*
* The old key in the insert entry might actually refer to an existing
* key in the btree that has been deleted from cache and not yet
* flushed. Check for this and skip the flush so we don't run triggers
* against a stale key.
*/
bch2_btree_path_peek_slot_exact(btree_path, &k);
if (!bkey_deleted(&k))
goto out;
i->key_cache_already_flushed = true;
i->flags |= BTREE_TRIGGER_norun;
btree_path_set_should_be_locked(trans, btree_path);
ret = bch2_trans_update_by_path(trans, path_idx, i->k, flags, ip);
out:
bch2_path_put(trans, path_idx, true);
return ret;
}
static int __must_check
bch2_trans_update_by_path(struct btree_trans *trans, btree_path_idx_t path_idx,
struct bkey_i *k, enum btree_iter_update_trigger_flags flags,
unsigned long ip)
{
struct bch_fs *c = trans->c;
struct btree_insert_entry *i, n;
int cmp;
struct btree_path *path = trans->paths + path_idx;
EBUG_ON(!path->should_be_locked);
EBUG_ON(trans->nr_updates >= trans->nr_paths);
EBUG_ON(!bpos_eq(k->k.p, path->pos));
n = (struct btree_insert_entry) {
.flags = flags,
.sort_order = btree_trigger_order(path->btree_id),
.bkey_type = __btree_node_type(path->level, path->btree_id),
.btree_id = path->btree_id,
.level = path->level,
.cached = path->cached,
.path = path_idx,
.k = k,
.ip_allocated = ip,
};
#ifdef CONFIG_BCACHEFS_DEBUG
trans_for_each_update(trans, i)
BUG_ON(i != trans->updates &&
btree_insert_entry_cmp(i - 1, i) >= 0);
#endif
/*
* Pending updates are kept sorted: first, find position of new update,
* then delete/trim any updates the new update overwrites:
*/
for (i = trans->updates; i < trans->updates + trans->nr_updates; i++) {
cmp = btree_insert_entry_cmp(&n, i);
if (cmp <= 0)
break;
}
bool overwrite = !cmp && i < trans->updates + trans->nr_updates;
if (overwrite) {
EBUG_ON(i->insert_trigger_run || i->overwrite_trigger_run);
bch2_path_put(trans, i->path, true);
i->flags = n.flags;
i->cached = n.cached;
i->k = n.k;
i->path = n.path;
i->ip_allocated = n.ip_allocated;
} else {
array_insert_item(trans->updates, trans->nr_updates,
i - trans->updates, n);
i->old_v = bch2_btree_path_peek_slot_exact(path, &i->old_k).v;
i->old_btree_u64s = !bkey_deleted(&i->old_k) ? i->old_k.u64s : 0;
if (unlikely(trans->journal_replay_not_finished)) {
struct bkey_i *j_k =
bch2_journal_keys_peek_slot(c, n.btree_id, n.level, k->k.p);
if (j_k) {
i->old_k = j_k->k;
i->old_v = &j_k->v;
}
}
}
__btree_path_get(trans, trans->paths + i->path, true);
trace_update_by_path(trans, path, i, overwrite);
/*
* If a key is present in the key cache, it must also exist in the
* btree - this is necessary for cache coherency. When iterating over
* a btree that's cached in the key cache, the btree iter code checks
* the key cache - but the key has to exist in the btree for that to
* work:
*/
if (path->cached && !i->old_btree_u64s)
return flush_new_cached_update(trans, i, flags, ip);
return 0;
}
static noinline int bch2_trans_update_get_key_cache(struct btree_trans *trans,
struct btree_iter *iter,
struct btree_path *path)
{
struct btree_path *key_cache_path = btree_iter_key_cache_path(trans, iter);
if (!key_cache_path ||
!key_cache_path->should_be_locked ||
!bpos_eq(key_cache_path->pos, iter->pos)) {
struct bkey_cached *ck;
int ret;
if (!iter->key_cache_path)
iter->key_cache_path =
bch2_path_get(trans, path->btree_id, path->pos, 1, 0,
BTREE_ITER_intent|
BTREE_ITER_cached, _THIS_IP_);
iter->key_cache_path =
bch2_btree_path_set_pos(trans, iter->key_cache_path, path->pos,
iter->flags & BTREE_ITER_intent,
_THIS_IP_);
ret = bch2_btree_path_traverse(trans, iter->key_cache_path, BTREE_ITER_cached);
if (unlikely(ret))
return ret;
ck = (void *) trans->paths[iter->key_cache_path].l[0].b;
if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
trace_and_count(trans->c, trans_restart_key_cache_raced, trans, _RET_IP_);
return btree_trans_restart(trans, BCH_ERR_transaction_restart_key_cache_raced);
}
btree_path_set_should_be_locked(trans, trans->paths + iter->key_cache_path);
}
return 0;
}
int __must_check bch2_trans_update_ip(struct btree_trans *trans, struct btree_iter *iter,
struct bkey_i *k, enum btree_iter_update_trigger_flags flags,
unsigned long ip)
{
kmsan_check_memory(k, bkey_bytes(&k->k));
btree_path_idx_t path_idx = iter->update_path ?: iter->path;
int ret;
if (iter->flags & BTREE_ITER_is_extents)
return bch2_trans_update_extent(trans, iter, k, flags);
if (bkey_deleted(&k->k) &&
!(flags & BTREE_UPDATE_key_cache_reclaim) &&
(iter->flags & BTREE_ITER_filter_snapshots)) {
ret = need_whiteout_for_snapshot(trans, iter->btree_id, k->k.p);
if (unlikely(ret < 0))
return ret;
if (ret)
k->k.type = KEY_TYPE_whiteout;
}
/*
* Ensure that updates to cached btrees go to the key cache:
*/
struct btree_path *path = trans->paths + path_idx;
if (!(flags & BTREE_UPDATE_key_cache_reclaim) &&
!path->cached &&
!path->level &&
btree_id_cached(trans->c, path->btree_id)) {
ret = bch2_trans_update_get_key_cache(trans, iter, path);
if (ret)
return ret;
path_idx = iter->key_cache_path;
}
return bch2_trans_update_by_path(trans, path_idx, k, flags, ip);
}
int bch2_btree_insert_clone_trans(struct btree_trans *trans,
enum btree_id btree,
struct bkey_i *k)
{
struct bkey_i *n = bch2_trans_kmalloc(trans, bkey_bytes(&k->k));
int ret = PTR_ERR_OR_ZERO(n);
if (ret)
return ret;
bkey_copy(n, k);
return bch2_btree_insert_trans(trans, btree, n, 0);
}
void *__bch2_trans_subbuf_alloc(struct btree_trans *trans,
struct btree_trans_subbuf *buf,
unsigned u64s)
{
unsigned new_top = buf->u64s + u64s;
unsigned new_size = buf->size;
BUG_ON(roundup_pow_of_two(new_top) > U16_MAX);
if (new_top > new_size)
new_size = roundup_pow_of_two(new_top);
void *n = bch2_trans_kmalloc_nomemzero(trans, new_size * sizeof(u64));
if (IS_ERR(n))
return n;
unsigned offset = (u64 *) n - (u64 *) trans->mem;
BUG_ON(offset > U16_MAX);
if (buf->u64s)
memcpy(n,
btree_trans_subbuf_base(trans, buf),
buf->size * sizeof(u64));
buf->base = (u64 *) n - (u64 *) trans->mem;
buf->size = new_size;
void *p = btree_trans_subbuf_top(trans, buf);
buf->u64s = new_top;
return p;
}
int bch2_bkey_get_empty_slot(struct btree_trans *trans, struct btree_iter *iter,
enum btree_id btree, struct bpos end)
{
bch2_trans_iter_init(trans, iter, btree, end, BTREE_ITER_intent);
struct bkey_s_c k = bch2_btree_iter_peek_prev(trans, iter);
int ret = bkey_err(k);
if (ret)
goto err;
bch2_btree_iter_advance(trans, iter);
k = bch2_btree_iter_peek_slot(trans, iter);
ret = bkey_err(k);
if (ret)
goto err;
BUG_ON(k.k->type != KEY_TYPE_deleted);
if (bkey_gt(k.k->p, end)) {
ret = bch_err_throw(trans->c, ENOSPC_btree_slot);
goto err;
}
return 0;
err:
bch2_trans_iter_exit(trans, iter);
return ret;
}
void bch2_trans_commit_hook(struct btree_trans *trans,
struct btree_trans_commit_hook *h)
{
h->next = trans->hooks;
trans->hooks = h;
}
int bch2_btree_insert_nonextent(struct btree_trans *trans,
enum btree_id btree, struct bkey_i *k,
enum btree_iter_update_trigger_flags flags)
{
struct btree_iter iter;
int ret;
bch2_trans_iter_init(trans, &iter, btree, k->k.p,
BTREE_ITER_cached|
BTREE_ITER_not_extents|
BTREE_ITER_intent);
ret = bch2_btree_iter_traverse(trans, &iter) ?:
bch2_trans_update(trans, &iter, k, flags);
bch2_trans_iter_exit(trans, &iter);
return ret;
}
int bch2_btree_insert_trans(struct btree_trans *trans, enum btree_id id,
struct bkey_i *k, enum btree_iter_update_trigger_flags flags)
{
struct btree_iter iter;
bch2_trans_iter_init(trans, &iter, id, bkey_start_pos(&k->k),
BTREE_ITER_intent|flags);
int ret = bch2_btree_iter_traverse(trans, &iter) ?:
bch2_trans_update(trans, &iter, k, flags);
bch2_trans_iter_exit(trans, &iter);
return ret;
}
/**
* bch2_btree_insert - insert a key into the given btree
* @c: pointer to struct bch_fs
* @id: btree to insert into
* @k: key to insert
* @disk_res: must be non-NULL whenever inserting or potentially
* splitting data extents
* @flags: transaction commit flags
* @iter_flags: btree iter update trigger flags
*
* Returns: 0 on success, error code on failure
*/
int bch2_btree_insert(struct bch_fs *c, enum btree_id id, struct bkey_i *k,
struct disk_reservation *disk_res, int flags,
enum btree_iter_update_trigger_flags iter_flags)
{
return bch2_trans_commit_do(c, disk_res, NULL, flags,
bch2_btree_insert_trans(trans, id, k, iter_flags));
}
int bch2_btree_delete_at(struct btree_trans *trans,
struct btree_iter *iter, unsigned update_flags)
{
struct bkey_i *k = bch2_trans_kmalloc(trans, sizeof(*k));
int ret = PTR_ERR_OR_ZERO(k);
if (ret)
return ret;
bkey_init(&k->k);
k->k.p = iter->pos;
return bch2_trans_update(trans, iter, k, update_flags);
}
int bch2_btree_delete(struct btree_trans *trans,
enum btree_id btree, struct bpos pos,
unsigned update_flags)
{
struct btree_iter iter;
int ret;
bch2_trans_iter_init(trans, &iter, btree, pos,
BTREE_ITER_cached|
BTREE_ITER_intent);
ret = bch2_btree_iter_traverse(trans, &iter) ?:
bch2_btree_delete_at(trans, &iter, update_flags);
bch2_trans_iter_exit(trans, &iter);
return ret;
}
int bch2_btree_delete_range_trans(struct btree_trans *trans, enum btree_id id,
struct bpos start, struct bpos end,
unsigned update_flags,
u64 *journal_seq)
{
u32 restart_count = trans->restart_count;
struct btree_iter iter;
struct bkey_s_c k;
int ret = 0;
bch2_trans_iter_init(trans, &iter, id, start, BTREE_ITER_intent);
while ((k = bch2_btree_iter_peek_max(trans, &iter, end)).k) {
struct disk_reservation disk_res =
bch2_disk_reservation_init(trans->c, 0);
struct bkey_i delete;
ret = bkey_err(k);
if (ret)
goto err;
bkey_init(&delete.k);
/*
* This could probably be more efficient for extents:
*/
/*
* For extents, iter.pos won't necessarily be the same as
* bkey_start_pos(k.k) (for non extents they always will be the
* same). It's important that we delete starting from iter.pos
* because the range we want to delete could start in the middle
* of k.
*
* (bch2_btree_iter_peek() does guarantee that iter.pos >=
* bkey_start_pos(k.k)).
*/
delete.k.p = iter.pos;
if (iter.flags & BTREE_ITER_is_extents)
bch2_key_resize(&delete.k,
bpos_min(end, k.k->p).offset -
iter.pos.offset);
ret = bch2_trans_update(trans, &iter, &delete, update_flags) ?:
bch2_trans_commit(trans, &disk_res, journal_seq,
BCH_TRANS_COMMIT_no_enospc);
bch2_disk_reservation_put(trans->c, &disk_res);
err:
/*
* the bch2_trans_begin() call is in a weird place because we
* need to call it after every transaction commit, to avoid path
* overflow, but don't want to call it if the delete operation
* is a no-op and we have no work to do:
*/
bch2_trans_begin(trans);
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
ret = 0;
if (ret)
break;
}
bch2_trans_iter_exit(trans, &iter);
return ret ?: trans_was_restarted(trans, restart_count);
}
/*
* bch2_btree_delete_range - delete everything within a given range
*
* Range is a half open interval - [start, end)
*/
int bch2_btree_delete_range(struct bch_fs *c, enum btree_id id,
struct bpos start, struct bpos end,
unsigned update_flags,
u64 *journal_seq)
{
int ret = bch2_trans_run(c,
bch2_btree_delete_range_trans(trans, id, start, end,
update_flags, journal_seq));
if (ret == -BCH_ERR_transaction_restart_nested)
ret = 0;
return ret;
}
int bch2_btree_bit_mod_iter(struct btree_trans *trans, struct btree_iter *iter, bool set)
{
struct bkey_i *k = bch2_trans_kmalloc(trans, sizeof(*k));
int ret = PTR_ERR_OR_ZERO(k);
if (ret)
return ret;
bkey_init(&k->k);
k->k.type = set ? KEY_TYPE_set : KEY_TYPE_deleted;
k->k.p = iter->pos;
if (iter->flags & BTREE_ITER_is_extents)
bch2_key_resize(&k->k, 1);
return bch2_trans_update(trans, iter, k, 0);
}
int bch2_btree_bit_mod(struct btree_trans *trans, enum btree_id btree,
struct bpos pos, bool set)
{
struct btree_iter iter;
bch2_trans_iter_init(trans, &iter, btree, pos, BTREE_ITER_intent);
int ret = bch2_btree_iter_traverse(trans, &iter) ?:
bch2_btree_bit_mod_iter(trans, &iter, set);
bch2_trans_iter_exit(trans, &iter);
return ret;
}
int bch2_btree_bit_mod_buffered(struct btree_trans *trans, enum btree_id btree,
struct bpos pos, bool set)
{
struct bkey_i k;
bkey_init(&k.k);
k.k.type = set ? KEY_TYPE_set : KEY_TYPE_deleted;
k.k.p = pos;
return bch2_trans_update_buffered(trans, btree, &k);
}
static int __bch2_trans_log_str(struct btree_trans *trans, const char *str, unsigned len)
{
unsigned u64s = DIV_ROUND_UP(len, sizeof(u64));
struct jset_entry *e = bch2_trans_jset_entry_alloc(trans, jset_u64s(u64s));
int ret = PTR_ERR_OR_ZERO(e);
if (ret)
return ret;
struct jset_entry_log *l = container_of(e, struct jset_entry_log, entry);
journal_entry_init(e, BCH_JSET_ENTRY_log, 0, 1, u64s);
memcpy_and_pad(l->d, u64s * sizeof(u64), str, len, 0);
return 0;
}
int bch2_trans_log_str(struct btree_trans *trans, const char *str)
{
return __bch2_trans_log_str(trans, str, strlen(str));
}
int bch2_trans_log_msg(struct btree_trans *trans, struct printbuf *buf)
{
int ret = buf->allocation_failure ? -BCH_ERR_ENOMEM_trans_log_msg : 0;
if (ret)
return ret;
return __bch2_trans_log_str(trans, buf->buf, buf->pos);
}
int bch2_trans_log_bkey(struct btree_trans *trans, enum btree_id btree,
unsigned level, struct bkey_i *k)
{
struct jset_entry *e = bch2_trans_jset_entry_alloc(trans, jset_u64s(k->k.u64s));
int ret = PTR_ERR_OR_ZERO(e);
if (ret)
return ret;
journal_entry_init(e, BCH_JSET_ENTRY_log_bkey, btree, level, k->k.u64s);
bkey_copy(e->start, k);
return 0;
}
__printf(3, 0)
static int
__bch2_fs_log_msg(struct bch_fs *c, unsigned commit_flags, const char *fmt,
va_list args)
{
struct printbuf buf = PRINTBUF;
prt_vprintf(&buf, fmt, args);
unsigned u64s = DIV_ROUND_UP(buf.pos, sizeof(u64));
int ret = buf.allocation_failure ? -BCH_ERR_ENOMEM_trans_log_msg : 0;
if (ret)
goto err;
if (!test_bit(JOURNAL_running, &c->journal.flags)) {
ret = darray_make_room(&c->journal.early_journal_entries, jset_u64s(u64s));
if (ret)
goto err;
struct jset_entry_log *l = (void *) &darray_top(c->journal.early_journal_entries);
journal_entry_init(&l->entry, BCH_JSET_ENTRY_log, 0, 1, u64s);
memcpy_and_pad(l->d, u64s * sizeof(u64), buf.buf, buf.pos, 0);
c->journal.early_journal_entries.nr += jset_u64s(u64s);
} else {
ret = bch2_trans_commit_do(c, NULL, NULL, commit_flags,
bch2_trans_log_msg(trans, &buf));
}
err:
printbuf_exit(&buf);
return ret;
}
__printf(2, 3)
int bch2_fs_log_msg(struct bch_fs *c, const char *fmt, ...)
{
va_list args;
int ret;
va_start(args, fmt);
ret = __bch2_fs_log_msg(c, 0, fmt, args);
va_end(args);
return ret;
}
/*
* Use for logging messages during recovery to enable reserved space and avoid
* blocking.
*/
__printf(2, 3)
int bch2_journal_log_msg(struct bch_fs *c, const char *fmt, ...)
{
va_list args;
int ret;
va_start(args, fmt);
ret = __bch2_fs_log_msg(c, BCH_WATERMARK_reclaim, fmt, args);
va_end(args);
return ret;
}

@ -1,429 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_UPDATE_H
#define _BCACHEFS_BTREE_UPDATE_H
#include "btree_iter.h"
#include "journal.h"
#include "snapshot.h"
struct bch_fs;
struct btree;
void bch2_btree_node_prep_for_write(struct btree_trans *,
struct btree_path *, struct btree *);
bool bch2_btree_bset_insert_key(struct btree_trans *, struct btree_path *,
struct btree *, struct btree_node_iter *,
struct bkey_i *);
int bch2_btree_node_flush0(struct journal *, struct journal_entry_pin *, u64);
int bch2_btree_node_flush1(struct journal *, struct journal_entry_pin *, u64);
void bch2_btree_add_journal_pin(struct bch_fs *, struct btree *, u64);
void bch2_btree_insert_key_leaf(struct btree_trans *, struct btree_path *,
struct bkey_i *, u64);
#define BCH_TRANS_COMMIT_FLAGS() \
x(no_enospc, "don't check for enospc") \
x(no_check_rw, "don't attempt to take a ref on c->writes") \
x(no_journal_res, "don't take a journal reservation, instead " \
"pin journal entry referred to by trans->journal_res.seq") \
x(journal_reclaim, "operation required for journal reclaim; may return error " \
"instead of deadlocking if BCH_WATERMARK_reclaim not specified")\
x(skip_accounting_apply, "we're in journal replay - accounting updates have already been applied")
enum __bch_trans_commit_flags {
/* First bits for bch_watermark: */
__BCH_TRANS_COMMIT_FLAGS_START = BCH_WATERMARK_BITS,
#define x(n, ...) __BCH_TRANS_COMMIT_##n,
BCH_TRANS_COMMIT_FLAGS()
#undef x
};
enum bch_trans_commit_flags {
#define x(n, ...) BCH_TRANS_COMMIT_##n = BIT(__BCH_TRANS_COMMIT_##n),
BCH_TRANS_COMMIT_FLAGS()
#undef x
};
void bch2_trans_commit_flags_to_text(struct printbuf *, enum bch_trans_commit_flags);
int bch2_btree_delete_at(struct btree_trans *, struct btree_iter *, unsigned);
int bch2_btree_delete(struct btree_trans *, enum btree_id, struct bpos, unsigned);
int bch2_btree_insert_nonextent(struct btree_trans *, enum btree_id,
struct bkey_i *, enum btree_iter_update_trigger_flags);
int bch2_btree_insert_trans(struct btree_trans *, enum btree_id, struct bkey_i *,
enum btree_iter_update_trigger_flags);
int bch2_btree_insert(struct bch_fs *, enum btree_id, struct bkey_i *, struct
disk_reservation *, int flags, enum
btree_iter_update_trigger_flags iter_flags);
int bch2_btree_delete_range_trans(struct btree_trans *, enum btree_id,
struct bpos, struct bpos, unsigned, u64 *);
int bch2_btree_delete_range(struct bch_fs *, enum btree_id,
struct bpos, struct bpos, unsigned, u64 *);
int bch2_btree_bit_mod_iter(struct btree_trans *, struct btree_iter *, bool);
int bch2_btree_bit_mod(struct btree_trans *, enum btree_id, struct bpos, bool);
int bch2_btree_bit_mod_buffered(struct btree_trans *, enum btree_id, struct bpos, bool);
static inline int bch2_btree_delete_at_buffered(struct btree_trans *trans,
enum btree_id btree, struct bpos pos)
{
return bch2_btree_bit_mod_buffered(trans, btree, pos, false);
}
int __bch2_insert_snapshot_whiteouts(struct btree_trans *, enum btree_id,
struct bpos, snapshot_id_list *);
/*
* For use when splitting extents in existing snapshots:
*
* If @old_pos is an interior snapshot node, iterate over descendant snapshot
* nodes: for every descendant snapshot in which @old_pos is overwritten and
* not visible, emit a whiteout at @new_pos.
*/
static inline int bch2_insert_snapshot_whiteouts(struct btree_trans *trans,
enum btree_id btree,
struct bpos old_pos,
struct bpos new_pos)
{
BUG_ON(old_pos.snapshot != new_pos.snapshot);
if (!btree_type_has_snapshots(btree) ||
bkey_eq(old_pos, new_pos))
return 0;
snapshot_id_list s;
int ret = bch2_get_snapshot_overwrites(trans, btree, old_pos, &s);
if (ret)
return ret;
return s.nr
? __bch2_insert_snapshot_whiteouts(trans, btree, new_pos, &s)
: 0;
}
int bch2_trans_update_extent_overwrite(struct btree_trans *, struct btree_iter *,
enum btree_iter_update_trigger_flags,
struct bkey_s_c, struct bkey_s_c);
int bch2_bkey_get_empty_slot(struct btree_trans *, struct btree_iter *,
enum btree_id, struct bpos);
int __must_check bch2_trans_update_ip(struct btree_trans *, struct btree_iter *,
struct bkey_i *, enum btree_iter_update_trigger_flags,
unsigned long);
static inline int __must_check
bch2_trans_update(struct btree_trans *trans, struct btree_iter *iter,
struct bkey_i *k, enum btree_iter_update_trigger_flags flags)
{
return bch2_trans_update_ip(trans, iter, k, flags, _THIS_IP_);
}
static inline void *btree_trans_subbuf_base(struct btree_trans *trans,
struct btree_trans_subbuf *buf)
{
return (u64 *) trans->mem + buf->base;
}
static inline void *btree_trans_subbuf_top(struct btree_trans *trans,
struct btree_trans_subbuf *buf)
{
return (u64 *) trans->mem + buf->base + buf->u64s;
}
void *__bch2_trans_subbuf_alloc(struct btree_trans *,
struct btree_trans_subbuf *,
unsigned);
static inline void *
bch2_trans_subbuf_alloc(struct btree_trans *trans,
struct btree_trans_subbuf *buf,
unsigned u64s)
{
if (buf->u64s + u64s > buf->size)
return __bch2_trans_subbuf_alloc(trans, buf, u64s);
void *p = btree_trans_subbuf_top(trans, buf);
buf->u64s += u64s;
return p;
}
static inline struct jset_entry *btree_trans_journal_entries_start(struct btree_trans *trans)
{
return btree_trans_subbuf_base(trans, &trans->journal_entries);
}
static inline struct jset_entry *btree_trans_journal_entries_top(struct btree_trans *trans)
{
return btree_trans_subbuf_top(trans, &trans->journal_entries);
}
static inline struct jset_entry *
bch2_trans_jset_entry_alloc(struct btree_trans *trans, unsigned u64s)
{
return bch2_trans_subbuf_alloc(trans, &trans->journal_entries, u64s);
}
int bch2_btree_insert_clone_trans(struct btree_trans *, enum btree_id, struct bkey_i *);
int bch2_btree_write_buffer_insert_err(struct bch_fs *, enum btree_id, struct bkey_i *);
static inline int __must_check bch2_trans_update_buffered(struct btree_trans *trans,
enum btree_id btree,
struct bkey_i *k)
{
kmsan_check_memory(k, bkey_bytes(&k->k));
EBUG_ON(k->k.u64s > BTREE_WRITE_BUFERED_U64s_MAX);
if (unlikely(!btree_type_uses_write_buffer(btree))) {
int ret = bch2_btree_write_buffer_insert_err(trans->c, btree, k);
dump_stack();
return ret;
}
/*
* Most updates skip the btree write buffer until journal replay is
* finished because synchronization with journal replay relies on having
* a btree node locked - if we're overwriting a key in the journal that
* journal replay hasn't yet replayed, we have to mark it as
* overwritten.
*
* But accounting updates don't overwrite, they're deltas, and they have
* to be flushed to the btree strictly in order for journal replay to be
* able to tell which updates need to be applied:
*/
if (k->k.type != KEY_TYPE_accounting &&
unlikely(trans->journal_replay_not_finished))
return bch2_btree_insert_clone_trans(trans, btree, k);
struct jset_entry *e = bch2_trans_jset_entry_alloc(trans, jset_u64s(k->k.u64s));
int ret = PTR_ERR_OR_ZERO(e);
if (ret)
return ret;
journal_entry_init(e, BCH_JSET_ENTRY_write_buffer_keys, btree, 0, k->k.u64s);
bkey_copy(e->start, k);
return 0;
}
void bch2_trans_commit_hook(struct btree_trans *,
struct btree_trans_commit_hook *);
int __bch2_trans_commit(struct btree_trans *, unsigned);
int bch2_trans_log_str(struct btree_trans *, const char *);
int bch2_trans_log_msg(struct btree_trans *, struct printbuf *);
int bch2_trans_log_bkey(struct btree_trans *, enum btree_id, unsigned, struct bkey_i *);
__printf(2, 3) int bch2_fs_log_msg(struct bch_fs *, const char *, ...);
__printf(2, 3) int bch2_journal_log_msg(struct bch_fs *, const char *, ...);
/**
* bch2_trans_commit - insert keys at given iterator positions
*
* This is the main entry point for btree updates.
*
* Return values:
* -EROFS: filesystem read only
* -EIO: journal or btree node IO error
*/
static inline int bch2_trans_commit(struct btree_trans *trans,
struct disk_reservation *disk_res,
u64 *journal_seq,
unsigned flags)
{
trans->disk_res = disk_res;
trans->journal_seq = journal_seq;
return __bch2_trans_commit(trans, flags);
}
#define commit_do(_trans, _disk_res, _journal_seq, _flags, _do) \
lockrestart_do(_trans, _do ?: bch2_trans_commit(_trans, (_disk_res),\
(_journal_seq), (_flags)))
#define nested_commit_do(_trans, _disk_res, _journal_seq, _flags, _do) \
nested_lockrestart_do(_trans, _do ?: bch2_trans_commit(_trans, (_disk_res),\
(_journal_seq), (_flags)))
#define bch2_trans_commit_do(_c, _disk_res, _journal_seq, _flags, _do) \
bch2_trans_run(_c, commit_do(trans, _disk_res, _journal_seq, _flags, _do))
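/*
 * Illustrative sketch, not part of the removed source: the commit_do()
 * family chains the update and the commit with the GNU C binary "?:"
 * operator. "a ?: b" yields a when a is nonzero and only evaluates b
 * otherwise, so with error codes it reads "run the update, and commit only
 * if it returned 0". step() and run_then_commit() are hypothetical names,
 * and the retry loop of lockrestart_do() is deliberately left out. This
 * needs GCC or Clang, since "?:" is an extension.
 */

```c
#include <assert.h>

static int calls;	/* counts how many steps actually ran */

/* Hypothetical step that records the call and returns the given code. */
static int step(int ret)
{
	calls++;
	return ret;
}

/* The second step (the "commit") runs only when the first returns 0. */
static int run_then_commit(int do_ret, int commit_ret)
{
	return step(do_ret) ?: step(commit_ret);
}
```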
#define trans_for_each_update(_trans, _i) \
for (struct btree_insert_entry *_i = (_trans)->updates; \
(_i) < (_trans)->updates + (_trans)->nr_updates; \
(_i)++)
static inline void bch2_trans_reset_updates(struct btree_trans *trans)
{
trans_for_each_update(trans, i)
bch2_path_put(trans, i->path, true);
trans->nr_updates = 0;
trans->journal_entries.u64s = 0;
trans->journal_entries.size = 0;
trans->accounting.u64s = 0;
trans->accounting.size = 0;
trans->hooks = NULL;
trans->extra_disk_res = 0;
}
static __always_inline struct bkey_i *__bch2_bkey_make_mut_noupdate(struct btree_trans *trans, struct bkey_s_c k,
unsigned type, unsigned min_bytes)
{
unsigned bytes = max_t(unsigned, min_bytes, bkey_bytes(k.k));
struct bkey_i *mut;
if (type && k.k->type != type)
return ERR_PTR(-ENOENT);
/* extra padding for varint_decode_fast... */
mut = bch2_trans_kmalloc_nomemzero(trans, bytes + 8);
if (!IS_ERR(mut)) {
bkey_reassemble(mut, k);
if (unlikely(bytes > bkey_bytes(k.k))) {
memset((void *) mut + bkey_bytes(k.k), 0,
bytes - bkey_bytes(k.k));
mut->k.u64s = DIV_ROUND_UP(bytes, sizeof(u64));
}
}
return mut;
}
static __always_inline struct bkey_i *bch2_bkey_make_mut_noupdate(struct btree_trans *trans, struct bkey_s_c k)
{
return __bch2_bkey_make_mut_noupdate(trans, k, 0, 0);
}
#define bch2_bkey_make_mut_noupdate_typed(_trans, _k, _type) \
bkey_i_to_##_type(__bch2_bkey_make_mut_noupdate(_trans, _k, \
KEY_TYPE_##_type, sizeof(struct bkey_i_##_type)))
static inline struct bkey_i *__bch2_bkey_make_mut(struct btree_trans *trans, struct btree_iter *iter,
struct bkey_s_c *k,
enum btree_iter_update_trigger_flags flags,
unsigned type, unsigned min_bytes)
{
struct bkey_i *mut = __bch2_bkey_make_mut_noupdate(trans, *k, type, min_bytes);
int ret;
if (IS_ERR(mut))
return mut;
ret = bch2_trans_update(trans, iter, mut, flags);
if (ret)
return ERR_PTR(ret);
*k = bkey_i_to_s_c(mut);
return mut;
}
static inline struct bkey_i *bch2_bkey_make_mut(struct btree_trans *trans,
struct btree_iter *iter, struct bkey_s_c *k,
enum btree_iter_update_trigger_flags flags)
{
return __bch2_bkey_make_mut(trans, iter, k, flags, 0, 0);
}
#define bch2_bkey_make_mut_typed(_trans, _iter, _k, _flags, _type) \
bkey_i_to_##_type(__bch2_bkey_make_mut(_trans, _iter, _k, _flags,\
KEY_TYPE_##_type, sizeof(struct bkey_i_##_type)))
static inline struct bkey_i *__bch2_bkey_get_mut_noupdate(struct btree_trans *trans,
struct btree_iter *iter,
unsigned btree_id, struct bpos pos,
enum btree_iter_update_trigger_flags flags,
unsigned type, unsigned min_bytes)
{
struct bkey_s_c k = __bch2_bkey_get_iter(trans, iter,
btree_id, pos, flags|BTREE_ITER_intent, type);
struct bkey_i *ret = IS_ERR(k.k)
? ERR_CAST(k.k)
: __bch2_bkey_make_mut_noupdate(trans, k, 0, min_bytes);
if (IS_ERR(ret))
bch2_trans_iter_exit(trans, iter);
return ret;
}
static inline struct bkey_i *bch2_bkey_get_mut_noupdate(struct btree_trans *trans,
struct btree_iter *iter,
unsigned btree_id, struct bpos pos,
enum btree_iter_update_trigger_flags flags)
{
return __bch2_bkey_get_mut_noupdate(trans, iter, btree_id, pos, flags, 0, 0);
}
static inline struct bkey_i *__bch2_bkey_get_mut(struct btree_trans *trans,
struct btree_iter *iter,
unsigned btree_id, struct bpos pos,
enum btree_iter_update_trigger_flags flags,
unsigned type, unsigned min_bytes)
{
struct bkey_i *mut = __bch2_bkey_get_mut_noupdate(trans, iter,
btree_id, pos, flags|BTREE_ITER_intent, type, min_bytes);
int ret;
if (IS_ERR(mut))
return mut;
ret = bch2_trans_update(trans, iter, mut, flags);
if (ret) {
bch2_trans_iter_exit(trans, iter);
return ERR_PTR(ret);
}
return mut;
}
static inline struct bkey_i *bch2_bkey_get_mut_minsize(struct btree_trans *trans,
struct btree_iter *iter,
unsigned btree_id, struct bpos pos,
enum btree_iter_update_trigger_flags flags,
unsigned min_bytes)
{
return __bch2_bkey_get_mut(trans, iter, btree_id, pos, flags, 0, min_bytes);
}
static inline struct bkey_i *bch2_bkey_get_mut(struct btree_trans *trans,
struct btree_iter *iter,
unsigned btree_id, struct bpos pos,
enum btree_iter_update_trigger_flags flags)
{
return __bch2_bkey_get_mut(trans, iter, btree_id, pos, flags, 0, 0);
}
#define bch2_bkey_get_mut_typed(_trans, _iter, _btree_id, _pos, _flags, _type)\
bkey_i_to_##_type(__bch2_bkey_get_mut(_trans, _iter, \
_btree_id, _pos, _flags, \
KEY_TYPE_##_type, sizeof(struct bkey_i_##_type)))
static inline struct bkey_i *__bch2_bkey_alloc(struct btree_trans *trans, struct btree_iter *iter,
enum btree_iter_update_trigger_flags flags,
unsigned type, unsigned val_size)
{
struct bkey_i *k = bch2_trans_kmalloc(trans, sizeof(*k) + val_size);
int ret;
if (IS_ERR(k))
return k;
bkey_init(&k->k);
k->k.p = iter->pos;
k->k.type = type;
set_bkey_val_bytes(&k->k, val_size);
ret = bch2_trans_update(trans, iter, k, flags);
if (unlikely(ret))
return ERR_PTR(ret);
return k;
}
#define bch2_bkey_alloc(_trans, _iter, _flags, _type) \
bkey_i_to_##_type(__bch2_bkey_alloc(_trans, _iter, _flags, \
KEY_TYPE_##_type, sizeof(struct bch_##_type)))
#endif /* _BCACHEFS_BTREE_UPDATE_H */

File diff suppressed because it is too large

@ -1,364 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_UPDATE_INTERIOR_H
#define _BCACHEFS_BTREE_UPDATE_INTERIOR_H
#include "btree_cache.h"
#include "btree_locking.h"
#include "btree_update.h"
#define BTREE_UPDATE_NODES_MAX ((BTREE_MAX_DEPTH - 2) * 2 + GC_MERGE_NODES)
#define BTREE_UPDATE_JOURNAL_RES (BTREE_UPDATE_NODES_MAX * (BKEY_BTREE_PTR_U64s_MAX + 1))
int bch2_btree_node_check_topology(struct btree_trans *, struct btree *);
#define BTREE_UPDATE_MODES() \
x(none) \
x(node) \
x(root) \
x(update)
enum btree_update_mode {
#define x(n) BTREE_UPDATE_##n,
BTREE_UPDATE_MODES()
#undef x
};
/*
* Tracks an in progress split/rewrite of a btree node and the update to the
* parent node:
*
* When we split/rewrite a node, we do all the updates in memory without
* waiting for any writes to complete - we allocate the new node(s) and update
* the parent node, possibly recursively up to the root.
*
* The end result is that we have one or more new nodes being written -
* possibly several, if there were multiple splits - and then a write (updating
* an interior node) which will make all these new nodes visible.
*
* Additionally, as we split/rewrite nodes we free the old nodes - but the old
* nodes can't be freed (their space on disk can't be reclaimed) until the
* update to the interior node that makes the new node visible completes -
* until then, the old nodes are still reachable on disk.
*
*/
struct btree_update {
struct closure cl;
struct bch_fs *c;
u64 start_time;
unsigned long ip_started;
struct list_head list;
struct list_head unwritten_list;
enum btree_update_mode mode;
enum bch_trans_commit_flags flags;
unsigned nodes_written:1;
unsigned took_gc_lock:1;
enum btree_id btree_id;
struct bpos node_start;
struct bpos node_end;
enum btree_node_rewrite_reason node_needed_rewrite;
u16 node_written;
u16 node_sectors;
u16 node_remaining;
unsigned update_level_start;
unsigned update_level_end;
struct disk_reservation disk_res;
/*
* BTREE_UPDATE_node:
* The update that made the new nodes visible was a regular update to an
* existing interior node - @b. We can't write out the update to @b
* until the new nodes we created are finished writing, so we block @b
* from writing by putting this btree_interior update on the
* @b->write_blocked list with @write_blocked_list:
*/
struct btree *b;
struct list_head write_blocked_list;
/*
* We may be freeing nodes that were dirty, and thus had journal entries
* pinned: we need to transfer the oldest of those pins to the
* btree_update operation, and release it when the new node(s)
* are all persistent and reachable:
*/
struct journal_entry_pin journal;
/* Preallocated nodes we reserve when we start the update: */
struct prealloc_nodes {
struct btree *b[BTREE_UPDATE_NODES_MAX];
unsigned nr;
} prealloc_nodes[2];
/* Nodes being freed: */
struct keylist old_keys;
u64 _old_keys[BTREE_UPDATE_NODES_MAX *
BKEY_BTREE_PTR_U64s_MAX];
/* Nodes being added: */
struct keylist new_keys;
u64 _new_keys[BTREE_UPDATE_NODES_MAX *
BKEY_BTREE_PTR_U64s_MAX];
/* New nodes, that will be made reachable by this update: */
struct btree *new_nodes[BTREE_UPDATE_NODES_MAX];
unsigned nr_new_nodes;
struct btree *old_nodes[BTREE_UPDATE_NODES_MAX];
__le64 old_nodes_seq[BTREE_UPDATE_NODES_MAX];
unsigned nr_old_nodes;
open_bucket_idx_t open_buckets[BTREE_UPDATE_NODES_MAX *
BCH_REPLICAS_MAX];
open_bucket_idx_t nr_open_buckets;
unsigned journal_u64s;
u64 journal_entries[BTREE_UPDATE_JOURNAL_RES];
/* Only here to reduce stack usage on recursive splits: */
struct keylist parent_keys;
/*
* Enough room for btree_split's keys without realloc - btree node
* pointers never have crc/compression info, so we only need to account
* for the pointers for three keys
*/
u64 inline_keys[BKEY_BTREE_PTR_U64s_MAX * 3];
};
struct btree *__bch2_btree_node_alloc_replacement(struct btree_update *,
struct btree_trans *,
struct btree *,
struct bkey_format);
int bch2_btree_split_leaf(struct btree_trans *, btree_path_idx_t, unsigned);
int bch2_btree_increase_depth(struct btree_trans *, btree_path_idx_t, unsigned);
int __bch2_foreground_maybe_merge(struct btree_trans *, btree_path_idx_t,
unsigned, unsigned, enum btree_node_sibling);
static inline int bch2_foreground_maybe_merge_sibling(struct btree_trans *trans,
btree_path_idx_t path_idx,
unsigned level, unsigned flags,
enum btree_node_sibling sib)
{
struct btree_path *path = trans->paths + path_idx;
struct btree *b;
EBUG_ON(!btree_node_locked(path, level));
if (static_branch_unlikely(&bch2_btree_node_merging_disabled))
return 0;
b = path->l[level].b;
if (b->sib_u64s[sib] > trans->c->btree_foreground_merge_threshold)
return 0;
return __bch2_foreground_maybe_merge(trans, path_idx, level, flags, sib);
}
static inline int bch2_foreground_maybe_merge(struct btree_trans *trans,
btree_path_idx_t path,
unsigned level,
unsigned flags)
{
bch2_trans_verify_not_unlocked_or_in_restart(trans);
return bch2_foreground_maybe_merge_sibling(trans, path, level, flags,
btree_prev_sib) ?:
bch2_foreground_maybe_merge_sibling(trans, path, level, flags,
btree_next_sib);
}
int bch2_btree_node_rewrite(struct btree_trans *, struct btree_iter *,
struct btree *, unsigned, unsigned);
int bch2_btree_node_rewrite_key(struct btree_trans *,
enum btree_id, unsigned,
struct bkey_i *, unsigned);
int bch2_btree_node_rewrite_pos(struct btree_trans *,
enum btree_id, unsigned,
struct bpos, unsigned, unsigned);
int bch2_btree_node_rewrite_key_get_iter(struct btree_trans *,
struct btree *, unsigned);
void bch2_btree_node_rewrite_async(struct bch_fs *, struct btree *);
int bch2_btree_node_update_key(struct btree_trans *, struct btree_iter *,
struct btree *, struct bkey_i *,
unsigned, bool);
int bch2_btree_node_update_key_get_iter(struct btree_trans *, struct btree *,
struct bkey_i *, unsigned, bool);
void bch2_btree_set_root_for_read(struct bch_fs *, struct btree *);
int bch2_btree_root_alloc_fake_trans(struct btree_trans *, enum btree_id, unsigned);
void bch2_btree_root_alloc_fake(struct bch_fs *, enum btree_id, unsigned);
static inline unsigned btree_update_reserve_required(struct bch_fs *c,
struct btree *b)
{
unsigned depth = btree_node_root(c, b)->c.level + 1;
/*
* Number of nodes we might have to allocate in a worst case btree
* split operation - we split all the way up to the root, then allocate
* a new root, unless we're already at max depth:
*/
if (depth < BTREE_MAX_DEPTH)
return (depth - b->c.level) * 2 + 1;
else
return (depth - b->c.level) * 2 - 1;
}
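/*
 * Illustrative sketch, not part of the removed source: the arithmetic in
 * btree_update_reserve_required() worked through with plain integers.
 * reserve_required() is a hypothetical helper that takes the root level,
 * the level of the node being split and the maximum depth as parameters.
 * The value 4 used for the maximum depth in the usage note is an assumption,
 * not something this header states.
 */

```c
#include <assert.h>

/*
 * Worst case for a split that propagates from node_level up to the root:
 * two new nodes per level, plus one for a new root while the tree can
 * still grow; at maximum depth the formula drops to 2 * levels - 1.
 */
static unsigned reserve_required(unsigned root_level, unsigned node_level,
				 unsigned max_depth)
{
	unsigned depth = root_level + 1;

	assert(node_level <= root_level);

	if (depth < max_depth)
		return (depth - node_level) * 2 + 1;
	return (depth - node_level) * 2 - 1;
}
```

/*
 * Under the assumed maximum depth of 4, a leaf split in a tree whose root is
 * at level 1 reserves (2 - 0) * 2 + 1 = 5 nodes, and a leaf split in a tree
 * whose root is at level 3 reserves (4 - 0) * 2 - 1 = 7 nodes.
 */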
static inline void btree_node_reset_sib_u64s(struct btree *b)
{
b->sib_u64s[0] = b->nr.live_u64s;
b->sib_u64s[1] = b->nr.live_u64s;
}
static inline void *btree_data_end(struct btree *b)
{
return (void *) b->data + btree_buf_bytes(b);
}
static inline struct bkey_packed *unwritten_whiteouts_start(struct btree *b)
{
return (void *) ((u64 *) btree_data_end(b) - b->whiteout_u64s);
}
static inline struct bkey_packed *unwritten_whiteouts_end(struct btree *b)
{
return btree_data_end(b);
}
static inline void *write_block(struct btree *b)
{
return (void *) b->data + (b->written << 9);
}
static inline bool __btree_addr_written(struct btree *b, void *p)
{
return p < write_block(b);
}
static inline bool bset_written(struct btree *b, struct bset *i)
{
return __btree_addr_written(b, i);
}
static inline bool bkey_written(struct btree *b, struct bkey_packed *k)
{
return __btree_addr_written(b, k);
}
static inline ssize_t __bch2_btree_u64s_remaining(struct btree *b, void *end)
{
ssize_t used = bset_byte_offset(b, end) / sizeof(u64) +
b->whiteout_u64s;
ssize_t total = btree_buf_bytes(b) >> 3;
/* Always leave one extra u64 for bch2_varint_decode: */
used++;
return total - used;
}
static inline size_t bch2_btree_keys_u64s_remaining(struct btree *b)
{
ssize_t remaining = __bch2_btree_u64s_remaining(b,
btree_bkey_last(b, bset_tree_last(b)));
BUG_ON(remaining < 0);
if (bset_written(b, btree_bset_last(b)))
return 0;
return remaining;
}
#define BTREE_WRITE_SET_U64s_BITS 9
static inline unsigned btree_write_set_buffer(struct btree *b)
{
/*
* Could buffer up larger amounts of keys for btrees with larger keys,
* pending benchmarking:
*/
return 8 << BTREE_WRITE_SET_U64s_BITS;
}
static inline struct btree_node_entry *want_new_bset(struct bch_fs *c, struct btree *b)
{
struct bset_tree *t = bset_tree_last(b);
struct btree_node_entry *bne = max(write_block(b),
(void *) btree_bkey_last(b, t));
ssize_t remaining_space =
__bch2_btree_u64s_remaining(b, bne->keys.start);
if (unlikely(bset_written(b, bset(b, t)))) {
if (b->written + block_sectors(c) <= btree_sectors(c))
return bne;
} else {
if (unlikely(bset_u64s(t) * sizeof(u64) > btree_write_set_buffer(b)) &&
remaining_space > (ssize_t) (btree_write_set_buffer(b) >> 3))
return bne;
}
return NULL;
}
static inline void push_whiteout(struct btree *b, struct bpos pos)
{
struct bkey_packed k;
BUG_ON(bch2_btree_keys_u64s_remaining(b) < BKEY_U64s);
EBUG_ON(btree_node_just_written(b));
if (!bkey_pack_pos(&k, pos, b)) {
struct bkey *u = (void *) &k;
bkey_init(u);
u->p = pos;
}
k.needs_whiteout = true;
b->whiteout_u64s += k.u64s;
bkey_p_copy(unwritten_whiteouts_start(b), &k);
}
/*
* write lock must be held on @b (else the dirty bset that we were going to
* insert into could be written out from under us)
*/
static inline bool bch2_btree_node_insert_fits(struct btree *b, unsigned u64s)
{
if (unlikely(btree_node_need_rewrite(b)))
return false;
return u64s <= bch2_btree_keys_u64s_remaining(b);
}
void bch2_btree_updates_to_text(struct printbuf *, struct bch_fs *);
bool bch2_btree_interior_updates_flush(struct bch_fs *);
void bch2_journal_entry_to_btree_root(struct bch_fs *, struct jset_entry *);
struct jset_entry *bch2_btree_roots_to_journal_entries(struct bch_fs *,
struct jset_entry *, unsigned long);
void bch2_async_btree_node_rewrites_flush(struct bch_fs *);
void bch2_do_pending_node_rewrites(struct bch_fs *);
void bch2_free_pending_node_rewrites(struct bch_fs *);
void bch2_btree_reserve_cache_to_text(struct printbuf *, struct bch_fs *);
void bch2_fs_btree_interior_update_exit(struct bch_fs *);
void bch2_fs_btree_interior_update_init_early(struct bch_fs *);
int bch2_fs_btree_interior_update_init(struct bch_fs *);
#endif /* _BCACHEFS_BTREE_UPDATE_INTERIOR_H */

@ -1,893 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
#include "bcachefs.h"
#include "bkey_buf.h"
#include "btree_locking.h"
#include "btree_update.h"
#include "btree_update_interior.h"
#include "btree_write_buffer.h"
#include "disk_accounting.h"
#include "enumerated_ref.h"
#include "error.h"
#include "extents.h"
#include "journal.h"
#include "journal_io.h"
#include "journal_reclaim.h"
#include <linux/prefetch.h>
#include <linux/sort.h>
static int bch2_btree_write_buffer_journal_flush(struct journal *,
struct journal_entry_pin *, u64);
static inline bool __wb_key_ref_cmp(const struct wb_key_ref *l, const struct wb_key_ref *r)
{
return (cmp_int(l->hi, r->hi) ?:
cmp_int(l->mi, r->mi) ?:
cmp_int(l->lo, r->lo)) >= 0;
}
static inline bool wb_key_ref_cmp(const struct wb_key_ref *l, const struct wb_key_ref *r)
{
#ifdef CONFIG_X86_64
int cmp;
asm("mov (%[l]), %%rax;"
"sub (%[r]), %%rax;"
"mov 8(%[l]), %%rax;"
"sbb 8(%[r]), %%rax;"
"mov 16(%[l]), %%rax;"
"sbb 16(%[r]), %%rax;"
: "=@ccae" (cmp)
: [l] "r" (l), [r] "r" (r)
: "rax", "cc");
EBUG_ON(cmp != __wb_key_ref_cmp(l, r));
return cmp;
#else
return __wb_key_ref_cmp(l, r);
#endif
}
static int wb_key_seq_cmp(const void *_l, const void *_r)
{
const struct btree_write_buffered_key *l = _l;
const struct btree_write_buffered_key *r = _r;
return cmp_int(l->journal_seq, r->journal_seq);
}
/* Compare excluding idx, the low 24 bits: */
static inline bool wb_key_eq(const void *_l, const void *_r)
{
const struct wb_key_ref *l = _l;
const struct wb_key_ref *r = _r;
return !((l->hi ^ r->hi)|
(l->mi ^ r->mi)|
((l->lo >> 24) ^ (r->lo >> 24)));
}
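/*
 * Illustrative sketch, not part of the removed source: the XOR-and-OR
 * equality test used by wb_key_eq(), on a hypothetical three-word key whose
 * low 24 bits of "lo" hold an index that must not affect the result. The
 * real struct wb_key_ref layout is not shown in this file, so struct key_ref
 * here is an assumption based on the comment above wb_key_eq().
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key: bits 0..23 of lo are an index, the rest is the sort key. */
struct key_ref {
	uint64_t lo, mi, hi;
};

/*
 * Any differing bit leaves a 1 in the corresponding XOR; OR-ing the three
 * results gives zero only when all compared bits match. Shifting lo right
 * by 24 discards the index bits before comparing.
 */
static bool key_ref_eq(const struct key_ref *l, const struct key_ref *r)
{
	return !((l->hi ^ r->hi) |
		 (l->mi ^ r->mi) |
		 ((l->lo >> 24) ^ (r->lo >> 24)));
}
```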
static noinline void wb_sort(struct wb_key_ref *base, size_t num)
{
size_t n = num, a = num / 2;
if (!a) /* num < 2: nothing to sort */
return;
for (;;) {
size_t b, c, d;
if (a) /* Building heap: sift down --a */
--a;
else if (--n) /* Sorting: Extract root to --n */
swap(base[0], base[n]);
else /* Sort complete */
break;
/*
* Sift element at "a" down into heap. This is the
* "bottom-up" variant, which significantly reduces
* calls to cmp_func(): we find the sift-down path all
* the way to the leaves (one compare per level), then
* backtrack to find where to insert the target element.
*
* Because elements tend to sift down close to the leaves,
* this uses fewer compares than doing two per level
* on the way down. (A bit more than half as many on
* average, 3/4 worst-case.)
*/
for (b = a; c = 2*b + 1, (d = c + 1) < n;)
b = wb_key_ref_cmp(base + c, base + d) ? c : d;
if (d == n) /* Special case last leaf with no sibling */
b = c;
/* Now backtrack from "b" to the correct location for "a" */
while (b != a && wb_key_ref_cmp(base + a, base + b))
b = (b - 1) / 2;
c = b; /* Where "a" belongs */
while (b != a) { /* Shift it into place */
b = (b - 1) / 2;
swap(base[b], base[c]);
}
}
}
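/*
 * Illustrative sketch, not part of the removed source: the same bottom-up
 * heapsort as wb_sort(), specialised to plain uint64_t values so it can be
 * compiled and run on its own. heapsort_u64() and swap_u64() are
 * hypothetical names; "l >= r" plays the role of wb_key_ref_cmp().
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static void swap_u64(uint64_t *x, uint64_t *y)
{
	uint64_t t = *x;

	*x = *y;
	*y = t;
}

/* Sort base[0..num) in ascending order. */
static void heapsort_u64(uint64_t *base, size_t num)
{
	size_t n = num, a = num / 2;

	if (!a)				/* fewer than two elements */
		return;

	for (;;) {
		size_t b, c, d;

		if (a)			/* building the heap */
			--a;
		else if (--n)		/* sorting: move the max to the end */
			swap_u64(&base[0], &base[n]);
		else
			break;

		/* Follow the larger child down to a leaf: one compare per level. */
		for (b = a; c = 2 * b + 1, (d = c + 1) < n;)
			b = base[c] >= base[d] ? c : d;
		if (d == n)		/* last leaf has no sibling */
			b = c;

		/* Walk back up until element "a" is smaller than the slot at b. */
		while (b != a && base[a] >= base[b])
			b = (b - 1) / 2;

		/* Rotate element "a" into place along the path. */
		c = b;
		while (b != a) {
			b = (b - 1) / 2;
			swap_u64(&base[b], &base[c]);
		}
	}
}
```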
static noinline int wb_flush_one_slowpath(struct btree_trans *trans,
struct btree_iter *iter,
struct btree_write_buffered_key *wb)
{
struct btree_path *path = btree_iter_path(trans, iter);
bch2_btree_node_unlock_write(trans, path, path->l[0].b);
trans->journal_res.seq = wb->journal_seq;
return bch2_trans_update(trans, iter, &wb->k,
BTREE_UPDATE_internal_snapshot_node) ?:
bch2_trans_commit(trans, NULL, NULL,
BCH_TRANS_COMMIT_no_enospc|
BCH_TRANS_COMMIT_no_check_rw|
BCH_TRANS_COMMIT_no_journal_res|
BCH_TRANS_COMMIT_journal_reclaim);
}
static inline int wb_flush_one(struct btree_trans *trans, struct btree_iter *iter,
struct btree_write_buffered_key *wb,
bool *write_locked,
bool *accounting_accumulated,
size_t *fast)
{
struct btree_path *path;
int ret;
EBUG_ON(!wb->journal_seq);
EBUG_ON(!trans->c->btree_write_buffer.flushing.pin.seq);
EBUG_ON(trans->c->btree_write_buffer.flushing.pin.seq > wb->journal_seq);
ret = bch2_btree_iter_traverse(trans, iter);
if (ret)
return ret;
if (!*accounting_accumulated && wb->k.k.type == KEY_TYPE_accounting) {
struct bkey u;
struct bkey_s_c k = bch2_btree_path_peek_slot_exact(btree_iter_path(trans, iter), &u);
if (k.k->type == KEY_TYPE_accounting)
bch2_accounting_accumulate(bkey_i_to_accounting(&wb->k),
bkey_s_c_to_accounting(k));
}
*accounting_accumulated = true;
/*
* We can't clone a path that has write locks: unshare it now, before
* set_pos and traverse():
*/
if (btree_iter_path(trans, iter)->ref > 1)
iter->path = __bch2_btree_path_make_mut(trans, iter->path, true, _THIS_IP_);
path = btree_iter_path(trans, iter);
if (!*write_locked) {
ret = bch2_btree_node_lock_write(trans, path, &path->l[0].b->c);
if (ret)
return ret;
bch2_btree_node_prep_for_write(trans, path, path->l[0].b);
*write_locked = true;
}
if (unlikely(!bch2_btree_node_insert_fits(path->l[0].b, wb->k.k.u64s))) {
*write_locked = false;
return wb_flush_one_slowpath(trans, iter, wb);
}
EBUG_ON(!bpos_eq(wb->k.k.p, path->pos));
bch2_btree_insert_key_leaf(trans, path, &wb->k, wb->journal_seq);
(*fast)++;
return 0;
}
/*
* Update a btree with a write buffered key using the journal seq of the
* original write buffer insert.
*
* It is not safe to rejournal the key once it has been inserted into the write
* buffer because that may break recovery ordering. For example, the key may
* have already been modified in the active write buffer in a seq that comes
* before the current transaction. If we were to journal this key again and
* crash, recovery would process updates in the wrong order.
*/
static int
btree_write_buffered_insert(struct btree_trans *trans,
struct btree_write_buffered_key *wb)
{
struct btree_iter iter;
int ret;
bch2_trans_iter_init(trans, &iter, wb->btree, bkey_start_pos(&wb->k.k),
BTREE_ITER_cached|BTREE_ITER_intent);
trans->journal_res.seq = wb->journal_seq;
ret = bch2_btree_iter_traverse(trans, &iter) ?:
bch2_trans_update(trans, &iter, &wb->k,
BTREE_UPDATE_internal_snapshot_node);
bch2_trans_iter_exit(trans, &iter);
return ret;
}
static void move_keys_from_inc_to_flushing(struct btree_write_buffer *wb)
{
struct bch_fs *c = container_of(wb, struct bch_fs, btree_write_buffer);
struct journal *j = &c->journal;
if (!wb->inc.keys.nr)
return;
bch2_journal_pin_add(j, wb->inc.keys.data[0].journal_seq, &wb->flushing.pin,
bch2_btree_write_buffer_journal_flush);
darray_resize(&wb->flushing.keys, min_t(size_t, 1U << 20, wb->flushing.keys.nr + wb->inc.keys.nr));
darray_resize(&wb->sorted, wb->flushing.keys.size);
if (!wb->flushing.keys.nr && wb->sorted.size >= wb->inc.keys.nr) {
swap(wb->flushing.keys, wb->inc.keys);
goto out;
}
size_t nr = min(darray_room(wb->flushing.keys),
wb->sorted.size - wb->flushing.keys.nr);
nr = min(nr, wb->inc.keys.nr);
memcpy(&darray_top(wb->flushing.keys),
wb->inc.keys.data,
sizeof(wb->inc.keys.data[0]) * nr);
memmove(wb->inc.keys.data,
wb->inc.keys.data + nr,
sizeof(wb->inc.keys.data[0]) * (wb->inc.keys.nr - nr));
wb->flushing.keys.nr += nr;
wb->inc.keys.nr -= nr;
out:
if (!wb->inc.keys.nr)
bch2_journal_pin_drop(j, &wb->inc.pin);
else
bch2_journal_pin_update(j, wb->inc.keys.data[0].journal_seq, &wb->inc.pin,
bch2_btree_write_buffer_journal_flush);
if (j->watermark) {
spin_lock(&j->lock);
bch2_journal_set_watermark(j);
spin_unlock(&j->lock);
}
BUG_ON(wb->sorted.size < wb->flushing.keys.nr);
}
int bch2_btree_write_buffer_insert_err(struct bch_fs *c,
enum btree_id btree, struct bkey_i *k)
{
struct printbuf buf = PRINTBUF;
prt_printf(&buf, "attempting to do write buffer update on non wb btree=");
bch2_btree_id_to_text(&buf, btree);
prt_str(&buf, "\n");
bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(k));
bch2_fs_inconsistent(c, "%s", buf.buf);
printbuf_exit(&buf);
return -EROFS;
}
static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
{
struct bch_fs *c = trans->c;
struct journal *j = &c->journal;
struct btree_write_buffer *wb = &c->btree_write_buffer;
struct btree_iter iter = {};
size_t overwritten = 0, fast = 0, slowpath = 0, could_not_insert = 0;
bool write_locked = false;
bool accounting_replay_done = test_bit(BCH_FS_accounting_replay_done, &c->flags);
int ret = 0;
ret = bch2_journal_error(&c->journal);
if (ret)
return ret;
bch2_trans_unlock(trans);
bch2_trans_begin(trans);
mutex_lock(&wb->inc.lock);
move_keys_from_inc_to_flushing(wb);
mutex_unlock(&wb->inc.lock);
for (size_t i = 0; i < wb->flushing.keys.nr; i++) {
wb->sorted.data[i].idx = i;
wb->sorted.data[i].btree = wb->flushing.keys.data[i].btree;
memcpy(&wb->sorted.data[i].pos, &wb->flushing.keys.data[i].k.k.p, sizeof(struct bpos));
}
wb->sorted.nr = wb->flushing.keys.nr;
/*
* We first sort so that we can detect and skip redundant updates, and
* then we attempt to flush in sorted btree order, as this is most
* efficient.
*
* However, since we're not flushing in the order they appear in the
* journal we won't be able to drop our journal pin until everything is
* flushed - which means this could deadlock the journal if we weren't
* passing BCH_TRANS_COMMIT_journal_reclaim. This causes the update to fail
* if it would block taking a journal reservation.
*
* If that happens, simply skip the key so we can optimistically insert
* as many keys as possible in the fast path.
*/
wb_sort(wb->sorted.data, wb->sorted.nr);
darray_for_each(wb->sorted, i) {
struct btree_write_buffered_key *k = &wb->flushing.keys.data[i->idx];
if (unlikely(!btree_type_uses_write_buffer(k->btree))) {
ret = bch2_btree_write_buffer_insert_err(trans->c, k->btree, &k->k);
goto err;
}
for (struct wb_key_ref *n = i + 1; n < min(i + 4, &darray_top(wb->sorted)); n++)
prefetch(&wb->flushing.keys.data[n->idx]);
BUG_ON(!k->journal_seq);
if (!accounting_replay_done &&
k->k.k.type == KEY_TYPE_accounting) {
slowpath++;
continue;
}
if (i + 1 < &darray_top(wb->sorted) &&
wb_key_eq(i, i + 1)) {
struct btree_write_buffered_key *n = &wb->flushing.keys.data[i[1].idx];
if (k->k.k.type == KEY_TYPE_accounting &&
n->k.k.type == KEY_TYPE_accounting)
bch2_accounting_accumulate(bkey_i_to_accounting(&n->k),
bkey_i_to_s_c_accounting(&k->k));
overwritten++;
n->journal_seq = min_t(u64, n->journal_seq, k->journal_seq);
k->journal_seq = 0;
continue;
}
if (write_locked) {
struct btree_path *path = btree_iter_path(trans, &iter);
if (path->btree_id != i->btree ||
bpos_gt(k->k.k.p, path->l[0].b->key.k.p)) {
bch2_btree_node_unlock_write(trans, path, path->l[0].b);
write_locked = false;
ret = lockrestart_do(trans,
bch2_btree_iter_traverse(trans, &iter) ?:
bch2_foreground_maybe_merge(trans, iter.path, 0,
BCH_WATERMARK_reclaim|
BCH_TRANS_COMMIT_journal_reclaim|
BCH_TRANS_COMMIT_no_check_rw|
BCH_TRANS_COMMIT_no_enospc));
if (ret)
goto err;
}
}
if (!iter.path || iter.btree_id != k->btree) {
bch2_trans_iter_exit(trans, &iter);
bch2_trans_iter_init(trans, &iter, k->btree, k->k.k.p,
BTREE_ITER_intent|BTREE_ITER_all_snapshots);
}
bch2_btree_iter_set_pos(trans, &iter, k->k.k.p);
btree_iter_path(trans, &iter)->preserve = false;
bool accounting_accumulated = false;
do {
if (race_fault()) {
ret = bch_err_throw(c, journal_reclaim_would_deadlock);
break;
}
ret = wb_flush_one(trans, &iter, k, &write_locked,
&accounting_accumulated, &fast);
if (!write_locked)
bch2_trans_begin(trans);
} while (bch2_err_matches(ret, BCH_ERR_transaction_restart));
if (!ret) {
k->journal_seq = 0;
} else if (ret == -BCH_ERR_journal_reclaim_would_deadlock) {
slowpath++;
ret = 0;
} else
break;
}
if (write_locked) {
struct btree_path *path = btree_iter_path(trans, &iter);
bch2_btree_node_unlock_write(trans, path, path->l[0].b);
}
bch2_trans_iter_exit(trans, &iter);
if (ret)
goto err;
if (slowpath) {
/*
* Flush in the order they were present in the journal, so that
* we can release journal pins:
* The fastpath zapped the seq of keys that were successfully flushed so
* we can skip those here.
*/
trace_and_count(c, write_buffer_flush_slowpath, trans, slowpath, wb->flushing.keys.nr);
sort_nonatomic(wb->flushing.keys.data,
wb->flushing.keys.nr,
sizeof(wb->flushing.keys.data[0]),
wb_key_seq_cmp, NULL);
darray_for_each(wb->flushing.keys, i) {
if (!i->journal_seq)
continue;
if (!accounting_replay_done &&
i->k.k.type == KEY_TYPE_accounting) {
could_not_insert++;
continue;
}
if (!could_not_insert)
bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
bch2_btree_write_buffer_journal_flush);
bch2_trans_begin(trans);
ret = commit_do(trans, NULL, NULL,
BCH_WATERMARK_reclaim|
BCH_TRANS_COMMIT_journal_reclaim|
BCH_TRANS_COMMIT_no_check_rw|
BCH_TRANS_COMMIT_no_enospc|
BCH_TRANS_COMMIT_no_journal_res ,
btree_write_buffered_insert(trans, i));
if (ret)
goto err;
i->journal_seq = 0;
}
/*
* If journal replay hasn't finished with accounting keys we
* can't flush accounting keys at all - condense them and leave
* them for next time.
*
* Q: Can the write buffer overflow?
* A: Shouldn't be any actual risk. It's just new accounting
* updates that the write buffer can't flush, and those are only
* going to be generated by interior btree node updates as
* journal replay has to split/rewrite nodes to make room for
* its updates.
*
* And for those new accounting updates, updates to the same
* counters get accumulated as they're flushed from the journal
* to the write buffer - see the patch for eytzinger tree
* accumulated. So we could only overflow if the number of
* distinct counters touched somehow was very large.
*/
if (could_not_insert) {
struct btree_write_buffered_key *dst = wb->flushing.keys.data;
darray_for_each(wb->flushing.keys, i)
if (i->journal_seq)
*dst++ = *i;
wb->flushing.keys.nr = dst - wb->flushing.keys.data;
}
}
err:
if (ret || !could_not_insert) {
bch2_journal_pin_drop(j, &wb->flushing.pin);
wb->flushing.keys.nr = 0;
}
bch2_fs_fatal_err_on(ret, c, "%s", bch2_err_str(ret));
trace_write_buffer_flush(trans, wb->flushing.keys.nr, overwritten, fast, 0);
return ret;
}
static int bch2_journal_keys_to_write_buffer(struct bch_fs *c, struct journal_buf *buf)
{
struct journal_keys_to_wb dst;
int ret = 0;
bch2_journal_keys_to_write_buffer_start(c, &dst, le64_to_cpu(buf->data->seq));
for_each_jset_entry_type(entry, buf->data, BCH_JSET_ENTRY_write_buffer_keys) {
jset_entry_for_each_key(entry, k) {
ret = bch2_journal_key_to_wb(c, &dst, entry->btree_id, k);
if (ret)
goto out;
}
entry->type = BCH_JSET_ENTRY_btree_keys;
}
out:
ret = bch2_journal_keys_to_write_buffer_end(c, &dst) ?: ret;
return ret;
}
static int fetch_wb_keys_from_journal(struct bch_fs *c, u64 max_seq)
{
struct journal *j = &c->journal;
struct journal_buf *buf;
bool blocked;
int ret = 0;
while (!ret && (buf = bch2_next_write_buffer_flush_journal_buf(j, max_seq, &blocked))) {
ret = bch2_journal_keys_to_write_buffer(c, buf);
if (!blocked && !ret) {
spin_lock(&j->lock);
buf->need_flush_to_write_buffer = false;
spin_unlock(&j->lock);
}
mutex_unlock(&j->buf_lock);
if (blocked) {
bch2_journal_unblock(j);
break;
}
}
return ret;
}
static int btree_write_buffer_flush_seq(struct btree_trans *trans, u64 max_seq,
bool *did_work)
{
struct bch_fs *c = trans->c;
struct btree_write_buffer *wb = &c->btree_write_buffer;
int ret = 0, fetch_from_journal_err;
do {
bch2_trans_unlock(trans);
fetch_from_journal_err = fetch_wb_keys_from_journal(c, max_seq);
*did_work |= wb->inc.keys.nr || wb->flushing.keys.nr;
/*
* On memory allocation failure, bch2_btree_write_buffer_flush_locked()
* is not guaranteed to empty wb->inc:
*/
mutex_lock(&wb->flushing.lock);
ret = bch2_btree_write_buffer_flush_locked(trans);
mutex_unlock(&wb->flushing.lock);
} while (!ret &&
(fetch_from_journal_err ||
(wb->inc.pin.seq && wb->inc.pin.seq <= max_seq) ||
(wb->flushing.pin.seq && wb->flushing.pin.seq <= max_seq)));
return ret;
}
static int bch2_btree_write_buffer_journal_flush(struct journal *j,
struct journal_entry_pin *_pin, u64 seq)
{
struct bch_fs *c = container_of(j, struct bch_fs, journal);
bool did_work = false;
return bch2_trans_run(c, btree_write_buffer_flush_seq(trans, seq, &did_work));
}
int bch2_btree_write_buffer_flush_sync(struct btree_trans *trans)
{
struct bch_fs *c = trans->c;
bool did_work = false;
trace_and_count(c, write_buffer_flush_sync, trans, _RET_IP_);
return btree_write_buffer_flush_seq(trans, journal_cur_seq(&c->journal), &did_work);
}
/*
* The write buffer requires flushing when going RO: keys in the journal for the
* write buffer don't have a journal pin yet
*/
bool bch2_btree_write_buffer_flush_going_ro(struct bch_fs *c)
{
if (bch2_journal_error(&c->journal))
return false;
bool did_work = false;
bch2_trans_run(c, btree_write_buffer_flush_seq(trans,
journal_cur_seq(&c->journal), &did_work));
return did_work;
}
int bch2_btree_write_buffer_flush_nocheck_rw(struct btree_trans *trans)
{
struct bch_fs *c = trans->c;
struct btree_write_buffer *wb = &c->btree_write_buffer;
int ret = 0;
if (mutex_trylock(&wb->flushing.lock)) {
ret = bch2_btree_write_buffer_flush_locked(trans);
mutex_unlock(&wb->flushing.lock);
}
return ret;
}
int bch2_btree_write_buffer_tryflush(struct btree_trans *trans)
{
struct bch_fs *c = trans->c;
if (!enumerated_ref_tryget(&c->writes, BCH_WRITE_REF_btree_write_buffer))
return bch_err_throw(c, erofs_no_writes);
int ret = bch2_btree_write_buffer_flush_nocheck_rw(trans);
enumerated_ref_put(&c->writes, BCH_WRITE_REF_btree_write_buffer);
return ret;
}
/*
* In check and repair code, when checking references to write buffer btrees we
* need to issue a flush before we have a definitive error: this issues a flush
* if this is a key we haven't yet checked.
*/
int bch2_btree_write_buffer_maybe_flush(struct btree_trans *trans,
struct bkey_s_c referring_k,
struct bkey_buf *last_flushed)
{
struct bch_fs *c = trans->c;
struct bkey_buf tmp;
int ret = 0;
bch2_bkey_buf_init(&tmp);
if (!bkey_and_val_eq(referring_k, bkey_i_to_s_c(last_flushed->k))) {
if (trace_write_buffer_maybe_flush_enabled()) {
struct printbuf buf = PRINTBUF;
bch2_bkey_val_to_text(&buf, c, referring_k);
trace_write_buffer_maybe_flush(trans, _RET_IP_, buf.buf);
printbuf_exit(&buf);
}
bch2_bkey_buf_reassemble(&tmp, c, referring_k);
if (bkey_is_btree_ptr(referring_k.k)) {
bch2_trans_unlock(trans);
bch2_btree_interior_updates_flush(c);
}
ret = bch2_btree_write_buffer_flush_sync(trans);
if (ret)
goto err;
bch2_bkey_buf_copy(last_flushed, c, tmp.k);
/* can we avoid the unconditional restart? */
trace_and_count(c, trans_restart_write_buffer_flush, trans, _RET_IP_);
ret = bch_err_throw(c, transaction_restart_write_buffer_flush);
}
err:
bch2_bkey_buf_exit(&tmp, c);
return ret;
}
static void bch2_btree_write_buffer_flush_work(struct work_struct *work)
{
struct bch_fs *c = container_of(work, struct bch_fs, btree_write_buffer.flush_work);
struct btree_write_buffer *wb = &c->btree_write_buffer;
int ret;
mutex_lock(&wb->flushing.lock);
do {
ret = bch2_trans_run(c, bch2_btree_write_buffer_flush_locked(trans));
} while (!ret && bch2_btree_write_buffer_should_flush(c));
mutex_unlock(&wb->flushing.lock);
enumerated_ref_put(&c->writes, BCH_WRITE_REF_btree_write_buffer);
}
static void wb_accounting_sort(struct btree_write_buffer *wb)
{
eytzinger0_sort(wb->accounting.data, wb->accounting.nr,
sizeof(wb->accounting.data[0]),
wb_key_cmp, NULL);
}
int bch2_accounting_key_to_wb_slowpath(struct bch_fs *c, enum btree_id btree,
struct bkey_i_accounting *k)
{
struct btree_write_buffer *wb = &c->btree_write_buffer;
struct btree_write_buffered_key new = { .btree = btree };
bkey_copy(&new.k, &k->k_i);
int ret = darray_push(&wb->accounting, new);
if (ret)
return ret;
wb_accounting_sort(wb);
return 0;
}
int bch2_journal_key_to_wb_slowpath(struct bch_fs *c,
struct journal_keys_to_wb *dst,
enum btree_id btree, struct bkey_i *k)
{
struct btree_write_buffer *wb = &c->btree_write_buffer;
int ret;
retry:
ret = darray_make_room_gfp(&dst->wb->keys, 1, GFP_KERNEL);
if (!ret && dst->wb == &wb->flushing)
ret = darray_resize(&wb->sorted, wb->flushing.keys.size);
if (unlikely(ret)) {
if (dst->wb == &c->btree_write_buffer.flushing) {
mutex_unlock(&dst->wb->lock);
dst->wb = &c->btree_write_buffer.inc;
bch2_journal_pin_add(&c->journal, dst->seq, &dst->wb->pin,
bch2_btree_write_buffer_journal_flush);
goto retry;
}
return ret;
}
dst->room = darray_room(dst->wb->keys);
if (dst->wb == &wb->flushing)
dst->room = min(dst->room, wb->sorted.size - wb->flushing.keys.nr);
BUG_ON(!dst->room);
BUG_ON(!dst->seq);
struct btree_write_buffered_key *wb_k = &darray_top(dst->wb->keys);
wb_k->journal_seq = dst->seq;
wb_k->btree = btree;
bkey_copy(&wb_k->k, k);
dst->wb->keys.nr++;
dst->room--;
return 0;
}
void bch2_journal_keys_to_write_buffer_start(struct bch_fs *c, struct journal_keys_to_wb *dst, u64 seq)
{
struct btree_write_buffer *wb = &c->btree_write_buffer;
if (mutex_trylock(&wb->flushing.lock)) {
mutex_lock(&wb->inc.lock);
move_keys_from_inc_to_flushing(wb);
/*
* Attempt to skip wb->inc, and add keys directly to
* wb->flushing, saving us a copy later:
*/
if (!wb->inc.keys.nr) {
dst->wb = &wb->flushing;
} else {
mutex_unlock(&wb->flushing.lock);
dst->wb = &wb->inc;
}
} else {
mutex_lock(&wb->inc.lock);
dst->wb = &wb->inc;
}
dst->room = darray_room(dst->wb->keys);
if (dst->wb == &wb->flushing)
dst->room = min(dst->room, wb->sorted.size - wb->flushing.keys.nr);
dst->seq = seq;
bch2_journal_pin_add(&c->journal, seq, &dst->wb->pin,
bch2_btree_write_buffer_journal_flush);
darray_for_each(wb->accounting, i)
memset(&i->k.v, 0, bkey_val_bytes(&i->k.k));
}
int bch2_journal_keys_to_write_buffer_end(struct bch_fs *c, struct journal_keys_to_wb *dst)
{
struct btree_write_buffer *wb = &c->btree_write_buffer;
unsigned live_accounting_keys = 0;
int ret = 0;
darray_for_each(wb->accounting, i)
if (!bch2_accounting_key_is_zero(bkey_i_to_s_c_accounting(&i->k))) {
i->journal_seq = dst->seq;
live_accounting_keys++;
ret = __bch2_journal_key_to_wb(c, dst, i->btree, &i->k);
if (ret)
break;
}
if (live_accounting_keys * 2 < wb->accounting.nr) {
struct btree_write_buffered_key *dst = wb->accounting.data;
darray_for_each(wb->accounting, src)
if (!bch2_accounting_key_is_zero(bkey_i_to_s_c_accounting(&src->k)))
*dst++ = *src;
wb->accounting.nr = dst - wb->accounting.data;
wb_accounting_sort(wb);
}
if (!dst->wb->keys.nr)
bch2_journal_pin_drop(&c->journal, &dst->wb->pin);
if (bch2_btree_write_buffer_should_flush(c) &&
__enumerated_ref_tryget(&c->writes, BCH_WRITE_REF_btree_write_buffer) &&
!queue_work(system_dfl_wq, &c->btree_write_buffer.flush_work))
enumerated_ref_put(&c->writes, BCH_WRITE_REF_btree_write_buffer);
if (dst->wb == &wb->flushing)
mutex_unlock(&wb->flushing.lock);
mutex_unlock(&wb->inc.lock);
return ret;
}
static int wb_keys_resize(struct btree_write_buffer_keys *wb, size_t new_size)
{
if (wb->keys.size >= new_size)
return 0;
if (!mutex_trylock(&wb->lock))
return -EINTR;
int ret = darray_resize(&wb->keys, new_size);
mutex_unlock(&wb->lock);
return ret;
}
int bch2_btree_write_buffer_resize(struct bch_fs *c, size_t new_size)
{
struct btree_write_buffer *wb = &c->btree_write_buffer;
return wb_keys_resize(&wb->flushing, new_size) ?:
wb_keys_resize(&wb->inc, new_size);
}
void bch2_fs_btree_write_buffer_exit(struct bch_fs *c)
{
struct btree_write_buffer *wb = &c->btree_write_buffer;
BUG_ON((wb->inc.keys.nr || wb->flushing.keys.nr) &&
!bch2_journal_error(&c->journal));
darray_exit(&wb->accounting);
darray_exit(&wb->sorted);
darray_exit(&wb->flushing.keys);
darray_exit(&wb->inc.keys);
}
void bch2_fs_btree_write_buffer_init_early(struct bch_fs *c)
{
struct btree_write_buffer *wb = &c->btree_write_buffer;
mutex_init(&wb->inc.lock);
mutex_init(&wb->flushing.lock);
INIT_WORK(&wb->flush_work, bch2_btree_write_buffer_flush_work);
}
int bch2_fs_btree_write_buffer_init(struct bch_fs *c)
{
struct btree_write_buffer *wb = &c->btree_write_buffer;
/* Will be resized by journal as needed: */
unsigned initial_size = 1 << 16;
return darray_make_room(&wb->inc.keys, initial_size) ?:
darray_make_room(&wb->flushing.keys, initial_size) ?:
darray_make_room(&wb->sorted, initial_size);
}
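
The flush code above compacts a darray in place twice: once to keep only the keys that still carry a journal_seq, and once to drop accounting keys whose value is zero. The sketch below shows that read-pointer/write-pointer idiom in plain userspace C. `struct demo_key` and `compact_live_keys()` are hypothetical names used only for illustration; they are not part of bcachefs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct btree_write_buffered_key. */
struct demo_key {
	uint64_t journal_seq;	/* 0 means "already flushed, drop this key" */
	int	 payload;
};

/*
 * Compact keys[] in place: keep only entries with a nonzero journal_seq,
 * preserving their relative order. Returns the new element count.
 */
static size_t compact_live_keys(struct demo_key *keys, size_t nr)
{
	struct demo_key *dst = keys;

	assert(keys != NULL);

	for (struct demo_key *src = keys; src < keys + nr; src++)
		if (src->journal_seq)
			*dst++ = *src;

	return (size_t)(dst - keys);
}
```

Because dst never runs ahead of src, the copy is safe without a scratch buffer, and the caller only has to store the returned count back into its length field.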


@ -1,113 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_WRITE_BUFFER_H
#define _BCACHEFS_BTREE_WRITE_BUFFER_H
#include "bkey.h"
#include "disk_accounting.h"
static inline bool bch2_btree_write_buffer_should_flush(struct bch_fs *c)
{
struct btree_write_buffer *wb = &c->btree_write_buffer;
return wb->inc.keys.nr + wb->flushing.keys.nr > wb->inc.keys.size / 4;
}
static inline bool bch2_btree_write_buffer_must_wait(struct bch_fs *c)
{
struct btree_write_buffer *wb = &c->btree_write_buffer;
return wb->inc.keys.nr > wb->inc.keys.size * 3 / 4;
}
struct btree_trans;
int bch2_btree_write_buffer_flush_sync(struct btree_trans *);
bool bch2_btree_write_buffer_flush_going_ro(struct bch_fs *);
int bch2_btree_write_buffer_flush_nocheck_rw(struct btree_trans *);
int bch2_btree_write_buffer_tryflush(struct btree_trans *);
struct bkey_buf;
int bch2_btree_write_buffer_maybe_flush(struct btree_trans *, struct bkey_s_c, struct bkey_buf *);
struct journal_keys_to_wb {
struct btree_write_buffer_keys *wb;
size_t room;
u64 seq;
};
static inline int wb_key_cmp(const void *_l, const void *_r)
{
const struct btree_write_buffered_key *l = _l;
const struct btree_write_buffered_key *r = _r;
return cmp_int(l->btree, r->btree) ?: bpos_cmp(l->k.k.p, r->k.k.p);
}
int bch2_accounting_key_to_wb_slowpath(struct bch_fs *,
enum btree_id, struct bkey_i_accounting *);
static inline int bch2_accounting_key_to_wb(struct bch_fs *c,
enum btree_id btree, struct bkey_i_accounting *k)
{
struct btree_write_buffer *wb = &c->btree_write_buffer;
struct btree_write_buffered_key search;
search.btree = btree;
search.k.k.p = k->k.p;
unsigned idx = eytzinger0_find(wb->accounting.data, wb->accounting.nr,
sizeof(wb->accounting.data[0]),
wb_key_cmp, &search);
if (idx >= wb->accounting.nr)
return bch2_accounting_key_to_wb_slowpath(c, btree, k);
struct bkey_i_accounting *dst = bkey_i_to_accounting(&wb->accounting.data[idx].k);
bch2_accounting_accumulate(dst, accounting_i_to_s_c(k));
return 0;
}
int bch2_journal_key_to_wb_slowpath(struct bch_fs *,
struct journal_keys_to_wb *,
enum btree_id, struct bkey_i *);
static inline int __bch2_journal_key_to_wb(struct bch_fs *c,
struct journal_keys_to_wb *dst,
enum btree_id btree, struct bkey_i *k)
{
if (unlikely(!dst->room))
return bch2_journal_key_to_wb_slowpath(c, dst, btree, k);
struct btree_write_buffered_key *wb_k = &darray_top(dst->wb->keys);
wb_k->journal_seq = dst->seq;
wb_k->btree = btree;
bkey_copy(&wb_k->k, k);
dst->wb->keys.nr++;
dst->room--;
return 0;
}
static inline int bch2_journal_key_to_wb(struct bch_fs *c,
struct journal_keys_to_wb *dst,
enum btree_id btree, struct bkey_i *k)
{
if (unlikely(!btree_type_uses_write_buffer(btree))) {
int ret = bch2_btree_write_buffer_insert_err(c, btree, k);
dump_stack();
return ret;
}
EBUG_ON(!dst->seq);
return k->k.type == KEY_TYPE_accounting
? bch2_accounting_key_to_wb(c, btree, bkey_i_to_accounting(k))
: __bch2_journal_key_to_wb(c, dst, btree, k);
}
void bch2_journal_keys_to_write_buffer_start(struct bch_fs *, struct journal_keys_to_wb *, u64);
int bch2_journal_keys_to_write_buffer_end(struct bch_fs *, struct journal_keys_to_wb *);
int bch2_btree_write_buffer_resize(struct bch_fs *, size_t);
void bch2_fs_btree_write_buffer_exit(struct bch_fs *);
void bch2_fs_btree_write_buffer_init_early(struct bch_fs *);
int bch2_fs_btree_write_buffer_init(struct bch_fs *);
#endif /* _BCACHEFS_BTREE_WRITE_BUFFER_H */
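
The two predicates at the top of this header are watermarks relative to wb->inc.keys.size. As a worked example, with the initial size of 1 << 16 set in bch2_fs_btree_write_buffer_init(), a flush is wanted once more than 16384 keys are buffered in total, and writers must wait once wb->inc alone holds more than 49152 keys. The sketch below restates that arithmetic in userspace C; `struct demo_wb` and the `demo_*` functions are hypothetical and only mirror the fields the real predicates read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened view of the counters the predicates look at. */
struct demo_wb {
	size_t inc_nr;		/* wb->inc.keys.nr */
	size_t inc_size;	/* wb->inc.keys.size */
	size_t flushing_nr;	/* wb->flushing.keys.nr */
};

/* More than a quarter of the buffer is in use: schedule a flush. */
static bool demo_should_flush(const struct demo_wb *wb)
{
	assert(wb != NULL);
	return wb->inc_nr + wb->flushing_nr > wb->inc_size / 4;
}

/* More than three quarters of wb->inc is in use: callers must wait. */
static bool demo_must_wait(const struct demo_wb *wb)
{
	assert(wb != NULL);
	return wb->inc_nr > wb->inc_size * 3 / 4;
}
```

Both comparisons are strict, so a buffer sitting exactly on a watermark does not trigger it.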


@ -1,59 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_WRITE_BUFFER_TYPES_H
#define _BCACHEFS_BTREE_WRITE_BUFFER_TYPES_H
#include "darray.h"
#include "journal_types.h"
#define BTREE_WRITE_BUFERED_VAL_U64s_MAX 4
#define BTREE_WRITE_BUFERED_U64s_MAX (BKEY_U64s + BTREE_WRITE_BUFERED_VAL_U64s_MAX)
struct wb_key_ref {
union {
struct {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
unsigned idx:24;
u8 pos[sizeof(struct bpos)];
enum btree_id btree:8;
#else
enum btree_id btree:8;
u8 pos[sizeof(struct bpos)];
unsigned idx:24;
#endif
} __packed;
struct {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
u64 lo;
u64 mi;
u64 hi;
#else
u64 hi;
u64 mi;
u64 lo;
#endif
};
};
};
struct btree_write_buffered_key {
enum btree_id btree:8;
u64 journal_seq:56;
__BKEY_PADDED(k, BTREE_WRITE_BUFERED_VAL_U64s_MAX);
};
struct btree_write_buffer_keys {
DARRAY(struct btree_write_buffered_key) keys;
struct journal_entry_pin pin;
struct mutex lock;
};
struct btree_write_buffer {
DARRAY(struct wb_key_ref) sorted;
struct btree_write_buffer_keys inc;
struct btree_write_buffer_keys flushing;
struct work_struct flush_work;
DARRAY(struct btree_write_buffered_key) accounting;
};
#endif /* _BCACHEFS_BTREE_WRITE_BUFFER_TYPES_H */

File diff suppressed because it is too large


@ -1,369 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
* Code for manipulating bucket marks for garbage collection.
*
* Copyright 2014 Datera, Inc.
*/
#ifndef _BUCKETS_H
#define _BUCKETS_H
#include "buckets_types.h"
#include "extents.h"
#include "sb-members.h"
static inline u64 sector_to_bucket(const struct bch_dev *ca, sector_t s)
{
return div_u64(s, ca->mi.bucket_size);
}
static inline sector_t bucket_to_sector(const struct bch_dev *ca, size_t b)
{
return ((sector_t) b) * ca->mi.bucket_size;
}
static inline sector_t bucket_remainder(const struct bch_dev *ca, sector_t s)
{
u32 remainder;
div_u64_rem(s, ca->mi.bucket_size, &remainder);
return remainder;
}
static inline u64 sector_to_bucket_and_offset(const struct bch_dev *ca, sector_t s, u32 *offset)
{
return div_u64_rem(s, ca->mi.bucket_size, offset);
}
#define for_each_bucket(_b, _buckets) \
for (_b = (_buckets)->b + (_buckets)->first_bucket; \
_b < (_buckets)->b + (_buckets)->nbuckets; _b++)
static inline void bucket_unlock(struct bucket *b)
{
BUILD_BUG_ON(!((union ulong_byte_assert) { .ulong = 1UL << BUCKET_LOCK_BITNR }).byte);
clear_bit_unlock(BUCKET_LOCK_BITNR, (void *) &b->lock);
smp_mb__after_atomic();
wake_up_bit((void *) &b->lock, BUCKET_LOCK_BITNR);
}
static inline void bucket_lock(struct bucket *b)
{
wait_on_bit_lock((void *) &b->lock, BUCKET_LOCK_BITNR,
TASK_UNINTERRUPTIBLE);
}
static inline struct bucket *gc_bucket(struct bch_dev *ca, size_t b)
{
return bucket_valid(ca, b)
? genradix_ptr(&ca->buckets_gc, b)
: NULL;
}
static inline struct bucket_gens *bucket_gens(struct bch_dev *ca)
{
return rcu_dereference_check(ca->bucket_gens,
lockdep_is_held(&ca->fs->state_lock));
}
static inline u8 *bucket_gen(struct bch_dev *ca, size_t b)
{
struct bucket_gens *gens = bucket_gens(ca);
if (b - gens->first_bucket >= gens->nbuckets_minus_first)
return NULL;
return gens->b + b;
}
static inline int bucket_gen_get_rcu(struct bch_dev *ca, size_t b)
{
u8 *gen = bucket_gen(ca, b);
return gen ? *gen : -1;
}
static inline int bucket_gen_get(struct bch_dev *ca, size_t b)
{
guard(rcu)();
return bucket_gen_get_rcu(ca, b);
}
static inline size_t PTR_BUCKET_NR(const struct bch_dev *ca,
const struct bch_extent_ptr *ptr)
{
return sector_to_bucket(ca, ptr->offset);
}
static inline struct bpos PTR_BUCKET_POS(const struct bch_dev *ca,
const struct bch_extent_ptr *ptr)
{
return POS(ptr->dev, PTR_BUCKET_NR(ca, ptr));
}
static inline struct bpos PTR_BUCKET_POS_OFFSET(const struct bch_dev *ca,
const struct bch_extent_ptr *ptr,
u32 *bucket_offset)
{
return POS(ptr->dev, sector_to_bucket_and_offset(ca, ptr->offset, bucket_offset));
}
static inline struct bucket *PTR_GC_BUCKET(struct bch_dev *ca,
const struct bch_extent_ptr *ptr)
{
return gc_bucket(ca, PTR_BUCKET_NR(ca, ptr));
}
static inline enum bch_data_type ptr_data_type(const struct bkey *k,
const struct bch_extent_ptr *ptr)
{
if (bkey_is_btree_ptr(k))
return BCH_DATA_btree;
return ptr->cached ? BCH_DATA_cached : BCH_DATA_user;
}
static inline s64 ptr_disk_sectors(s64 sectors, struct extent_ptr_decoded p)
{
EBUG_ON(sectors < 0);
return crc_is_compressed(p.crc)
? DIV_ROUND_UP_ULL(sectors * p.crc.compressed_size,
p.crc.uncompressed_size)
: sectors;
}
static inline int gen_cmp(u8 a, u8 b)
{
return (s8) (a - b);
}
static inline int gen_after(u8 a, u8 b)
{
return max(0, gen_cmp(a, b));
}
static inline int dev_ptr_stale_rcu(struct bch_dev *ca, const struct bch_extent_ptr *ptr)
{
int gen = bucket_gen_get_rcu(ca, PTR_BUCKET_NR(ca, ptr));
return gen < 0 ? gen : gen_after(gen, ptr->gen);
}
/**
* dev_ptr_stale() - check if a pointer points into a bucket that has been
* invalidated.
*/
static inline int dev_ptr_stale(struct bch_dev *ca, const struct bch_extent_ptr *ptr)
{
guard(rcu)();
return dev_ptr_stale_rcu(ca, ptr);
}
/* Device usage: */
void bch2_dev_usage_read_fast(struct bch_dev *, struct bch_dev_usage *);
static inline struct bch_dev_usage bch2_dev_usage_read(struct bch_dev *ca)
{
struct bch_dev_usage ret;
bch2_dev_usage_read_fast(ca, &ret);
return ret;
}
void bch2_dev_usage_full_read_fast(struct bch_dev *, struct bch_dev_usage_full *);
static inline struct bch_dev_usage_full bch2_dev_usage_full_read(struct bch_dev *ca)
{
struct bch_dev_usage_full ret;
bch2_dev_usage_full_read_fast(ca, &ret);
return ret;
}
void bch2_dev_usage_to_text(struct printbuf *, struct bch_dev *, struct bch_dev_usage_full *);
static inline u64 bch2_dev_buckets_reserved(struct bch_dev *ca, enum bch_watermark watermark)
{
s64 reserved = 0;
switch (watermark) {
case BCH_WATERMARK_NR:
BUG();
case BCH_WATERMARK_stripe:
reserved += ca->mi.nbuckets >> 6;
fallthrough;
case BCH_WATERMARK_normal:
reserved += ca->mi.nbuckets >> 6;
fallthrough;
case BCH_WATERMARK_copygc:
reserved += ca->nr_btree_reserve;
fallthrough;
case BCH_WATERMARK_btree:
reserved += ca->nr_btree_reserve;
fallthrough;
case BCH_WATERMARK_btree_copygc:
case BCH_WATERMARK_reclaim:
case BCH_WATERMARK_interior_updates:
break;
}
return reserved;
}
static inline u64 dev_buckets_free(struct bch_dev *ca,
struct bch_dev_usage usage,
enum bch_watermark watermark)
{
return max_t(s64, 0,
usage.buckets[BCH_DATA_free]-
ca->nr_open_buckets -
bch2_dev_buckets_reserved(ca, watermark));
}
static inline u64 __dev_buckets_available(struct bch_dev *ca,
struct bch_dev_usage usage,
enum bch_watermark watermark)
{
return max_t(s64, 0,
usage.buckets[BCH_DATA_free]
+ usage.buckets[BCH_DATA_cached]
+ usage.buckets[BCH_DATA_need_gc_gens]
+ usage.buckets[BCH_DATA_need_discard]
- ca->nr_open_buckets
- bch2_dev_buckets_reserved(ca, watermark));
}
static inline u64 dev_buckets_available(struct bch_dev *ca,
enum bch_watermark watermark)
{
return __dev_buckets_available(ca, bch2_dev_usage_read(ca), watermark);
}
/* Filesystem usage: */
struct bch_fs_usage_short
bch2_fs_usage_read_short(struct bch_fs *);
int bch2_bucket_ref_update(struct btree_trans *, struct bch_dev *,
struct bkey_s_c, const struct bch_extent_ptr *,
s64, enum bch_data_type, u8, u8, u32 *);
int bch2_check_fix_ptrs(struct btree_trans *,
enum btree_id, unsigned, struct bkey_s_c,
enum btree_iter_update_trigger_flags);
int bch2_trigger_extent(struct btree_trans *, enum btree_id, unsigned,
struct bkey_s_c, struct bkey_s,
enum btree_iter_update_trigger_flags);
int bch2_trigger_reservation(struct btree_trans *, enum btree_id, unsigned,
struct bkey_s_c, struct bkey_s,
enum btree_iter_update_trigger_flags);
#define trigger_run_overwrite_then_insert(_fn, _trans, _btree_id, _level, _old, _new, _flags)\
({ \
int ret = 0; \
\
if (_old.k->type) \
ret = _fn(_trans, _btree_id, _level, _old, _flags & ~BTREE_TRIGGER_insert); \
if (!ret && _new.k->type) \
ret = _fn(_trans, _btree_id, _level, _new.s_c, _flags & ~BTREE_TRIGGER_overwrite);\
ret; \
})
void bch2_trans_account_disk_usage_change(struct btree_trans *);
int bch2_trans_mark_metadata_bucket(struct btree_trans *, struct bch_dev *, u64,
enum bch_data_type, unsigned,
enum btree_iter_update_trigger_flags);
int bch2_trans_mark_dev_sb(struct bch_fs *, struct bch_dev *,
enum btree_iter_update_trigger_flags);
int bch2_trans_mark_dev_sbs_flags(struct bch_fs *,
enum btree_iter_update_trigger_flags);
int bch2_trans_mark_dev_sbs(struct bch_fs *);
bool bch2_is_superblock_bucket(struct bch_dev *, u64);
static inline const char *bch2_data_type_str(enum bch_data_type type)
{
return type < BCH_DATA_NR
? __bch2_data_types[type]
: "(invalid data type)";
}
/* disk reservations: */
static inline void bch2_disk_reservation_put(struct bch_fs *c,
struct disk_reservation *res)
{
if (res->sectors) {
this_cpu_sub(*c->online_reserved, res->sectors);
res->sectors = 0;
}
}
enum bch_reservation_flags {
BCH_DISK_RESERVATION_NOFAIL = 1 << 0,
BCH_DISK_RESERVATION_PARTIAL = 1 << 1,
};
int __bch2_disk_reservation_add(struct bch_fs *, struct disk_reservation *,
u64, enum bch_reservation_flags);
static inline int bch2_disk_reservation_add(struct bch_fs *c, struct disk_reservation *res,
u64 sectors, enum bch_reservation_flags flags)
{
#ifdef __KERNEL__
u64 old, new;
old = this_cpu_read(c->pcpu->sectors_available);
do {
if (sectors > old)
return __bch2_disk_reservation_add(c, res, sectors, flags);
new = old - sectors;
} while (!this_cpu_try_cmpxchg(c->pcpu->sectors_available, &old, new));
this_cpu_add(*c->online_reserved, sectors);
res->sectors += sectors;
return 0;
#else
return __bch2_disk_reservation_add(c, res, sectors, flags);
#endif
}
static inline struct disk_reservation
bch2_disk_reservation_init(struct bch_fs *c, unsigned nr_replicas)
{
return (struct disk_reservation) {
.sectors = 0,
#if 0
/* not used yet: */
.gen = c->capacity_gen,
#endif
.nr_replicas = nr_replicas,
};
}
static inline int bch2_disk_reservation_get(struct bch_fs *c,
struct disk_reservation *res,
u64 sectors, unsigned nr_replicas,
int flags)
{
*res = bch2_disk_reservation_init(c, nr_replicas);
return bch2_disk_reservation_add(c, res, sectors * nr_replicas, flags);
}
#define RESERVE_FACTOR 6
static inline u64 avail_factor(u64 r)
{
return div_u64(r << RESERVE_FACTOR, (1 << RESERVE_FACTOR) + 1);
}
void bch2_buckets_nouse_free(struct bch_fs *);
int bch2_buckets_nouse_alloc(struct bch_fs *);
int bch2_dev_buckets_resize(struct bch_fs *, struct bch_dev *, u64);
void bch2_dev_buckets_free(struct bch_dev *);
int bch2_dev_buckets_alloc(struct bch_fs *, struct bch_dev *);
#endif /* _BUCKETS_H */
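
gen_cmp() and gen_after() in the header above compare 8-bit bucket generations that are allowed to wrap around. The sketch below reproduces the same arithmetic in userspace C to show why a generation of 2 counts as "after" 254. The `demo_*` names are hypothetical, and the narrowing cast assumes the two's-complement wrap that GCC and Clang provide.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Signed distance from b to a in a circular 8-bit generation space.
 * The result is only meaningful while the two generations are within
 * 127 steps of each other. Relies on the compiler wrapping the
 * out-of-range conversion to int8_t modulo 256.
 */
static int demo_gen_cmp(uint8_t a, uint8_t b)
{
	return (int8_t)(a - b);
}

/* How many generations a is ahead of b, or 0 if it is not ahead. */
static int demo_gen_after(uint8_t a, uint8_t b)
{
	int d = demo_gen_cmp(a, b);

	return d > 0 ? d : 0;
}
```

This is the comparison dev_ptr_stale_rcu() relies on: a pointer is stale when the bucket's current generation is after the generation recorded in the pointer, even if the counter has wrapped in between.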


@ -1,100 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BUCKETS_TYPES_H
#define _BUCKETS_TYPES_H
#include "bcachefs_format.h"
#include "util.h"
#define BUCKET_JOURNAL_SEQ_BITS 16
/*
* Ugly hack alert:
*
* We need to cram a spinlock in a single byte, because that's what we have left
* in struct bucket, and we care about the size of these - during fsck, we need
* in memory state for every single bucket on every device.
*
* We used to do
* while (xchg(&b->lock, 1)) cpu_relax();
* but, it turns out not all architectures support xchg on a single byte.
*
* So now we use bit_spin_lock(), with fun games since we can't burn a whole
* ulong for this - we just need to make sure the lock bit always ends up in the
* first byte.
*/
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define BUCKET_LOCK_BITNR 0
#else
#define BUCKET_LOCK_BITNR (BITS_PER_LONG - 1)
#endif
union ulong_byte_assert {
ulong ulong;
u8 byte;
};
struct bucket {
u8 lock;
u8 gen_valid:1;
u8 data_type:7;
u8 gen;
u8 stripe_redundancy;
u32 stripe;
u32 dirty_sectors;
u32 cached_sectors;
u32 stripe_sectors;
} __aligned(sizeof(long));
struct bucket_gens {
struct rcu_head rcu;
u16 first_bucket;
size_t nbuckets;
size_t nbuckets_minus_first;
u8 b[] __counted_by(nbuckets);
};
/* Only info on bucket counts: */
struct bch_dev_usage {
u64 buckets[BCH_DATA_NR];
};
struct bch_dev_usage_full {
struct bch_dev_usage_type {
u64 buckets;
u64 sectors; /* _compressed_ sectors: */
/*
* XXX
* Why do we have this? Isn't it just buckets * bucket_size -
* sectors?
*/
u64 fragmented;
} d[BCH_DATA_NR];
};
struct bch_fs_usage_base {
u64 hidden;
u64 btree;
u64 data;
u64 cached;
u64 reserved;
u64 nr_inodes;
};
struct bch_fs_usage_short {
u64 capacity;
u64 used;
u64 free;
u64 nr_inodes;
};
/*
* A reservation for space on disk:
*/
struct disk_reservation {
u64 sectors;
u32 gen;
unsigned nr_replicas;
};
#endif /* _BUCKETS_TYPES_H */
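
The "ugly hack" comment above depends on the lock bit of an unsigned long always landing in the first byte in memory, which is why BUCKET_LOCK_BITNR differs by endianness. The sketch below is a userspace runtime analogue of the BUILD_BUG_ON() in bucket_unlock(). The `DEMO_*` and `demo_*` names are hypothetical, and `__BYTE_ORDER__` is assumed to be provided by the compiler, as it is with GCC and Clang.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define DEMO_LOCK_BITNR	0
#else
#define DEMO_LOCK_BITNR	(sizeof(unsigned long) * CHAR_BIT - 1)
#endif

/* Same shape as union ulong_byte_assert in the header above. */
union demo_ulong_byte {
	unsigned long	ulong_val;
	unsigned char	byte;
};

/* True if setting bit DEMO_LOCK_BITNR of a long touches its first byte. */
static bool demo_lock_bit_in_first_byte(void)
{
	union demo_ulong_byte u = { .ulong_val = 1UL << DEMO_LOCK_BITNR };

	return u.byte != 0;
}

/* Runtime stand-in for the compile-time check in bucket_unlock(). */
static void demo_check_lock_bit(void)
{
	assert(demo_lock_bit_in_first_byte());
}
```

On little-endian machines bit 0 is the lowest bit of the first byte; on big-endian machines the highest-numbered bit is the top bit of the first byte, so the check passes on both.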


@ -1,174 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
#include "bcachefs.h"
#include "buckets_waiting_for_journal.h"
#include <linux/hash.h>
#include <linux/random.h>
static inline struct bucket_hashed *
bucket_hash(struct buckets_waiting_for_journal_table *t,
unsigned hash_seed_idx, u64 dev_bucket)
{
return t->d + hash_64(dev_bucket ^ t->hash_seeds[hash_seed_idx], t->bits);
}
static void bucket_table_init(struct buckets_waiting_for_journal_table *t, size_t bits)
{
unsigned i;
t->bits = bits;
for (i = 0; i < ARRAY_SIZE(t->hash_seeds); i++)
get_random_bytes(&t->hash_seeds[i], sizeof(t->hash_seeds[i]));
memset(t->d, 0, sizeof(t->d[0]) << t->bits);
}
u64 bch2_bucket_journal_seq_ready(struct buckets_waiting_for_journal *b,
unsigned dev, u64 bucket)
{
struct buckets_waiting_for_journal_table *t;
u64 dev_bucket = (u64) dev << 56 | bucket;
u64 ret = 0;
mutex_lock(&b->lock);
t = b->t;
for (unsigned i = 0; i < ARRAY_SIZE(t->hash_seeds); i++) {
struct bucket_hashed *h = bucket_hash(t, i, dev_bucket);
if (h->dev_bucket == dev_bucket) {
ret = h->journal_seq;
break;
}
}
mutex_unlock(&b->lock);
return ret;
}
static bool bucket_table_insert(struct buckets_waiting_for_journal_table *t,
struct bucket_hashed *new,
u64 flushed_seq)
{
struct bucket_hashed *last_evicted = NULL;
unsigned tries, i;
for (tries = 0; tries < 10; tries++) {
struct bucket_hashed *old, *victim = NULL;
for (i = 0; i < ARRAY_SIZE(t->hash_seeds); i++) {
old = bucket_hash(t, i, new->dev_bucket);
if (old->dev_bucket == new->dev_bucket ||
old->journal_seq <= flushed_seq) {
*old = *new;
return true;
}
if (last_evicted != old)
victim = old;
}
/* hashed to same slot 3 times: */
if (!victim)
break;
/* Failed to find an empty slot: */
swap(*new, *victim);
last_evicted = victim;
}
return false;
}
int bch2_set_bucket_needs_journal_commit(struct buckets_waiting_for_journal *b,
u64 flushed_seq,
unsigned dev, u64 bucket,
u64 journal_seq)
{
struct buckets_waiting_for_journal_table *t, *n;
struct bucket_hashed tmp, new = {
.dev_bucket = (u64) dev << 56 | bucket,
.journal_seq = journal_seq,
};
size_t i, size, new_bits, nr_elements = 1, nr_rehashes = 0, nr_rehashes_this_size = 0;
int ret = 0;
mutex_lock(&b->lock);
if (likely(bucket_table_insert(b->t, &new, flushed_seq)))
goto out;
t = b->t;
size = 1UL << t->bits;
for (i = 0; i < size; i++)
nr_elements += t->d[i].journal_seq > flushed_seq;
new_bits = ilog2(roundup_pow_of_two(nr_elements * 3));
realloc:
n = kvmalloc(sizeof(*n) + (sizeof(n->d[0]) << new_bits), GFP_KERNEL);
if (!n) {
struct bch_fs *c = container_of(b, struct bch_fs, buckets_waiting_for_journal);
ret = bch_err_throw(c, ENOMEM_buckets_waiting_for_journal_set);
goto out;
}
retry_rehash:
if (nr_rehashes_this_size == 3) {
new_bits++;
nr_rehashes_this_size = 0;
kvfree(n);
goto realloc;
}
nr_rehashes++;
nr_rehashes_this_size++;
bucket_table_init(n, new_bits);
tmp = new;
BUG_ON(!bucket_table_insert(n, &tmp, flushed_seq));
for (i = 0; i < 1UL << t->bits; i++) {
if (t->d[i].journal_seq <= flushed_seq)
continue;
tmp = t->d[i];
if (!bucket_table_insert(n, &tmp, flushed_seq))
goto retry_rehash;
}
b->t = n;
kvfree(t);
pr_debug("took %zu rehashes, table at %zu/%lu elements",
nr_rehashes, nr_elements, 1UL << b->t->bits);
out:
mutex_unlock(&b->lock);
return ret;
}
void bch2_fs_buckets_waiting_for_journal_exit(struct bch_fs *c)
{
struct buckets_waiting_for_journal *b = &c->buckets_waiting_for_journal;
kvfree(b->t);
}
#define INITIAL_TABLE_BITS 3
int bch2_fs_buckets_waiting_for_journal_init(struct bch_fs *c)
{
struct buckets_waiting_for_journal *b = &c->buckets_waiting_for_journal;
mutex_init(&b->lock);
b->t = kvmalloc(sizeof(*b->t) +
(sizeof(b->t->d[0]) << INITIAL_TABLE_BITS), GFP_KERNEL);
if (!b->t)
return -BCH_ERR_ENOMEM_buckets_waiting_for_journal_init;
bucket_table_init(b->t, INITIAL_TABLE_BITS);
return 0;
}
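
bucket_table_insert() above is a cuckoo-style insert: each key has three candidate slots (one per hash seed), a slot is reusable if it already holds the same key or its journal sequence has been flushed, and otherwise an occupant is evicted and re-inserted, for at most ten rounds. The sketch below shows that scheme in userspace C under simplifying assumptions: a fixed-size table, a multiplicative hash modeled on the kernel's hash_64(), and no rehash on failure. All `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BITS	4
#define DEMO_NR_SEEDS	3

struct demo_entry {
	uint64_t key;
	uint64_t seq;	/* entries with seq <= flushed_seq count as free */
};

struct demo_table {
	uint64_t	  seeds[DEMO_NR_SEEDS];
	struct demo_entry d[1U << DEMO_BITS];
};

/* Candidate slot number i for key: the top DEMO_BITS bits of a 64-bit hash. */
static struct demo_entry *demo_slot(struct demo_table *t, unsigned i, uint64_t key)
{
	uint64_t h;

	assert(i < DEMO_NR_SEEDS);
	h = (key ^ t->seeds[i]) * 0x61C8864680B583EBULL;
	return &t->d[h >> (64 - DEMO_BITS)];
}

/*
 * Returns false if no slot was found after 10 rounds of eviction. The
 * entry displaced last is then dropped, so a real caller would rebuild
 * into a larger table, as bch2_set_bucket_needs_journal_commit() does.
 */
static bool demo_insert(struct demo_table *t, struct demo_entry new,
			uint64_t flushed_seq)
{
	struct demo_entry *last_evicted = NULL;

	for (unsigned tries = 0; tries < 10; tries++) {
		struct demo_entry *victim = NULL;

		for (unsigned i = 0; i < DEMO_NR_SEEDS; i++) {
			struct demo_entry *old = demo_slot(t, i, new.key);

			if (old->key == new.key || old->seq <= flushed_seq) {
				*old = new;
				return true;
			}
			if (old != last_evicted)
				victim = old;
		}

		/* All candidate slots are the one we just evicted from. */
		if (!victim)
			return false;

		/* Take the victim's slot and try to re-home the victim. */
		struct demo_entry tmp = *victim;
		*victim = new;
		new = tmp;
		last_evicted = victim;
	}
	return false;
}

/* Returns the stored seq for key, or 0 if the key is not present. */
static uint64_t demo_lookup(struct demo_table *t, uint64_t key)
{
	for (unsigned i = 0; i < DEMO_NR_SEEDS; i++) {
		struct demo_entry *e = demo_slot(t, i, key);

		if (e->key == key)
			return e->seq;
	}
	return 0;
}
```

Lookups stay cheap because a key can only ever live in one of its three candidate slots, which is the same bound bch2_bucket_journal_seq_ready() relies on.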


@ -1,15 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BUCKETS_WAITING_FOR_JOURNAL_H
#define _BUCKETS_WAITING_FOR_JOURNAL_H
#include "buckets_waiting_for_journal_types.h"
u64 bch2_bucket_journal_seq_ready(struct buckets_waiting_for_journal *,
unsigned, u64);
int bch2_set_bucket_needs_journal_commit(struct buckets_waiting_for_journal *,
u64, unsigned, u64, u64);
void bch2_fs_buckets_waiting_for_journal_exit(struct bch_fs *);
int bch2_fs_buckets_waiting_for_journal_init(struct bch_fs *);
#endif /* _BUCKETS_WAITING_FOR_JOURNAL_H */
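
Both functions declared above take a device index and a bucket number, which the implementation folds into a single 64-bit hash key as `(u64) dev << 56 | bucket`. The sketch below shows that packing in userspace C. The unpack helpers and the range asserts are additions for illustration only (the original code neither unpacks the key nor checks ranges), and all `demo_*` / `DEMO_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BUCKET_BITS	56
#define DEMO_BUCKET_MASK	((UINT64_C(1) << DEMO_BUCKET_BITS) - 1)

/* Device index in the top 8 bits, bucket number in the low 56 bits. */
static uint64_t demo_pack_dev_bucket(unsigned dev, uint64_t bucket)
{
	assert(dev < 256);
	assert(bucket <= DEMO_BUCKET_MASK);
	return (uint64_t)dev << DEMO_BUCKET_BITS | bucket;
}

static unsigned demo_unpack_dev(uint64_t dev_bucket)
{
	return (unsigned)(dev_bucket >> DEMO_BUCKET_BITS);
}

static uint64_t demo_unpack_bucket(uint64_t dev_bucket)
{
	return dev_bucket & DEMO_BUCKET_MASK;
}
```

Packing both values into one word lets the table compare and hash a single u64 per entry instead of a (dev, bucket) pair.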


@ -1,23 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BUCKETS_WAITING_FOR_JOURNAL_TYPES_H
#define _BUCKETS_WAITING_FOR_JOURNAL_TYPES_H
#include <linux/siphash.h>
struct bucket_hashed {
u64 dev_bucket;
u64 journal_seq;
};
struct buckets_waiting_for_journal_table {
unsigned bits;
u64 hash_seeds[3];
struct bucket_hashed d[];
};
struct buckets_waiting_for_journal {
struct mutex lock;
struct buckets_waiting_for_journal_table *t;
};
#endif /* _BUCKETS_WAITING_FOR_JOURNAL_TYPES_H */


@ -1,843 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
#ifndef NO_BCACHEFS_CHARDEV
#include "bcachefs.h"
#include "bcachefs_ioctl.h"
#include "buckets.h"
#include "chardev.h"
#include "disk_accounting.h"
#include "fsck.h"
#include "journal.h"
#include "move.h"
#include "recovery_passes.h"
#include "replicas.h"
#include "sb-counters.h"
#include "super-io.h"
#include "thread_with_file.h"
#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/fs.h>
#include <linux/ioctl.h>
#include <linux/major.h>
#include <linux/sched/task.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
/* returns with ref on ca->ref */
static struct bch_dev *bch2_device_lookup(struct bch_fs *c, u64 dev,
unsigned flags)
{
struct bch_dev *ca;
if (flags & BCH_BY_INDEX) {
if (dev >= c->sb.nr_devices)
return ERR_PTR(-EINVAL);
ca = bch2_dev_tryget_noerror(c, dev);
if (!ca)
return ERR_PTR(-EINVAL);
} else {
char *path;
path = strndup_user((const char __user *)
(unsigned long) dev, PATH_MAX);
if (IS_ERR(path))
return ERR_CAST(path);
ca = bch2_dev_lookup(c, path);
kfree(path);
}
return ca;
}
#if 0
static long bch2_ioctl_assemble(struct bch_ioctl_assemble __user *user_arg)
{
struct bch_ioctl_assemble arg;
struct bch_fs *c;
u64 *user_devs = NULL;
char **devs = NULL;
unsigned i;
int ret = -EFAULT;
if (copy_from_user(&arg, user_arg, sizeof(arg)))
return -EFAULT;
if (arg.flags || arg.pad)
return -EINVAL;
user_devs = kmalloc_array(arg.nr_devs, sizeof(u64), GFP_KERNEL);
if (!user_devs)
return -ENOMEM;
devs = kcalloc(arg.nr_devs, sizeof(char *), GFP_KERNEL);
if (copy_from_user(user_devs, user_arg->devs,
sizeof(u64) * arg.nr_devs))
goto err;
for (i = 0; i < arg.nr_devs; i++) {
devs[i] = strndup_user((const char __user *)(unsigned long)
user_devs[i],
PATH_MAX);
ret = PTR_ERR_OR_ZERO(devs[i]);
if (ret)
goto err;
}
c = bch2_fs_open(devs, arg.nr_devs, bch2_opts_empty());
ret = PTR_ERR_OR_ZERO(c);
if (!ret)
closure_put(&c->cl);
err:
if (devs)
for (i = 0; i < arg.nr_devs; i++)
kfree(devs[i]);
kfree(devs);
return ret;
}
static long bch2_ioctl_incremental(struct bch_ioctl_incremental __user *user_arg)
{
struct bch_ioctl_incremental arg;
const char *err;
char *path;
if (copy_from_user(&arg, user_arg, sizeof(arg)))
return -EFAULT;
if (arg.flags || arg.pad)
return -EINVAL;
path = strndup_user((const char __user *)(unsigned long) arg.dev, PATH_MAX);
ret = PTR_ERR_OR_ZERO(path);
if (ret)
return ret;
err = bch2_fs_open_incremental(path);
kfree(path);
if (err) {
pr_err("Could not register bcachefs devices: %s", err);
return -EINVAL;
}
return 0;
}
#endif
static long bch2_global_ioctl(unsigned cmd, void __user *arg)
{
long ret;
switch (cmd) {
#if 0
case BCH_IOCTL_ASSEMBLE:
return bch2_ioctl_assemble(arg);
case BCH_IOCTL_INCREMENTAL:
return bch2_ioctl_incremental(arg);
#endif
case BCH_IOCTL_FSCK_OFFLINE: {
ret = bch2_ioctl_fsck_offline(arg);
break;
}
default:
ret = -ENOTTY;
break;
}
if (ret < 0)
ret = bch2_err_class(ret);
return ret;
}
static long bch2_ioctl_query_uuid(struct bch_fs *c,
struct bch_ioctl_query_uuid __user *user_arg)
{
return copy_to_user_errcode(&user_arg->uuid, &c->sb.user_uuid,
sizeof(c->sb.user_uuid));
}
#if 0
static long bch2_ioctl_start(struct bch_fs *c, struct bch_ioctl_start arg)
{
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
if (arg.flags || arg.pad)
return -EINVAL;
return bch2_fs_start(c);
}
static long bch2_ioctl_stop(struct bch_fs *c)
{
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
bch2_fs_stop(c);
return 0;
}
#endif
static long bch2_ioctl_disk_add(struct bch_fs *c, struct bch_ioctl_disk arg)
{
char *path;
int ret;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
if (arg.flags || arg.pad)
return -EINVAL;
path = strndup_user((const char __user *)(unsigned long) arg.dev, PATH_MAX);
ret = PTR_ERR_OR_ZERO(path);
if (ret)
return ret;
ret = bch2_dev_add(c, path);
kfree(path);
return ret;
}
static long bch2_ioctl_disk_remove(struct bch_fs *c, struct bch_ioctl_disk arg)
{
struct bch_dev *ca;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
if ((arg.flags & ~(BCH_FORCE_IF_DATA_LOST|
BCH_FORCE_IF_METADATA_LOST|
BCH_FORCE_IF_DEGRADED|
BCH_BY_INDEX)) ||
arg.pad)
return -EINVAL;
ca = bch2_device_lookup(c, arg.dev, arg.flags);
if (IS_ERR(ca))
return PTR_ERR(ca);
return bch2_dev_remove(c, ca, arg.flags);
}
static long bch2_ioctl_disk_online(struct bch_fs *c, struct bch_ioctl_disk arg)
{
char *path;
int ret;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
if (arg.flags || arg.pad)
return -EINVAL;
path = strndup_user((const char __user *)(unsigned long) arg.dev, PATH_MAX);
ret = PTR_ERR_OR_ZERO(path);
if (ret)
return ret;
ret = bch2_dev_online(c, path);
kfree(path);
return ret;
}
static long bch2_ioctl_disk_offline(struct bch_fs *c, struct bch_ioctl_disk arg)
{
struct bch_dev *ca;
int ret;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
if ((arg.flags & ~(BCH_FORCE_IF_DATA_LOST|
BCH_FORCE_IF_METADATA_LOST|
BCH_FORCE_IF_DEGRADED|
BCH_BY_INDEX)) ||
arg.pad)
return -EINVAL;
ca = bch2_device_lookup(c, arg.dev, arg.flags);
if (IS_ERR(ca))
return PTR_ERR(ca);
ret = bch2_dev_offline(c, ca, arg.flags);
bch2_dev_put(ca);
return ret;
}
static long bch2_ioctl_disk_set_state(struct bch_fs *c,
struct bch_ioctl_disk_set_state arg)
{
struct bch_dev *ca;
int ret;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
if ((arg.flags & ~(BCH_FORCE_IF_DATA_LOST|
BCH_FORCE_IF_METADATA_LOST|
BCH_FORCE_IF_DEGRADED|
BCH_BY_INDEX)) ||
arg.pad[0] || arg.pad[1] || arg.pad[2] ||
arg.new_state >= BCH_MEMBER_STATE_NR)
return -EINVAL;
ca = bch2_device_lookup(c, arg.dev, arg.flags);
if (IS_ERR(ca))
return PTR_ERR(ca);
ret = bch2_dev_set_state(c, ca, arg.new_state, arg.flags);
if (ret)
bch_err(c, "Error setting device state: %s", bch2_err_str(ret));
bch2_dev_put(ca);
return ret;
}
struct bch_data_ctx {
struct thread_with_file thr;
struct bch_fs *c;
struct bch_ioctl_data arg;
struct bch_move_stats stats;
};
static int bch2_data_thread(void *arg)
{
struct bch_data_ctx *ctx = container_of(arg, struct bch_data_ctx, thr);
ctx->thr.ret = bch2_data_job(ctx->c, &ctx->stats, ctx->arg);
if (ctx->thr.ret == -BCH_ERR_device_offline)
ctx->stats.ret = BCH_IOCTL_DATA_EVENT_RET_device_offline;
else {
ctx->stats.ret = BCH_IOCTL_DATA_EVENT_RET_done;
ctx->stats.data_type = (int) DATA_PROGRESS_DATA_TYPE_done;
}
enumerated_ref_put(&ctx->c->writes, BCH_WRITE_REF_ioctl_data);
return 0;
}
static int bch2_data_job_release(struct inode *inode, struct file *file)
{
struct bch_data_ctx *ctx = container_of(file->private_data, struct bch_data_ctx, thr);
bch2_thread_with_file_exit(&ctx->thr);
kfree(ctx);
return 0;
}
static ssize_t bch2_data_job_read(struct file *file, char __user *buf,
size_t len, loff_t *ppos)
{
struct bch_data_ctx *ctx = container_of(file->private_data, struct bch_data_ctx, thr);
struct bch_fs *c = ctx->c;
struct bch_ioctl_data_event e = {
.type = BCH_DATA_EVENT_PROGRESS,
.ret = ctx->stats.ret,
.p.data_type = ctx->stats.data_type,
.p.btree_id = ctx->stats.pos.btree,
.p.pos = ctx->stats.pos.pos,
.p.sectors_done = atomic64_read(&ctx->stats.sectors_seen),
.p.sectors_error_corrected = atomic64_read(&ctx->stats.sectors_error_corrected),
.p.sectors_error_uncorrected = atomic64_read(&ctx->stats.sectors_error_uncorrected),
};
if (ctx->arg.op == BCH_DATA_OP_scrub) {
struct bch_dev *ca = bch2_dev_tryget(c, ctx->arg.scrub.dev);
if (ca) {
struct bch_dev_usage_full u;
bch2_dev_usage_full_read_fast(ca, &u);
for (unsigned i = BCH_DATA_btree; i < ARRAY_SIZE(u.d); i++)
if (ctx->arg.scrub.data_types & BIT(i))
e.p.sectors_total += u.d[i].sectors;
bch2_dev_put(ca);
}
} else {
e.p.sectors_total = bch2_fs_usage_read_short(c).used;
}
if (len < sizeof(e))
return -EINVAL;
return copy_to_user_errcode(buf, &e, sizeof(e)) ?: sizeof(e);
}
static const struct file_operations bcachefs_data_ops = {
.release = bch2_data_job_release,
.read = bch2_data_job_read,
};
static long bch2_ioctl_data(struct bch_fs *c,
struct bch_ioctl_data arg)
{
struct bch_data_ctx *ctx;
int ret;
if (!enumerated_ref_tryget(&c->writes, BCH_WRITE_REF_ioctl_data))
return -EROFS;
if (!capable(CAP_SYS_ADMIN)) {
ret = -EPERM;
goto put_ref;
}
if (arg.op >= BCH_DATA_OP_NR || arg.flags) {
ret = -EINVAL;
goto put_ref;
}
ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
if (!ctx) {
ret = -ENOMEM;
goto put_ref;
}
ctx->c = c;
ctx->arg = arg;
ret = bch2_run_thread_with_file(&ctx->thr,
&bcachefs_data_ops,
bch2_data_thread);
if (ret < 0)
goto cleanup;
return ret;
cleanup:
kfree(ctx);
put_ref:
enumerated_ref_put(&c->writes, BCH_WRITE_REF_ioctl_data);
return ret;
}
static noinline_for_stack long bch2_ioctl_fs_usage(struct bch_fs *c,
struct bch_ioctl_fs_usage __user *user_arg)
{
struct bch_ioctl_fs_usage arg = {};
darray_char replicas = {};
u32 replica_entries_bytes;
int ret = 0;
if (!test_bit(BCH_FS_started, &c->flags))
return -EINVAL;
if (get_user(replica_entries_bytes, &user_arg->replica_entries_bytes))
return -EFAULT;
ret = bch2_fs_replicas_usage_read(c, &replicas) ?:
(replica_entries_bytes < replicas.nr ? -ERANGE : 0) ?:
copy_to_user_errcode(&user_arg->replicas, replicas.data, replicas.nr);
if (ret)
goto err;
struct bch_fs_usage_short u = bch2_fs_usage_read_short(c);
arg.capacity = c->capacity;
arg.used = u.used;
arg.online_reserved = percpu_u64_get(c->online_reserved);
arg.replica_entries_bytes = replicas.nr;
for (unsigned i = 0; i < BCH_REPLICAS_MAX; i++) {
struct disk_accounting_pos k;
disk_accounting_key_init(k, persistent_reserved, .nr_replicas = i);
bch2_accounting_mem_read(c,
disk_accounting_pos_to_bpos(&k),
&arg.persistent_reserved[i], 1);
}
ret = copy_to_user_errcode(user_arg, &arg, sizeof(arg));
err:
darray_exit(&replicas);
return ret;
}
static long bch2_ioctl_query_accounting(struct bch_fs *c,
struct bch_ioctl_query_accounting __user *user_arg)
{
struct bch_ioctl_query_accounting arg;
darray_char accounting = {};
int ret = 0;
if (!test_bit(BCH_FS_started, &c->flags))
return -EINVAL;
ret = copy_from_user_errcode(&arg, user_arg, sizeof(arg)) ?:
bch2_fs_accounting_read(c, &accounting, arg.accounting_types_mask) ?:
(arg.accounting_u64s * sizeof(u64) < accounting.nr ? -ERANGE : 0) ?:
copy_to_user_errcode(&user_arg->accounting, accounting.data, accounting.nr);
if (ret)
goto err;
arg.capacity = c->capacity;
arg.used = bch2_fs_usage_read_short(c).used;
arg.online_reserved = percpu_u64_get(c->online_reserved);
arg.accounting_u64s = accounting.nr / sizeof(u64);
ret = copy_to_user_errcode(user_arg, &arg, sizeof(arg));
err:
darray_exit(&accounting);
return ret;
}
/* obsolete, didn't allow for new data types: */
static noinline_for_stack long bch2_ioctl_dev_usage(struct bch_fs *c,
struct bch_ioctl_dev_usage __user *user_arg)
{
struct bch_ioctl_dev_usage arg;
struct bch_dev_usage_full src;
struct bch_dev *ca;
unsigned i;
if (!test_bit(BCH_FS_started, &c->flags))
return -EINVAL;
if (copy_from_user(&arg, user_arg, sizeof(arg)))
return -EFAULT;
if ((arg.flags & ~BCH_BY_INDEX) ||
arg.pad[0] ||
arg.pad[1] ||
arg.pad[2])
return -EINVAL;
ca = bch2_device_lookup(c, arg.dev, arg.flags);
if (IS_ERR(ca))
return PTR_ERR(ca);
src = bch2_dev_usage_full_read(ca);
arg.state = ca->mi.state;
arg.bucket_size = ca->mi.bucket_size;
arg.nr_buckets = ca->mi.nbuckets - ca->mi.first_bucket;
for (i = 0; i < ARRAY_SIZE(arg.d); i++) {
arg.d[i].buckets = src.d[i].buckets;
arg.d[i].sectors = src.d[i].sectors;
arg.d[i].fragmented = src.d[i].fragmented;
}
bch2_dev_put(ca);
return copy_to_user_errcode(user_arg, &arg, sizeof(arg));
}
static long bch2_ioctl_dev_usage_v2(struct bch_fs *c,
struct bch_ioctl_dev_usage_v2 __user *user_arg)
{
struct bch_ioctl_dev_usage_v2 arg;
struct bch_dev_usage_full src;
struct bch_dev *ca;
int ret = 0;
if (!test_bit(BCH_FS_started, &c->flags))
return -EINVAL;
if (copy_from_user(&arg, user_arg, sizeof(arg)))
return -EFAULT;
if ((arg.flags & ~BCH_BY_INDEX) ||
arg.pad[0] ||
arg.pad[1] ||
arg.pad[2])
return -EINVAL;
ca = bch2_device_lookup(c, arg.dev, arg.flags);
if (IS_ERR(ca))
return PTR_ERR(ca);
src = bch2_dev_usage_full_read(ca);
arg.state = ca->mi.state;
arg.bucket_size = ca->mi.bucket_size;
arg.nr_data_types = min(arg.nr_data_types, BCH_DATA_NR);
arg.nr_buckets = ca->mi.nbuckets - ca->mi.first_bucket;
ret = copy_to_user_errcode(user_arg, &arg, sizeof(arg));
if (ret)
goto err;
for (unsigned i = 0; i < arg.nr_data_types; i++) {
struct bch_ioctl_dev_usage_type t = {
.buckets = src.d[i].buckets,
.sectors = src.d[i].sectors,
.fragmented = src.d[i].fragmented,
};
ret = copy_to_user_errcode(&user_arg->d[i], &t, sizeof(t));
if (ret)
goto err;
}
err:
bch2_dev_put(ca);
return ret;
}
static long bch2_ioctl_read_super(struct bch_fs *c,
struct bch_ioctl_read_super arg)
{
struct bch_dev *ca = NULL;
struct bch_sb *sb;
int ret = 0;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
if ((arg.flags & ~(BCH_BY_INDEX|BCH_READ_DEV)) ||
arg.pad)
return -EINVAL;
mutex_lock(&c->sb_lock);
if (arg.flags & BCH_READ_DEV) {
ca = bch2_device_lookup(c, arg.dev, arg.flags);
ret = PTR_ERR_OR_ZERO(ca);
if (ret)
goto err_unlock;
sb = ca->disk_sb.sb;
} else {
sb = c->disk_sb.sb;
}
if (vstruct_bytes(sb) > arg.size) {
ret = -ERANGE;
goto err;
}
ret = copy_to_user_errcode((void __user *)(unsigned long)arg.sb, sb,
vstruct_bytes(sb));
err:
bch2_dev_put(ca);
err_unlock:
mutex_unlock(&c->sb_lock);
return ret;
}
static long bch2_ioctl_disk_get_idx(struct bch_fs *c,
struct bch_ioctl_disk_get_idx arg)
{
dev_t dev = huge_decode_dev(arg.dev);
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
if (!dev)
return -EINVAL;
guard(rcu)();
for_each_online_member_rcu(c, ca)
if (ca->dev == dev)
return ca->dev_idx;
return bch_err_throw(c, ENOENT_dev_idx_not_found);
}
static long bch2_ioctl_disk_resize(struct bch_fs *c,
struct bch_ioctl_disk_resize arg)
{
struct bch_dev *ca;
int ret;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
if ((arg.flags & ~BCH_BY_INDEX) ||
arg.pad)
return -EINVAL;
ca = bch2_device_lookup(c, arg.dev, arg.flags);
if (IS_ERR(ca))
return PTR_ERR(ca);
ret = bch2_dev_resize(c, ca, arg.nbuckets);
bch2_dev_put(ca);
return ret;
}
static long bch2_ioctl_disk_resize_journal(struct bch_fs *c,
struct bch_ioctl_disk_resize_journal arg)
{
struct bch_dev *ca;
int ret;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
if ((arg.flags & ~BCH_BY_INDEX) ||
arg.pad)
return -EINVAL;
if (arg.nbuckets > U32_MAX)
return -EINVAL;
ca = bch2_device_lookup(c, arg.dev, arg.flags);
if (IS_ERR(ca))
return PTR_ERR(ca);
ret = bch2_set_nr_journal_buckets(c, ca, arg.nbuckets);
bch2_dev_put(ca);
return ret;
}
#define BCH_IOCTL(_name, _argtype) \
do { \
_argtype i; \
\
if (copy_from_user(&i, arg, sizeof(i))) \
return -EFAULT; \
ret = bch2_ioctl_##_name(c, i); \
goto out; \
} while (0)
long bch2_fs_ioctl(struct bch_fs *c, unsigned cmd, void __user *arg)
{
long ret;
switch (cmd) {
case BCH_IOCTL_QUERY_UUID:
return bch2_ioctl_query_uuid(c, arg);
case BCH_IOCTL_FS_USAGE:
return bch2_ioctl_fs_usage(c, arg);
case BCH_IOCTL_DEV_USAGE:
return bch2_ioctl_dev_usage(c, arg);
case BCH_IOCTL_DEV_USAGE_V2:
return bch2_ioctl_dev_usage_v2(c, arg);
#if 0
case BCH_IOCTL_START:
BCH_IOCTL(start, struct bch_ioctl_start);
case BCH_IOCTL_STOP:
return bch2_ioctl_stop(c);
#endif
case BCH_IOCTL_READ_SUPER:
BCH_IOCTL(read_super, struct bch_ioctl_read_super);
case BCH_IOCTL_DISK_GET_IDX:
BCH_IOCTL(disk_get_idx, struct bch_ioctl_disk_get_idx);
}
if (!test_bit(BCH_FS_started, &c->flags))
return -EINVAL;
switch (cmd) {
case BCH_IOCTL_DISK_ADD:
BCH_IOCTL(disk_add, struct bch_ioctl_disk);
case BCH_IOCTL_DISK_REMOVE:
BCH_IOCTL(disk_remove, struct bch_ioctl_disk);
case BCH_IOCTL_DISK_ONLINE:
BCH_IOCTL(disk_online, struct bch_ioctl_disk);
case BCH_IOCTL_DISK_OFFLINE:
BCH_IOCTL(disk_offline, struct bch_ioctl_disk);
case BCH_IOCTL_DISK_SET_STATE:
BCH_IOCTL(disk_set_state, struct bch_ioctl_disk_set_state);
case BCH_IOCTL_DATA:
BCH_IOCTL(data, struct bch_ioctl_data);
case BCH_IOCTL_DISK_RESIZE:
BCH_IOCTL(disk_resize, struct bch_ioctl_disk_resize);
case BCH_IOCTL_DISK_RESIZE_JOURNAL:
BCH_IOCTL(disk_resize_journal, struct bch_ioctl_disk_resize_journal);
case BCH_IOCTL_FSCK_ONLINE:
BCH_IOCTL(fsck_online, struct bch_ioctl_fsck_online);
case BCH_IOCTL_QUERY_ACCOUNTING:
return bch2_ioctl_query_accounting(c, arg);
case BCH_IOCTL_QUERY_COUNTERS:
return bch2_ioctl_query_counters(c, arg);
default:
return -ENOTTY;
}
out:
if (ret < 0)
ret = bch2_err_class(ret);
return ret;
}
static DEFINE_IDR(bch_chardev_minor);
static long bch2_chardev_ioctl(struct file *filp, unsigned cmd, unsigned long v)
{
unsigned minor = iminor(file_inode(filp));
struct bch_fs *c = minor < U8_MAX ? idr_find(&bch_chardev_minor, minor) : NULL;
void __user *arg = (void __user *) v;
return c
? bch2_fs_ioctl(c, cmd, arg)
: bch2_global_ioctl(cmd, arg);
}
static const struct file_operations bch_chardev_fops = {
.owner = THIS_MODULE,
.unlocked_ioctl = bch2_chardev_ioctl,
.open = nonseekable_open,
};
static int bch_chardev_major;
static const struct class bch_chardev_class = {
.name = "bcachefs",
};
static struct device *bch_chardev;
void bch2_fs_chardev_exit(struct bch_fs *c)
{
if (!IS_ERR_OR_NULL(c->chardev))
device_unregister(c->chardev);
if (c->minor >= 0)
idr_remove(&bch_chardev_minor, c->minor);
}
int bch2_fs_chardev_init(struct bch_fs *c)
{
c->minor = idr_alloc(&bch_chardev_minor, c, 0, 0, GFP_KERNEL);
if (c->minor < 0)
return c->minor;
c->chardev = device_create(&bch_chardev_class, NULL,
MKDEV(bch_chardev_major, c->minor), c,
"bcachefs%u-ctl", c->minor);
if (IS_ERR(c->chardev))
return PTR_ERR(c->chardev);
return 0;
}
void bch2_chardev_exit(void)
{
device_destroy(&bch_chardev_class, MKDEV(bch_chardev_major, U8_MAX));
class_unregister(&bch_chardev_class);
if (bch_chardev_major > 0)
unregister_chrdev(bch_chardev_major, "bcachefs");
}
int __init bch2_chardev_init(void)
{
int ret;
bch_chardev_major = register_chrdev(0, "bcachefs-ctl", &bch_chardev_fops);
if (bch_chardev_major < 0)
return bch_chardev_major;
ret = class_register(&bch_chardev_class);
if (ret)
goto major_out;
bch_chardev = device_create(&bch_chardev_class, NULL,
MKDEV(bch_chardev_major, U8_MAX),
NULL, "bcachefs-ctl");
if (IS_ERR(bch_chardev)) {
ret = PTR_ERR(bch_chardev);
goto class_out;
}
return 0;
class_out:
class_unregister(&bch_chardev_class);
major_out:
unregister_chrdev(bch_chardev_major, "bcachefs-ctl");
return ret;
}
#endif /* NO_BCACHEFS_CHARDEV */
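Reviewer note: most handlers in the file above reject unknown flag bits and non-zero padding before doing any work, which keeps those bits free for later ABI extensions. A minimal userspace sketch of that check follows. The `DEMO_*` flag values, `struct demo_arg`, and `demo_check_arg()` are hypothetical names for illustration, not the bcachefs ioctl ABI.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits, for illustration only; not the bcachefs ABI values. */
#define DEMO_FORCE_IF_DEGRADED	(1U << 0)
#define DEMO_BY_INDEX		(1U << 1)

/* Hypothetical argument layout with explicit padding. */
struct demo_arg {
	uint32_t flags;
	uint32_t pad;
	uint64_t dev;
};

/*
 * Return 0 if only bits from 'allowed' are set and the padding is zero,
 * -EINVAL otherwise.
 */
static int demo_check_arg(const struct demo_arg *arg, uint32_t allowed)
{
	assert(arg != NULL);

	if ((arg->flags & ~allowed) || arg->pad)
		return -EINVAL;
	return 0;
}
```

Each handler would pass its own `allowed` mask, in the same way the removed code spells out the permitted `BCH_*` flags per ioctl.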

@ -1,31 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_CHARDEV_H
#define _BCACHEFS_CHARDEV_H
#ifndef NO_BCACHEFS_FS
long bch2_fs_ioctl(struct bch_fs *, unsigned, void __user *);
void bch2_fs_chardev_exit(struct bch_fs *);
int bch2_fs_chardev_init(struct bch_fs *);
void bch2_chardev_exit(void);
int __init bch2_chardev_init(void);
#else
static inline long bch2_fs_ioctl(struct bch_fs *c,
unsigned cmd, void __user * arg)
{
return -ENOTTY;
}
static inline void bch2_fs_chardev_exit(struct bch_fs *c) {}
static inline int bch2_fs_chardev_init(struct bch_fs *c) { return 0; }
static inline void bch2_chardev_exit(void) {}
static inline int __init bch2_chardev_init(void) { return 0; }
#endif /* NO_BCACHEFS_FS */
#endif /* _BCACHEFS_CHARDEV_H */
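Reviewer note: the header above swaps the real declarations for `static inline` stubs when `NO_BCACHEFS_FS` is defined, so callers need no `#ifdef` of their own. A minimal sketch of the same pattern, assuming a hypothetical `NO_DEMO_FEATURE` switch and hypothetical `demo_*` names:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical feature switch; defined here so the stubs are selected. */
#define NO_DEMO_FEATURE

struct demo_fs;	/* opaque, hypothetical */

#ifndef NO_DEMO_FEATURE
/* Real implementations would live in a separate .c file. */
long demo_fs_ioctl(struct demo_fs *fs, unsigned cmd, void *arg);
int demo_chardev_init(struct demo_fs *fs);
#else
/* Stubs: callers compile unchanged and see a fixed result. */
static inline long demo_fs_ioctl(struct demo_fs *fs, unsigned cmd, void *arg)
{
	(void) fs;
	(void) cmd;
	(void) arg;
	return -ENOTTY;
}

static inline int demo_chardev_init(struct demo_fs *fs)
{
	(void) fs;
	return 0;
}
#endif

/* With the feature compiled out, the stubs report "no such ioctl" and success. */
static void demo_stub_selfcheck(void)
{
	assert(demo_fs_ioctl(NULL, 0, NULL) == -ENOTTY);
	assert(demo_chardev_init(NULL) == 0);
}
```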

@ -1,698 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
#include "bcachefs.h"
#include "checksum.h"
#include "errcode.h"
#include "error.h"
#include "super.h"
#include "super-io.h"
#include <linux/crc32c.h>
#include <linux/xxhash.h>
#include <linux/key.h>
#include <linux/random.h>
#include <linux/ratelimit.h>
#include <crypto/chacha.h>
#include <crypto/poly1305.h>
#include <keys/user-type.h>
/*
* bch2_checksum_state abstracts the checksum state accumulated over multiple pages.
* It allows pages to be merged without the checksum algorithm losing its state.
* For native checksum algorithms (like crc), a default seed value will do.
* For hash-like algorithms, a state needs to be stored.
*/
struct bch2_checksum_state {
union {
u64 seed;
struct xxh64_state h64state;
};
unsigned int type;
};
static void bch2_checksum_init(struct bch2_checksum_state *state)
{
switch (state->type) {
case BCH_CSUM_none:
case BCH_CSUM_crc32c:
case BCH_CSUM_crc64:
state->seed = 0;
break;
case BCH_CSUM_crc32c_nonzero:
state->seed = U32_MAX;
break;
case BCH_CSUM_crc64_nonzero:
state->seed = U64_MAX;
break;
case BCH_CSUM_xxhash:
xxh64_reset(&state->h64state, 0);
break;
default:
BUG();
}
}
static u64 bch2_checksum_final(const struct bch2_checksum_state *state)
{
switch (state->type) {
case BCH_CSUM_none:
case BCH_CSUM_crc32c:
case BCH_CSUM_crc64:
return state->seed;
case BCH_CSUM_crc32c_nonzero:
return state->seed ^ U32_MAX;
case BCH_CSUM_crc64_nonzero:
return state->seed ^ U64_MAX;
case BCH_CSUM_xxhash:
return xxh64_digest(&state->h64state);
default:
BUG();
}
}
static void bch2_checksum_update(struct bch2_checksum_state *state, const void *data, size_t len)
{
switch (state->type) {
case BCH_CSUM_none:
return;
case BCH_CSUM_crc32c_nonzero:
case BCH_CSUM_crc32c:
state->seed = crc32c(state->seed, data, len);
break;
case BCH_CSUM_crc64_nonzero:
case BCH_CSUM_crc64:
state->seed = crc64_be(state->seed, data, len);
break;
case BCH_CSUM_xxhash:
xxh64_update(&state->h64state, data, len);
break;
default:
BUG();
}
}
static void bch2_chacha20_init(struct chacha_state *state,
const struct bch_key *key, struct nonce nonce)
{
u32 key_words[CHACHA_KEY_SIZE / sizeof(u32)];
BUILD_BUG_ON(sizeof(key_words) != sizeof(*key));
memcpy(key_words, key, sizeof(key_words));
le32_to_cpu_array(key_words, ARRAY_SIZE(key_words));
BUILD_BUG_ON(sizeof(nonce) != CHACHA_IV_SIZE);
chacha_init(state, key_words, (const u8 *)nonce.d);
memzero_explicit(key_words, sizeof(key_words));
}
void bch2_chacha20(const struct bch_key *key, struct nonce nonce,
void *data, size_t len)
{
struct chacha_state state;
bch2_chacha20_init(&state, key, nonce);
chacha20_crypt(&state, data, data, len);
chacha_zeroize_state(&state);
}
static void bch2_poly1305_init(struct poly1305_desc_ctx *desc,
struct bch_fs *c, struct nonce nonce)
{
u8 key[POLY1305_KEY_SIZE] = { 0 };
nonce.d[3] ^= BCH_NONCE_POLY;
bch2_chacha20(&c->chacha20_key, nonce, key, sizeof(key));
poly1305_init(desc, key);
}
struct bch_csum bch2_checksum(struct bch_fs *c, unsigned type,
struct nonce nonce, const void *data, size_t len)
{
switch (type) {
case BCH_CSUM_none:
case BCH_CSUM_crc32c_nonzero:
case BCH_CSUM_crc64_nonzero:
case BCH_CSUM_crc32c:
case BCH_CSUM_xxhash:
case BCH_CSUM_crc64: {
struct bch2_checksum_state state;
state.type = type;
bch2_checksum_init(&state);
bch2_checksum_update(&state, data, len);
return (struct bch_csum) { .lo = cpu_to_le64(bch2_checksum_final(&state)) };
}
case BCH_CSUM_chacha20_poly1305_80:
case BCH_CSUM_chacha20_poly1305_128: {
struct poly1305_desc_ctx dctx;
u8 digest[POLY1305_DIGEST_SIZE];
struct bch_csum ret = { 0 };
bch2_poly1305_init(&dctx, c, nonce);
poly1305_update(&dctx, data, len);
poly1305_final(&dctx, digest);
memcpy(&ret, digest, bch_crc_bytes[type]);
return ret;
}
default:
return (struct bch_csum) {};
}
}
int bch2_encrypt(struct bch_fs *c, unsigned type,
struct nonce nonce, void *data, size_t len)
{
if (!bch2_csum_type_is_encryption(type))
return 0;
if (bch2_fs_inconsistent_on(!c->chacha20_key_set,
c, "attempting to encrypt without encryption key"))
return bch_err_throw(c, no_encryption_key);
bch2_chacha20(&c->chacha20_key, nonce, data, len);
return 0;
}
static struct bch_csum __bch2_checksum_bio(struct bch_fs *c, unsigned type,
struct nonce nonce, struct bio *bio,
struct bvec_iter *iter)
{
struct bio_vec bv;
switch (type) {
case BCH_CSUM_none:
return (struct bch_csum) { 0 };
case BCH_CSUM_crc32c_nonzero:
case BCH_CSUM_crc64_nonzero:
case BCH_CSUM_crc32c:
case BCH_CSUM_xxhash:
case BCH_CSUM_crc64: {
struct bch2_checksum_state state;
state.type = type;
bch2_checksum_init(&state);
#ifdef CONFIG_HIGHMEM
__bio_for_each_segment(bv, bio, *iter, *iter) {
void *p = kmap_local_page(bv.bv_page) + bv.bv_offset;
bch2_checksum_update(&state, p, bv.bv_len);
kunmap_local(p);
}
#else
__bio_for_each_bvec(bv, bio, *iter, *iter)
bch2_checksum_update(&state, page_address(bv.bv_page) + bv.bv_offset,
bv.bv_len);
#endif
return (struct bch_csum) { .lo = cpu_to_le64(bch2_checksum_final(&state)) };
}
case BCH_CSUM_chacha20_poly1305_80:
case BCH_CSUM_chacha20_poly1305_128: {
struct poly1305_desc_ctx dctx;
u8 digest[POLY1305_DIGEST_SIZE];
struct bch_csum ret = { 0 };
bch2_poly1305_init(&dctx, c, nonce);
#ifdef CONFIG_HIGHMEM
__bio_for_each_segment(bv, bio, *iter, *iter) {
void *p = kmap_local_page(bv.bv_page) + bv.bv_offset;
poly1305_update(&dctx, p, bv.bv_len);
kunmap_local(p);
}
#else
__bio_for_each_bvec(bv, bio, *iter, *iter)
poly1305_update(&dctx,
page_address(bv.bv_page) + bv.bv_offset,
bv.bv_len);
#endif
poly1305_final(&dctx, digest);
memcpy(&ret, digest, bch_crc_bytes[type]);
return ret;
}
default:
return (struct bch_csum) {};
}
}
struct bch_csum bch2_checksum_bio(struct bch_fs *c, unsigned type,
struct nonce nonce, struct bio *bio)
{
struct bvec_iter iter = bio->bi_iter;
return __bch2_checksum_bio(c, type, nonce, bio, &iter);
}
int __bch2_encrypt_bio(struct bch_fs *c, unsigned type,
struct nonce nonce, struct bio *bio)
{
struct bio_vec bv;
struct bvec_iter iter;
struct chacha_state chacha_state;
int ret = 0;
if (bch2_fs_inconsistent_on(!c->chacha20_key_set,
c, "attempting to encrypt without encryption key"))
return bch_err_throw(c, no_encryption_key);
bch2_chacha20_init(&chacha_state, &c->chacha20_key, nonce);
bio_for_each_segment(bv, bio, iter) {
void *p;
/*
* chacha_crypt() assumes that the length is a multiple of
* CHACHA_BLOCK_SIZE on any non-final call.
*/
if (!IS_ALIGNED(bv.bv_len, CHACHA_BLOCK_SIZE)) {
bch_err_ratelimited(c, "bio not aligned for encryption");
ret = -EIO;
break;
}
p = bvec_kmap_local(&bv);
chacha20_crypt(&chacha_state, p, p, bv.bv_len);
kunmap_local(p);
}
chacha_zeroize_state(&chacha_state);
return ret;
}
struct bch_csum bch2_checksum_merge(unsigned type, struct bch_csum a,
struct bch_csum b, size_t b_len)
{
struct bch2_checksum_state state;
state.type = type;
bch2_checksum_init(&state);
state.seed = le64_to_cpu(a.lo);
BUG_ON(!bch2_checksum_mergeable(type));
while (b_len) {
unsigned page_len = min_t(unsigned, b_len, PAGE_SIZE);
bch2_checksum_update(&state,
page_address(ZERO_PAGE(0)), page_len);
b_len -= page_len;
}
a.lo = cpu_to_le64(bch2_checksum_final(&state));
a.lo ^= b.lo;
a.hi ^= b.hi;
return a;
}
int bch2_rechecksum_bio(struct bch_fs *c, struct bio *bio,
struct bversion version,
struct bch_extent_crc_unpacked crc_old,
struct bch_extent_crc_unpacked *crc_a,
struct bch_extent_crc_unpacked *crc_b,
unsigned len_a, unsigned len_b,
unsigned new_csum_type)
{
struct bvec_iter iter = bio->bi_iter;
struct nonce nonce = extent_nonce(version, crc_old);
struct bch_csum merged = { 0 };
struct crc_split {
struct bch_extent_crc_unpacked *crc;
unsigned len;
unsigned csum_type;
struct bch_csum csum;
} splits[3] = {
{ crc_a, len_a, new_csum_type, { 0 }},
{ crc_b, len_b, new_csum_type, { 0 } },
{ NULL, bio_sectors(bio) - len_a - len_b, new_csum_type, { 0 } },
}, *i;
bool mergeable = crc_old.csum_type == new_csum_type &&
bch2_checksum_mergeable(new_csum_type);
unsigned crc_nonce = crc_old.nonce;
BUG_ON(len_a + len_b > bio_sectors(bio));
BUG_ON(crc_old.uncompressed_size != bio_sectors(bio));
BUG_ON(crc_is_compressed(crc_old));
BUG_ON(bch2_csum_type_is_encryption(crc_old.csum_type) !=
bch2_csum_type_is_encryption(new_csum_type));
for (i = splits; i < splits + ARRAY_SIZE(splits); i++) {
iter.bi_size = i->len << 9;
if (mergeable || i->crc)
i->csum = __bch2_checksum_bio(c, i->csum_type,
nonce, bio, &iter);
else
bio_advance_iter(bio, &iter, i->len << 9);
nonce = nonce_add(nonce, i->len << 9);
}
if (mergeable)
for (i = splits; i < splits + ARRAY_SIZE(splits); i++)
merged = bch2_checksum_merge(new_csum_type, merged,
i->csum, i->len << 9);
else
merged = bch2_checksum_bio(c, crc_old.csum_type,
extent_nonce(version, crc_old), bio);
if (bch2_crc_cmp(merged, crc_old.csum) && !c->opts.no_data_io) {
struct printbuf buf = PRINTBUF;
prt_printf(&buf, "checksum error in %s() (memory corruption or bug?)\n"
" expected %0llx:%0llx got %0llx:%0llx (old type ",
__func__,
crc_old.csum.hi,
crc_old.csum.lo,
merged.hi,
merged.lo);
bch2_prt_csum_type(&buf, crc_old.csum_type);
prt_str(&buf, " new type ");
bch2_prt_csum_type(&buf, new_csum_type);
prt_str(&buf, ")");
WARN_RATELIMIT(1, "%s", buf.buf);
printbuf_exit(&buf);
return bch_err_throw(c, recompute_checksum);
}
for (i = splits; i < splits + ARRAY_SIZE(splits); i++) {
if (i->crc)
*i->crc = (struct bch_extent_crc_unpacked) {
.csum_type = i->csum_type,
.compression_type = crc_old.compression_type,
.compressed_size = i->len,
.uncompressed_size = i->len,
.offset = 0,
.live_size = i->len,
.nonce = crc_nonce,
.csum = i->csum,
};
if (bch2_csum_type_is_encryption(new_csum_type))
crc_nonce += i->len;
}
return 0;
}
/* BCH_SB_FIELD_crypt: */
static int bch2_sb_crypt_validate(struct bch_sb *sb, struct bch_sb_field *f,
enum bch_validate_flags flags, struct printbuf *err)
{
struct bch_sb_field_crypt *crypt = field_to_type(f, crypt);
if (vstruct_bytes(&crypt->field) < sizeof(*crypt)) {
prt_printf(err, "wrong size (got %zu should be %zu)",
vstruct_bytes(&crypt->field), sizeof(*crypt));
return -BCH_ERR_invalid_sb_crypt;
}
if (BCH_CRYPT_KDF_TYPE(crypt)) {
prt_printf(err, "bad kdf type %llu", BCH_CRYPT_KDF_TYPE(crypt));
return -BCH_ERR_invalid_sb_crypt;
}
return 0;
}
static void bch2_sb_crypt_to_text(struct printbuf *out, struct bch_sb *sb,
struct bch_sb_field *f)
{
struct bch_sb_field_crypt *crypt = field_to_type(f, crypt);
prt_printf(out, "KDF: %llu\n", BCH_CRYPT_KDF_TYPE(crypt));
prt_printf(out, "scrypt n: %llu\n", BCH_KDF_SCRYPT_N(crypt));
prt_printf(out, "scrypt r: %llu\n", BCH_KDF_SCRYPT_R(crypt));
prt_printf(out, "scrypt p: %llu\n", BCH_KDF_SCRYPT_P(crypt));
}
const struct bch_sb_field_ops bch_sb_field_ops_crypt = {
.validate = bch2_sb_crypt_validate,
.to_text = bch2_sb_crypt_to_text,
};
#ifdef __KERNEL__
static int __bch2_request_key(char *key_description, struct bch_key *key)
{
struct key *keyring_key;
const struct user_key_payload *ukp;
int ret;
keyring_key = request_key(&key_type_user, key_description, NULL);
if (IS_ERR(keyring_key))
return PTR_ERR(keyring_key);
down_read(&keyring_key->sem);
ukp = dereference_key_locked(keyring_key);
if (ukp->datalen == sizeof(*key)) {
memcpy(key, ukp->data, ukp->datalen);
ret = 0;
} else {
ret = -EINVAL;
}
up_read(&keyring_key->sem);
key_put(keyring_key);
return ret;
}
#else
#include <keyutils.h>
static int __bch2_request_key(char *key_description, struct bch_key *key)
{
key_serial_t key_id;
key_id = request_key("user", key_description, NULL,
KEY_SPEC_SESSION_KEYRING);
if (key_id >= 0)
goto got_key;
key_id = request_key("user", key_description, NULL,
KEY_SPEC_USER_KEYRING);
if (key_id >= 0)
goto got_key;
key_id = request_key("user", key_description, NULL,
KEY_SPEC_USER_SESSION_KEYRING);
if (key_id >= 0)
goto got_key;
return -errno;
got_key:
if (keyctl_read(key_id, (void *) key, sizeof(*key)) != sizeof(*key))
return -1;
return 0;
}
#include "crypto.h"
#endif
int bch2_request_key(struct bch_sb *sb, struct bch_key *key)
{
struct printbuf key_description = PRINTBUF;
int ret;
prt_printf(&key_description, "bcachefs:");
pr_uuid(&key_description, sb->user_uuid.b);
ret = __bch2_request_key(key_description.buf, key);
printbuf_exit(&key_description);
#ifndef __KERNEL__
if (ret) {
char *passphrase = read_passphrase("Enter passphrase: ");
struct bch_encrypted_key sb_key;
bch2_passphrase_check(sb, passphrase,
key, &sb_key);
ret = 0;
}
#endif
/* stash with memfd, pass memfd fd to mount */
return ret;
}
#ifndef __KERNEL__
int bch2_revoke_key(struct bch_sb *sb)
{
key_serial_t key_id;
struct printbuf key_description = PRINTBUF;
prt_printf(&key_description, "bcachefs:");
pr_uuid(&key_description, sb->user_uuid.b);
key_id = request_key("user", key_description.buf, NULL, KEY_SPEC_USER_KEYRING);
printbuf_exit(&key_description);
if (key_id < 0)
return errno;
keyctl_revoke(key_id);
return 0;
}
#endif
int bch2_decrypt_sb_key(struct bch_fs *c,
struct bch_sb_field_crypt *crypt,
struct bch_key *key)
{
struct bch_encrypted_key sb_key = crypt->key;
struct bch_key user_key;
int ret = 0;
/* is key encrypted? */
if (!bch2_key_is_encrypted(&sb_key))
goto out;
ret = bch2_request_key(c->disk_sb.sb, &user_key);
if (ret) {
bch_err(c, "error requesting encryption key: %s", bch2_err_str(ret));
goto err;
}
/* decrypt real key: */
bch2_chacha20(&user_key, bch2_sb_key_nonce(c), &sb_key, sizeof(sb_key));
if (bch2_key_is_encrypted(&sb_key)) {
bch_err(c, "incorrect encryption key");
ret = -EINVAL;
goto err;
}
out:
*key = sb_key.key;
err:
memzero_explicit(&sb_key, sizeof(sb_key));
memzero_explicit(&user_key, sizeof(user_key));
return ret;
}
#if 0
/*
* This seems to be duplicating code in cmd_remove_passphrase() in
* bcachefs-tools, but we might want to switch userspace to use this - and
* perhaps add an ioctl for calling this at runtime, so we can take the
* passphrase off of a mounted filesystem (which has come up).
*/
int bch2_disable_encryption(struct bch_fs *c)
{
struct bch_sb_field_crypt *crypt;
struct bch_key key;
int ret = -EINVAL;
mutex_lock(&c->sb_lock);
crypt = bch2_sb_field_get(c->disk_sb.sb, crypt);
if (!crypt)
goto out;
/* is key encrypted? */
ret = 0;
if (bch2_key_is_encrypted(&crypt->key))
goto out;
ret = bch2_decrypt_sb_key(c, crypt, &key);
if (ret)
goto out;
crypt->key.magic = cpu_to_le64(BCH_KEY_MAGIC);
crypt->key.key = key;
SET_BCH_SB_ENCRYPTION_TYPE(c->disk_sb.sb, 0);
bch2_write_super(c);
out:
mutex_unlock(&c->sb_lock);
return ret;
}
/*
* For enabling encryption on an existing filesystem: not hooked up yet, but it
* should be
*/
int bch2_enable_encryption(struct bch_fs *c, bool keyed)
{
struct bch_encrypted_key key;
struct bch_key user_key;
struct bch_sb_field_crypt *crypt;
int ret = -EINVAL;
mutex_lock(&c->sb_lock);
/* Do we already have an encryption key? */
if (bch2_sb_field_get(c->disk_sb.sb, crypt))
goto err;
ret = bch2_alloc_ciphers(c);
if (ret)
goto err;
key.magic = cpu_to_le64(BCH_KEY_MAGIC);
get_random_bytes(&key.key, sizeof(key.key));
if (keyed) {
ret = bch2_request_key(c->disk_sb.sb, &user_key);
if (ret) {
bch_err(c, "error requesting encryption key: %s", bch2_err_str(ret));
goto err;
}
ret = bch2_chacha_encrypt_key(&user_key, bch2_sb_key_nonce(c),
&key, sizeof(key));
if (ret)
goto err;
}
ret = crypto_skcipher_setkey(&c->chacha20->base,
(void *) &key.key, sizeof(key.key));
if (ret)
goto err;
crypt = bch2_sb_field_resize(&c->disk_sb, crypt,
sizeof(*crypt) / sizeof(u64));
if (!crypt) {
ret = bch_err_throw(c, ENOSPC_sb_crypt);
goto err;
}
crypt->key = key;
/* write superblock */
SET_BCH_SB_ENCRYPTION_TYPE(c->disk_sb.sb, 1);
bch2_write_super(c);
err:
mutex_unlock(&c->sb_lock);
memzero_explicit(&user_key, sizeof(user_key));
memzero_explicit(&key, sizeof(key));
return ret;
}
#endif
void bch2_fs_encryption_exit(struct bch_fs *c)
{
memzero_explicit(&c->chacha20_key, sizeof(c->chacha20_key));
}
int bch2_fs_encryption_init(struct bch_fs *c)
{
struct bch_sb_field_crypt *crypt;
int ret;
crypt = bch2_sb_field_get(c->disk_sb.sb, crypt);
if (!crypt)
return 0;
ret = bch2_decrypt_sb_key(c, crypt, &c->chacha20_key);
if (ret)
return ret;
c->chacha20_key_set = true;
return 0;
}


@ -1,240 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_CHECKSUM_H
#define _BCACHEFS_CHECKSUM_H
#include "bcachefs.h"
#include "extents_types.h"
#include "super-io.h"
#include <linux/crc64.h>
#include <crypto/chacha.h>
static inline bool bch2_checksum_mergeable(unsigned type)
{
switch (type) {
case BCH_CSUM_none:
case BCH_CSUM_crc32c:
case BCH_CSUM_crc64:
return true;
default:
return false;
}
}
struct bch_csum bch2_checksum_merge(unsigned, struct bch_csum,
struct bch_csum, size_t);
#define BCH_NONCE_EXTENT cpu_to_le32(1 << 28)
#define BCH_NONCE_BTREE cpu_to_le32(2 << 28)
#define BCH_NONCE_JOURNAL cpu_to_le32(3 << 28)
#define BCH_NONCE_PRIO cpu_to_le32(4 << 28)
#define BCH_NONCE_POLY cpu_to_le32(1 << 31)
struct bch_csum bch2_checksum(struct bch_fs *, unsigned, struct nonce,
const void *, size_t);
/*
* This is used for various on disk data structures - bch_sb, prio_set, bset,
* jset: The checksum is _always_ the first field of these structs
*/
#define csum_vstruct(_c, _type, _nonce, _i) \
({ \
const void *_start = ((const void *) (_i)) + sizeof((_i)->csum);\
\
bch2_checksum(_c, _type, _nonce, _start, vstruct_end(_i) - _start);\
})
static inline void bch2_csum_to_text(struct printbuf *out,
enum bch_csum_type type,
struct bch_csum csum)
{
const u8 *p = (u8 *) &csum;
unsigned bytes = type < BCH_CSUM_NR ? bch_crc_bytes[type] : 16;
for (unsigned i = 0; i < bytes; i++)
prt_hex_byte(out, p[i]);
}
static inline void bch2_csum_err_msg(struct printbuf *out,
enum bch_csum_type type,
struct bch_csum expected,
struct bch_csum got)
{
prt_str(out, "checksum error, type ");
bch2_prt_csum_type(out, type);
prt_str(out, ": got ");
bch2_csum_to_text(out, type, got);
prt_str(out, " should be ");
bch2_csum_to_text(out, type, expected);
}
void bch2_chacha20(const struct bch_key *, struct nonce, void *, size_t);
int bch2_request_key(struct bch_sb *, struct bch_key *);
#ifndef __KERNEL__
int bch2_revoke_key(struct bch_sb *);
#endif
int bch2_encrypt(struct bch_fs *, unsigned, struct nonce,
void *data, size_t);
struct bch_csum bch2_checksum_bio(struct bch_fs *, unsigned,
struct nonce, struct bio *);
int bch2_rechecksum_bio(struct bch_fs *, struct bio *, struct bversion,
struct bch_extent_crc_unpacked,
struct bch_extent_crc_unpacked *,
struct bch_extent_crc_unpacked *,
unsigned, unsigned, unsigned);
int __bch2_encrypt_bio(struct bch_fs *, unsigned,
struct nonce, struct bio *);
static inline int bch2_encrypt_bio(struct bch_fs *c, unsigned type,
struct nonce nonce, struct bio *bio)
{
return bch2_csum_type_is_encryption(type)
? __bch2_encrypt_bio(c, type, nonce, bio)
: 0;
}
extern const struct bch_sb_field_ops bch_sb_field_ops_crypt;
int bch2_decrypt_sb_key(struct bch_fs *, struct bch_sb_field_crypt *,
struct bch_key *);
#if 0
int bch2_disable_encryption(struct bch_fs *);
int bch2_enable_encryption(struct bch_fs *, bool);
#endif
void bch2_fs_encryption_exit(struct bch_fs *);
int bch2_fs_encryption_init(struct bch_fs *);
static inline enum bch_csum_type bch2_csum_opt_to_type(enum bch_csum_opt type,
bool data)
{
switch (type) {
case BCH_CSUM_OPT_none:
return BCH_CSUM_none;
case BCH_CSUM_OPT_crc32c:
return data ? BCH_CSUM_crc32c : BCH_CSUM_crc32c_nonzero;
case BCH_CSUM_OPT_crc64:
return data ? BCH_CSUM_crc64 : BCH_CSUM_crc64_nonzero;
case BCH_CSUM_OPT_xxhash:
return BCH_CSUM_xxhash;
default:
BUG();
}
}
static inline enum bch_csum_type bch2_data_checksum_type(struct bch_fs *c,
struct bch_io_opts opts)
{
if (opts.nocow)
return 0;
if (c->sb.encryption_type)
return c->opts.wide_macs
? BCH_CSUM_chacha20_poly1305_128
: BCH_CSUM_chacha20_poly1305_80;
return bch2_csum_opt_to_type(opts.data_checksum, true);
}
static inline enum bch_csum_type bch2_meta_checksum_type(struct bch_fs *c)
{
if (c->sb.encryption_type)
return BCH_CSUM_chacha20_poly1305_128;
return bch2_csum_opt_to_type(c->opts.metadata_checksum, false);
}
static inline bool bch2_checksum_type_valid(const struct bch_fs *c,
unsigned type)
{
if (type >= BCH_CSUM_NR)
return false;
if (bch2_csum_type_is_encryption(type) && !c->chacha20_key_set)
return false;
return true;
}
/* returns true if not equal */
static inline bool bch2_crc_cmp(struct bch_csum l, struct bch_csum r)
{
	/*
	 * XXX: need some way of preventing the compiler from optimizing this
	 * into a form that isn't constant time..
	 */
	return ((l.lo ^ r.lo) | (l.hi ^ r.hi)) != 0;
}

/* for skipping ahead and encrypting/decrypting at an offset: */
static inline struct nonce nonce_add(struct nonce nonce, unsigned offset)
{
	EBUG_ON(offset & (CHACHA_BLOCK_SIZE - 1));

	le32_add_cpu(&nonce.d[0], offset / CHACHA_BLOCK_SIZE);
	return nonce;
}

static inline struct nonce null_nonce(void)
{
	struct nonce ret;

	memset(&ret, 0, sizeof(ret));
	return ret;
}

static inline struct nonce extent_nonce(struct bversion version,
					struct bch_extent_crc_unpacked crc)
{
	unsigned compression_type = crc_is_compressed(crc)
		? crc.compression_type
		: 0;
	unsigned size = compression_type ? crc.uncompressed_size : 0;
	struct nonce nonce = (struct nonce) {{
		[0] = cpu_to_le32(size << 22),
		[1] = cpu_to_le32(version.lo),
		[2] = cpu_to_le32(version.lo >> 32),
		[3] = cpu_to_le32(version.hi|
				  (compression_type << 24))^BCH_NONCE_EXTENT,
	}};

	return nonce_add(nonce, crc.nonce << 9);
}
static inline bool bch2_key_is_encrypted(struct bch_encrypted_key *key)
{
return le64_to_cpu(key->magic) != BCH_KEY_MAGIC;
}
static inline struct nonce __bch2_sb_key_nonce(struct bch_sb *sb)
{
__le64 magic = __bch2_sb_magic(sb);
return (struct nonce) {{
[0] = 0,
[1] = 0,
[2] = ((__le32 *) &magic)[0],
[3] = ((__le32 *) &magic)[1],
}};
}
static inline struct nonce bch2_sb_key_nonce(struct bch_fs *c)
{
__le64 magic = bch2_sb_magic(c);
return (struct nonce) {{
[0] = 0,
[1] = 0,
[2] = ((__le32 *) &magic)[0],
[3] = ((__le32 *) &magic)[1],
}};
}
#endif /* _BCACHEFS_CHECKSUM_H */
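
The `nonce_add()` helper in this header lets a caller start encrypting or decrypting partway into an extent by advancing the ChaCha block counter. A minimal userspace sketch of that idea follows. It assumes 64-byte ChaCha blocks and host-endian counter words (the kernel code keeps them little-endian and uses `le32_add_cpu()`); `demo_nonce` and `demo_nonce_add` are hypothetical names, not part of bcachefs.

```c
#include <stdint.h>

#define DEMO_CHACHA_BLOCK_SIZE 64u	/* ChaCha20 emits keystream in 64-byte blocks */

/* Hypothetical stand-in for struct nonce: word 0 is treated as the block counter */
struct demo_nonce {
	uint32_t d[4];
};

/*
 * Advance the block counter so the keystream starts at byte `offset`.
 * Returns 0 on success, -1 if offset is not a multiple of the block size
 * (the kernel version asserts this with EBUG_ON() instead of returning).
 */
static int demo_nonce_add(struct demo_nonce *nonce, unsigned offset)
{
	if (offset & (DEMO_CHACHA_BLOCK_SIZE - 1))
		return -1;

	nonce->d[0] += offset / DEMO_CHACHA_BLOCK_SIZE;
	return 0;
}
```

An offset of 128 bytes advances the counter by two blocks. In `extent_nonce()` the offset passed in is `crc.nonce << 9`, that is, a count of 512-byte sectors converted to bytes, so it is always block-aligned.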

View File

@ -1,181 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
#include "bcachefs.h"
#include "clock.h"
#include <linux/freezer.h>
#include <linux/kthread.h>
#include <linux/preempt.h>
static inline bool io_timer_cmp(const void *l, const void *r, void __always_unused *args)
{
struct io_timer **_l = (struct io_timer **)l;
struct io_timer **_r = (struct io_timer **)r;
return (*_l)->expire < (*_r)->expire;
}
static const struct min_heap_callbacks callbacks = {
.less = io_timer_cmp,
.swp = NULL,
};
void bch2_io_timer_add(struct io_clock *clock, struct io_timer *timer)
{
	spin_lock(&clock->timer_lock);

	/* Already expired: run the callback now, without holding the lock */
	if (time_after_eq64((u64) atomic64_read(&clock->now), timer->expire)) {
		spin_unlock(&clock->timer_lock);
		timer->fn(timer);
		return;
	}

	/* Adding a timer that is already queued is a no-op */
	for (size_t i = 0; i < clock->timers.nr; i++)
		if (clock->timers.data[i] == timer)
			goto out;

	BUG_ON(!min_heap_push(&clock->timers, &timer, &callbacks, NULL));
out:
	spin_unlock(&clock->timer_lock);
}
void bch2_io_timer_del(struct io_clock *clock, struct io_timer *timer)
{
spin_lock(&clock->timer_lock);
for (size_t i = 0; i < clock->timers.nr; i++)
if (clock->timers.data[i] == timer) {
min_heap_del(&clock->timers, i, &callbacks, NULL);
break;
}
spin_unlock(&clock->timer_lock);
}
struct io_clock_wait {
struct io_timer io_timer;
struct task_struct *task;
int expired;
};
static void io_clock_wait_fn(struct io_timer *timer)
{
struct io_clock_wait *wait = container_of(timer,
struct io_clock_wait, io_timer);
wait->expired = 1;
wake_up_process(wait->task);
}
void bch2_io_clock_schedule_timeout(struct io_clock *clock, u64 until)
{
struct io_clock_wait wait = {
.io_timer.expire = until,
.io_timer.fn = io_clock_wait_fn,
.io_timer.fn2 = (void *) _RET_IP_,
.task = current,
};
bch2_io_timer_add(clock, &wait.io_timer);
schedule();
bch2_io_timer_del(clock, &wait.io_timer);
}
unsigned long bch2_kthread_io_clock_wait_once(struct io_clock *clock,
u64 io_until, unsigned long cpu_timeout)
{
bool kthread = (current->flags & PF_KTHREAD) != 0;
struct io_clock_wait wait = {
.io_timer.expire = io_until,
.io_timer.fn = io_clock_wait_fn,
.io_timer.fn2 = (void *) _RET_IP_,
.task = current,
};
bch2_io_timer_add(clock, &wait.io_timer);
set_current_state(TASK_INTERRUPTIBLE);
if (!(kthread && kthread_should_stop())) {
cpu_timeout = schedule_timeout(cpu_timeout);
try_to_freeze();
}
__set_current_state(TASK_RUNNING);
bch2_io_timer_del(clock, &wait.io_timer);
return cpu_timeout;
}
void bch2_kthread_io_clock_wait(struct io_clock *clock,
u64 io_until, unsigned long cpu_timeout)
{
bool kthread = (current->flags & PF_KTHREAD) != 0;
while (!(kthread && kthread_should_stop()) &&
cpu_timeout &&
atomic64_read(&clock->now) < io_until)
cpu_timeout = bch2_kthread_io_clock_wait_once(clock, io_until, cpu_timeout);
}
static struct io_timer *get_expired_timer(struct io_clock *clock, u64 now)
{
	struct io_timer *ret = NULL;

	if (clock->timers.nr &&
	    time_after_eq64(now, clock->timers.data[0]->expire)) {
		ret = *min_heap_peek(&clock->timers);
		min_heap_pop(&clock->timers, &callbacks, NULL);
	}

	return ret;
}

void __bch2_increment_clock(struct io_clock *clock, u64 sectors)
{
	struct io_timer *timer;
	u64 now = atomic64_add_return(sectors, &clock->now);

	spin_lock(&clock->timer_lock);
	while ((timer = get_expired_timer(clock, now)))
		timer->fn(timer);
	spin_unlock(&clock->timer_lock);
}
void bch2_io_timers_to_text(struct printbuf *out, struct io_clock *clock)
{
out->atomic++;
spin_lock(&clock->timer_lock);
u64 now = atomic64_read(&clock->now);
printbuf_tabstop_push(out, 40);
prt_printf(out, "current time:\t%llu\n", now);
for (unsigned i = 0; i < clock->timers.nr; i++)
prt_printf(out, "%ps %ps:\t%llu\n",
clock->timers.data[i]->fn,
clock->timers.data[i]->fn2,
clock->timers.data[i]->expire);
spin_unlock(&clock->timer_lock);
--out->atomic;
}
void bch2_io_clock_exit(struct io_clock *clock)
{
free_heap(&clock->timers);
free_percpu(clock->pcpu_buf);
}
int bch2_io_clock_init(struct io_clock *clock)
{
atomic64_set(&clock->now, 0);
spin_lock_init(&clock->timer_lock);
clock->max_slop = IO_CLOCK_PCPU_SECTORS * num_possible_cpus();
clock->pcpu_buf = alloc_percpu(*clock->pcpu_buf);
if (!clock->pcpu_buf)
return -BCH_ERR_ENOMEM_io_clock_init;
if (!init_heap(&clock->timers, NR_IO_TIMERS, GFP_KERNEL))
return -BCH_ERR_ENOMEM_io_clock_init;
return 0;
}
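
The file above keeps pending timers in a min-heap ordered by `expire` and fires every timer that is due each time the IO clock advances. The following is a minimal single-threaded sketch of that pattern, not the kernel implementation: it uses a fixed-size array heap, no locking, and plain comparisons instead of `time_after_eq64()`. All `demo_*` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_TIMERS 16

struct demo_timer {
	void (*fn)(struct demo_timer *);
	uint64_t expire;
	int fired;
};

struct demo_clock {
	uint64_t now;
	size_t nr;
	struct demo_timer *heap[DEMO_MAX_TIMERS];	/* min-heap ordered by expire */
};

static void demo_swap(struct demo_timer **a, struct demo_timer **b)
{
	struct demo_timer *t = *a;

	*a = *b;
	*b = t;
}

/* Queue a timer, or fire it at once if it is already due; -1 if the heap is full. */
static int demo_timer_add(struct demo_clock *c, struct demo_timer *t)
{
	size_t i;

	if (c->now >= t->expire) {
		t->fn(t);
		return 0;
	}
	if (c->nr == DEMO_MAX_TIMERS)
		return -1;

	i = c->nr++;
	c->heap[i] = t;
	/* sift up until the parent expires no later than the new entry */
	while (i && c->heap[i]->expire < c->heap[(i - 1) / 2]->expire) {
		demo_swap(&c->heap[i], &c->heap[(i - 1) / 2]);
		i = (i - 1) / 2;
	}
	return 0;
}

/* Remove and return the earliest timer if it has expired, else NULL. */
static struct demo_timer *demo_pop_expired(struct demo_clock *c)
{
	struct demo_timer *ret;
	size_t i = 0;

	if (!c->nr || c->now < c->heap[0]->expire)
		return NULL;

	ret = c->heap[0];
	c->heap[0] = c->heap[--c->nr];
	for (;;) {	/* sift down */
		size_t l = 2 * i + 1, r = l + 1, m = i;

		if (l < c->nr && c->heap[l]->expire < c->heap[m]->expire)
			m = l;
		if (r < c->nr && c->heap[r]->expire < c->heap[m]->expire)
			m = r;
		if (m == i)
			break;
		demo_swap(&c->heap[i], &c->heap[m]);
		i = m;
	}
	return ret;
}

/* Advance the clock by `sectors` and run every timer that is now due. */
static void demo_clock_advance(struct demo_clock *c, uint64_t sectors)
{
	struct demo_timer *t;

	c->now += sectors;
	while ((t = demo_pop_expired(c)))
		t->fn(t);
}

/* Example callback: just record that the timer ran. */
static void demo_mark_fired(struct demo_timer *t)
{
	t->fired = 1;
}
```

Because only the heap root is inspected, advancing the clock costs nothing beyond one comparison when no timer is due, which matters for a function called on the IO path.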

View File

@ -1,29 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_CLOCK_H
#define _BCACHEFS_CLOCK_H
void bch2_io_timer_add(struct io_clock *, struct io_timer *);
void bch2_io_timer_del(struct io_clock *, struct io_timer *);
unsigned long bch2_kthread_io_clock_wait_once(struct io_clock *, u64, unsigned long);
void bch2_kthread_io_clock_wait(struct io_clock *, u64, unsigned long);
void __bch2_increment_clock(struct io_clock *, u64);
static inline void bch2_increment_clock(struct bch_fs *c, u64 sectors,
					int rw)
{
	struct io_clock *clock = &c->io_clock[rw];

	/* Batch in a percpu counter; flush to the shared clock past the threshold */
	if (unlikely(this_cpu_add_return(*clock->pcpu_buf, sectors) >=
		   IO_CLOCK_PCPU_SECTORS))
		__bch2_increment_clock(clock, this_cpu_xchg(*clock->pcpu_buf, 0));
}
void bch2_io_clock_schedule_timeout(struct io_clock *, u64);
void bch2_io_timers_to_text(struct printbuf *, struct io_clock *);
void bch2_io_clock_exit(struct io_clock *);
int bch2_io_clock_init(struct io_clock *);
#endif /* _BCACHEFS_CLOCK_H */
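
`bch2_increment_clock()` batches IO accounting in a percpu buffer and only touches the shared clock once the buffer reaches `IO_CLOCK_PCPU_SECTORS`. A minimal sketch of that batching, assuming a single buffer standing in for one CPU's slot and small per-call sector counts; the `demo_*` names are hypothetical:

```c
#include <stdint.h>

#define DEMO_PCPU_SECTORS 128	/* flush threshold, mirroring IO_CLOCK_PCPU_SECTORS */

struct demo_io_clock {
	uint64_t now;	/* shared clock; an atomic64_t in the kernel */
	uint16_t buf;	/* stand-in for one CPU's percpu buffer */
};

/* Account `sectors` of IO; update the shared clock only once the buffer fills. */
static void demo_increment_clock(struct demo_io_clock *clock, uint16_t sectors)
{
	clock->buf += sectors;
	if (clock->buf >= DEMO_PCPU_SECTORS) {
		clock->now += clock->buf;
		clock->buf = 0;
	}
}
```

The trade-off is accuracy: each CPU can hold back just under one threshold's worth of sectors, which appears to be what `max_slop = IO_CLOCK_PCPU_SECTORS * num_possible_cpus()` in `bch2_io_clock_init()` accounts for.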

View File

@ -1,38 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_CLOCK_TYPES_H
#define _BCACHEFS_CLOCK_TYPES_H
#include "util.h"
#define NR_IO_TIMERS (BCH_SB_MEMBERS_MAX * 3)
/*
* Clocks/timers in units of sectors of IO:
*
* Note - they use percpu batching, so they're only approximate.
*/
struct io_timer;
typedef void (*io_timer_fn)(struct io_timer *);
struct io_timer {
io_timer_fn fn;
void *fn2;
u64 expire;
};
/* Amount to buffer up on a percpu counter */
#define IO_CLOCK_PCPU_SECTORS 128
typedef DEFINE_MIN_HEAP(struct io_timer *, io_timer_heap) io_timer_heap;
struct io_clock {
atomic64_t now;
u16 __percpu *pcpu_buf;
unsigned max_slop;
spinlock_t timer_lock;
io_timer_heap timers;
};
#endif /* _BCACHEFS_CLOCK_TYPES_H */

View File

@ -1,773 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
#include "bcachefs.h"
#include "checksum.h"
#include "compress.h"
#include "error.h"
#include "extents.h"
#include "io_write.h"
#include "opts.h"
#include "super-io.h"
#include <linux/lz4.h>
#include <linux/zlib.h>
#include <linux/zstd.h>
static inline enum bch_compression_opts bch2_compression_type_to_opt(enum bch_compression_type type)
{
switch (type) {
case BCH_COMPRESSION_TYPE_none:
case BCH_COMPRESSION_TYPE_incompressible:
return BCH_COMPRESSION_OPT_none;
case BCH_COMPRESSION_TYPE_lz4_old:
case BCH_COMPRESSION_TYPE_lz4:
return BCH_COMPRESSION_OPT_lz4;
case BCH_COMPRESSION_TYPE_gzip:
return BCH_COMPRESSION_OPT_gzip;
case BCH_COMPRESSION_TYPE_zstd:
return BCH_COMPRESSION_OPT_zstd;
default:
BUG();
}
}
/* Bounce buffer: */
struct bbuf {
void *b;
enum {
BB_NONE,
BB_VMAP,
BB_KMALLOC,
BB_MEMPOOL,
} type;
int rw;
};
static struct bbuf __bounce_alloc(struct bch_fs *c, unsigned size, int rw)
{
void *b;
BUG_ON(size > c->opts.encoded_extent_max);
b = kmalloc(size, GFP_NOFS|__GFP_NOWARN);
if (b)
return (struct bbuf) { .b = b, .type = BB_KMALLOC, .rw = rw };
b = mempool_alloc(&c->compression_bounce[rw], GFP_NOFS);
if (b)
return (struct bbuf) { .b = b, .type = BB_MEMPOOL, .rw = rw };
BUG();
}
static bool bio_phys_contig(struct bio *bio, struct bvec_iter start)
{
struct bio_vec bv;
struct bvec_iter iter;
void *expected_start = NULL;
__bio_for_each_bvec(bv, bio, iter, start) {
if (expected_start &&
expected_start != page_address(bv.bv_page) + bv.bv_offset)
return false;
expected_start = page_address(bv.bv_page) +
bv.bv_offset + bv.bv_len;
}
return true;
}
static struct bbuf __bio_map_or_bounce(struct bch_fs *c, struct bio *bio,
struct bvec_iter start, int rw)
{
struct bbuf ret;
struct bio_vec bv;
struct bvec_iter iter;
unsigned nr_pages = 0;
struct page *stack_pages[16];
struct page **pages = NULL;
void *data;
BUG_ON(start.bi_size > c->opts.encoded_extent_max);
if (!PageHighMem(bio_iter_page(bio, start)) &&
bio_phys_contig(bio, start))
return (struct bbuf) {
.b = page_address(bio_iter_page(bio, start)) +
bio_iter_offset(bio, start),
.type = BB_NONE, .rw = rw
};
/* check if we can map the pages contiguously: */
__bio_for_each_segment(bv, bio, iter, start) {
if (iter.bi_size != start.bi_size &&
bv.bv_offset)
goto bounce;
if (bv.bv_len < iter.bi_size &&
bv.bv_offset + bv.bv_len < PAGE_SIZE)
goto bounce;
nr_pages++;
}
BUG_ON(DIV_ROUND_UP(start.bi_size, PAGE_SIZE) > nr_pages);
pages = nr_pages > ARRAY_SIZE(stack_pages)
? kmalloc_array(nr_pages, sizeof(struct page *), GFP_NOFS)
: stack_pages;
if (!pages)
goto bounce;
nr_pages = 0;
__bio_for_each_segment(bv, bio, iter, start)
pages[nr_pages++] = bv.bv_page;
data = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
if (pages != stack_pages)
kfree(pages);
if (data)
return (struct bbuf) {
.b = data + bio_iter_offset(bio, start),
.type = BB_VMAP, .rw = rw
};
bounce:
ret = __bounce_alloc(c, start.bi_size, rw);
if (rw == READ)
memcpy_from_bio(ret.b, bio, start);
return ret;
}
static struct bbuf bio_map_or_bounce(struct bch_fs *c, struct bio *bio, int rw)
{
return __bio_map_or_bounce(c, bio, bio->bi_iter, rw);
}
static void bio_unmap_or_unbounce(struct bch_fs *c, struct bbuf buf)
{
switch (buf.type) {
case BB_NONE:
break;
case BB_VMAP:
vunmap((void *) ((unsigned long) buf.b & PAGE_MASK));
break;
case BB_KMALLOC:
kfree(buf.b);
break;
case BB_MEMPOOL:
mempool_free(buf.b, &c->compression_bounce[buf.rw]);
break;
}
}
static inline void zlib_set_workspace(z_stream *strm, void *workspace)
{
#ifdef __KERNEL__
strm->workspace = workspace;
#endif
}
static int __bio_uncompress(struct bch_fs *c, struct bio *src,
void *dst_data, struct bch_extent_crc_unpacked crc)
{
struct bbuf src_data = { NULL };
size_t src_len = src->bi_iter.bi_size;
size_t dst_len = crc.uncompressed_size << 9;
void *workspace;
int ret = 0, ret2;
enum bch_compression_opts opt = bch2_compression_type_to_opt(crc.compression_type);
mempool_t *workspace_pool = &c->compress_workspace[opt];
if (unlikely(!mempool_initialized(workspace_pool))) {
if (fsck_err(c, compression_type_not_marked_in_sb,
"compression type %s set but not marked in superblock",
__bch2_compression_types[crc.compression_type]))
ret = bch2_check_set_has_compressed_data(c, opt);
else
ret = bch_err_throw(c, compression_workspace_not_initialized);
if (ret)
goto err;
}
src_data = bio_map_or_bounce(c, src, READ);
switch (crc.compression_type) {
case BCH_COMPRESSION_TYPE_lz4_old:
case BCH_COMPRESSION_TYPE_lz4:
ret2 = LZ4_decompress_safe_partial(src_data.b, dst_data,
src_len, dst_len, dst_len);
if (ret2 != dst_len)
ret = bch_err_throw(c, decompress_lz4);
break;
case BCH_COMPRESSION_TYPE_gzip: {
z_stream strm = {
.next_in = src_data.b,
.avail_in = src_len,
.next_out = dst_data,
.avail_out = dst_len,
};
workspace = mempool_alloc(workspace_pool, GFP_NOFS);
zlib_set_workspace(&strm, workspace);
zlib_inflateInit2(&strm, -MAX_WBITS);
ret2 = zlib_inflate(&strm, Z_FINISH);
mempool_free(workspace, workspace_pool);
if (ret2 != Z_STREAM_END)
ret = bch_err_throw(c, decompress_gzip);
break;
}
case BCH_COMPRESSION_TYPE_zstd: {
ZSTD_DCtx *ctx;
size_t real_src_len = le32_to_cpup(src_data.b);
if (real_src_len > src_len - 4) {
ret = bch_err_throw(c, decompress_zstd_src_len_bad);
goto err;
}
workspace = mempool_alloc(workspace_pool, GFP_NOFS);
ctx = zstd_init_dctx(workspace, zstd_dctx_workspace_bound());
ret2 = zstd_decompress_dctx(ctx,
dst_data, dst_len,
src_data.b + 4, real_src_len);
mempool_free(workspace, workspace_pool);
if (ret2 != dst_len)
ret = bch_err_throw(c, decompress_zstd);
break;
}
default:
BUG();
}
err:
fsck_err:
bio_unmap_or_unbounce(c, src_data);
return ret;
}
int bch2_bio_uncompress_inplace(struct bch_write_op *op,
struct bio *bio)
{
struct bch_fs *c = op->c;
struct bch_extent_crc_unpacked *crc = &op->crc;
struct bbuf data = { NULL };
size_t dst_len = crc->uncompressed_size << 9;
int ret = 0;
/* bio must own its pages: */
BUG_ON(!bio->bi_vcnt);
BUG_ON(DIV_ROUND_UP(crc->live_size, PAGE_SECTORS) > bio->bi_max_vecs);
if (crc->uncompressed_size << 9 > c->opts.encoded_extent_max) {
bch2_write_op_error(op, op->pos.offset,
"extent too big to decompress (%u > %u)",
crc->uncompressed_size << 9, c->opts.encoded_extent_max);
return bch_err_throw(c, decompress_exceeded_max_encoded_extent);
}
data = __bounce_alloc(c, dst_len, WRITE);
ret = __bio_uncompress(c, bio, data.b, *crc);
if (c->opts.no_data_io)
ret = 0;
if (ret) {
bch2_write_op_error(op, op->pos.offset, "%s", bch2_err_str(ret));
goto err;
}
/*
* XXX: don't have a good way to assert that the bio was allocated with
* enough space, we depend on bch2_move_extent doing the right thing
*/
bio->bi_iter.bi_size = crc->live_size << 9;
memcpy_to_bio(bio, bio->bi_iter, data.b + (crc->offset << 9));
crc->csum_type = 0;
crc->compression_type = 0;
crc->compressed_size = crc->live_size;
crc->uncompressed_size = crc->live_size;
crc->offset = 0;
crc->csum = (struct bch_csum) { 0, 0 };
err:
bio_unmap_or_unbounce(c, data);
return ret;
}
int bch2_bio_uncompress(struct bch_fs *c, struct bio *src,
struct bio *dst, struct bvec_iter dst_iter,
struct bch_extent_crc_unpacked crc)
{
struct bbuf dst_data = { NULL };
size_t dst_len = crc.uncompressed_size << 9;
int ret;
if (crc.uncompressed_size << 9 > c->opts.encoded_extent_max ||
crc.compressed_size << 9 > c->opts.encoded_extent_max)
return bch_err_throw(c, decompress_exceeded_max_encoded_extent);
dst_data = dst_len == dst_iter.bi_size
? __bio_map_or_bounce(c, dst, dst_iter, WRITE)
: __bounce_alloc(c, dst_len, WRITE);
ret = __bio_uncompress(c, src, dst_data.b, crc);
if (ret)
goto err;
if (dst_data.type != BB_NONE &&
dst_data.type != BB_VMAP)
memcpy_to_bio(dst, dst_iter, dst_data.b + (crc.offset << 9));
err:
bio_unmap_or_unbounce(c, dst_data);
return ret;
}
static int attempt_compress(struct bch_fs *c,
void *workspace,
void *dst, size_t dst_len,
void *src, size_t src_len,
struct bch_compression_opt compression)
{
enum bch_compression_type compression_type =
__bch2_compression_opt_to_type[compression.type];
switch (compression_type) {
case BCH_COMPRESSION_TYPE_lz4:
if (compression.level < LZ4HC_MIN_CLEVEL) {
int len = src_len;
int ret = LZ4_compress_destSize(
src, dst,
&len, dst_len,
workspace);
if (len < src_len)
return -len;
return ret;
} else {
int ret = LZ4_compress_HC(
src, dst,
src_len, dst_len,
compression.level,
workspace);
return ret ?: -1;
}
case BCH_COMPRESSION_TYPE_gzip: {
z_stream strm = {
.next_in = src,
.avail_in = src_len,
.next_out = dst,
.avail_out = dst_len,
};
zlib_set_workspace(&strm, workspace);
if (zlib_deflateInit2(&strm,
compression.level
? clamp_t(unsigned, compression.level,
Z_BEST_SPEED, Z_BEST_COMPRESSION)
: Z_DEFAULT_COMPRESSION,
Z_DEFLATED, -MAX_WBITS, DEF_MEM_LEVEL,
Z_DEFAULT_STRATEGY) != Z_OK)
return 0;
if (zlib_deflate(&strm, Z_FINISH) != Z_STREAM_END)
return 0;
if (zlib_deflateEnd(&strm) != Z_OK)
return 0;
return strm.total_out;
}
case BCH_COMPRESSION_TYPE_zstd: {
/*
* rescale:
* zstd max compression level is 22, our max level is 15
*/
unsigned level = min((compression.level * 3) / 2, zstd_max_clevel());
ZSTD_parameters params = zstd_get_params(level, c->opts.encoded_extent_max);
ZSTD_CCtx *ctx = zstd_init_cctx(workspace, c->zstd_workspace_size);
/*
* ZSTD requires that when we decompress we pass in the exact
* compressed size - rounding it up to the nearest sector
* doesn't work, so we use the first 4 bytes of the buffer for
* that.
*
* Additionally, the ZSTD code seems to have a bug where it will
* write just past the end of the buffer - so subtract a fudge
* factor (7 bytes) from the dst buffer size to account for
* that.
*/
size_t len = zstd_compress_cctx(ctx,
dst + 4, dst_len - 4 - 7,
src, src_len,
&params);
if (zstd_is_error(len))
return 0;
*((__le32 *) dst) = cpu_to_le32(len);
return len + 4;
}
default:
BUG();
}
}
static unsigned __bio_compress(struct bch_fs *c,
			       struct bio *dst, size_t *dst_len,
			       struct bio *src, size_t *src_len,
			       struct bch_compression_opt compression)
{
	struct bbuf src_data = { NULL }, dst_data = { NULL };
	void *workspace;
	enum bch_compression_type compression_type =
		__bch2_compression_opt_to_type[compression.type];
	unsigned pad;
	int ret = 0;

	/* bch2_compression_decode catches unknown compression types: */
	BUG_ON(compression.type >= BCH_COMPRESSION_OPT_NR);

	mempool_t *workspace_pool = &c->compress_workspace[compression.type];
	if (unlikely(!mempool_initialized(workspace_pool))) {
		if (fsck_err(c, compression_opt_not_marked_in_sb,
			     "compression opt %s set but not marked in superblock",
			     bch2_compression_opts[compression.type])) {
			ret = bch2_check_set_has_compressed_data(c, compression.type);
			if (ret) /* memory allocation failure, don't compress */
				return 0;
		} else {
			return 0;
		}
	}

	/* If it's only one block, don't bother trying to compress: */
	if (src->bi_iter.bi_size <= c->opts.block_size)
		return BCH_COMPRESSION_TYPE_incompressible;

	dst_data = bio_map_or_bounce(c, dst, WRITE);
	src_data = bio_map_or_bounce(c, src, READ);

	workspace = mempool_alloc(workspace_pool, GFP_NOFS);

	*src_len = src->bi_iter.bi_size;
	*dst_len = dst->bi_iter.bi_size;

	/*
	 * XXX: this algorithm performs poorly when the compression code
	 * doesn't tell us how much would fit (LZ4 does tell us):
	 */
	while (1) {
		if (*src_len <= block_bytes(c)) {
			ret = -1;
			break;
		}

		ret = attempt_compress(c, workspace,
				       dst_data.b, *dst_len,
				       src_data.b, *src_len,
				       compression);
		if (ret > 0) {
			*dst_len = ret;
			ret = 0;
			break;
		}

		/* Didn't fit: should we retry with a smaller amount? */
		if (*src_len <= *dst_len) {
			ret = -1;
			break;
		}

		/*
		 * If ret is negative, it's a hint as to how much data would fit
		 */
		BUG_ON(-ret >= *src_len);

		if (ret < 0)
			*src_len = -ret;
		else
			*src_len -= (*src_len - *dst_len) / 2;
		*src_len = round_down(*src_len, block_bytes(c));
	}

	mempool_free(workspace, workspace_pool);

	if (ret)
		goto err;

	/* Didn't get smaller: */
	if (round_up(*dst_len, block_bytes(c)) >= *src_len)
		goto err;

	pad = round_up(*dst_len, block_bytes(c)) - *dst_len;

	memset(dst_data.b + *dst_len, 0, pad);
	*dst_len += pad;

	if (dst_data.type != BB_NONE &&
	    dst_data.type != BB_VMAP)
		memcpy_to_bio(dst, dst->bi_iter, dst_data.b);

	BUG_ON(!*dst_len || *dst_len > dst->bi_iter.bi_size);
	BUG_ON(!*src_len || *src_len > src->bi_iter.bi_size);
	BUG_ON(*dst_len & (block_bytes(c) - 1));
	BUG_ON(*src_len & (block_bytes(c) - 1));
	ret = compression_type;
out:
	bio_unmap_or_unbounce(c, src_data);
	bio_unmap_or_unbounce(c, dst_data);
	return ret;
err:
	ret = BCH_COMPRESSION_TYPE_incompressible;
	goto out;
fsck_err:
	ret = 0;
	goto out;
}
unsigned bch2_bio_compress(struct bch_fs *c,
struct bio *dst, size_t *dst_len,
struct bio *src, size_t *src_len,
unsigned compression_opt)
{
unsigned orig_dst = dst->bi_iter.bi_size;
unsigned orig_src = src->bi_iter.bi_size;
unsigned compression_type;
/* Don't consume more than c->opts.encoded_extent_max from @src: */
src->bi_iter.bi_size = min_t(unsigned, src->bi_iter.bi_size,
c->opts.encoded_extent_max);
/* Don't generate a bigger output than input: */
dst->bi_iter.bi_size = min(dst->bi_iter.bi_size, src->bi_iter.bi_size);
compression_type =
__bio_compress(c, dst, dst_len, src, src_len,
bch2_compression_decode(compression_opt));
dst->bi_iter.bi_size = orig_dst;
src->bi_iter.bi_size = orig_src;
return compression_type;
}
static int __bch2_fs_compress_init(struct bch_fs *, u64);
#define BCH_FEATURE_none 0
static const unsigned bch2_compression_opt_to_feature[] = {
#define x(t, n) [BCH_COMPRESSION_OPT_##t] = BCH_FEATURE_##t,
BCH_COMPRESSION_OPTS()
#undef x
};
#undef BCH_FEATURE_none
static int __bch2_check_set_has_compressed_data(struct bch_fs *c, u64 f)
{
int ret = 0;
if ((c->sb.features & f) == f)
return 0;
mutex_lock(&c->sb_lock);
if ((c->sb.features & f) == f) {
mutex_unlock(&c->sb_lock);
return 0;
}
ret = __bch2_fs_compress_init(c, c->sb.features|f);
if (ret) {
mutex_unlock(&c->sb_lock);
return ret;
}
c->disk_sb.sb->features[0] |= cpu_to_le64(f);
bch2_write_super(c);
mutex_unlock(&c->sb_lock);
return 0;
}
int bch2_check_set_has_compressed_data(struct bch_fs *c,
unsigned compression_opt)
{
unsigned compression_type = bch2_compression_decode(compression_opt).type;
BUG_ON(compression_type >= ARRAY_SIZE(bch2_compression_opt_to_feature));
return compression_type
? __bch2_check_set_has_compressed_data(c,
1ULL << bch2_compression_opt_to_feature[compression_type])
: 0;
}
void bch2_fs_compress_exit(struct bch_fs *c)
{
unsigned i;
for (i = 0; i < ARRAY_SIZE(c->compress_workspace); i++)
mempool_exit(&c->compress_workspace[i]);
mempool_exit(&c->compression_bounce[WRITE]);
mempool_exit(&c->compression_bounce[READ]);
}
static int __bch2_fs_compress_init(struct bch_fs *c, u64 features)
{
ZSTD_parameters params = zstd_get_params(zstd_max_clevel(),
c->opts.encoded_extent_max);
c->zstd_workspace_size = zstd_cctx_workspace_bound(&params.cParams);
struct {
unsigned feature;
enum bch_compression_opts type;
size_t compress_workspace;
} compression_types[] = {
{ BCH_FEATURE_lz4, BCH_COMPRESSION_OPT_lz4,
max_t(size_t, LZ4_MEM_COMPRESS, LZ4HC_MEM_COMPRESS) },
{ BCH_FEATURE_gzip, BCH_COMPRESSION_OPT_gzip,
max(zlib_deflate_workspacesize(MAX_WBITS, DEF_MEM_LEVEL),
zlib_inflate_workspacesize()) },
{ BCH_FEATURE_zstd, BCH_COMPRESSION_OPT_zstd,
max(c->zstd_workspace_size,
zstd_dctx_workspace_bound()) },
}, *i;
bool have_compressed = false;
for (i = compression_types;
i < compression_types + ARRAY_SIZE(compression_types);
i++)
have_compressed |= (features & (1 << i->feature)) != 0;
if (!have_compressed)
return 0;
if (!mempool_initialized(&c->compression_bounce[READ]) &&
mempool_init_kvmalloc_pool(&c->compression_bounce[READ],
1, c->opts.encoded_extent_max))
return bch_err_throw(c, ENOMEM_compression_bounce_read_init);
if (!mempool_initialized(&c->compression_bounce[WRITE]) &&
mempool_init_kvmalloc_pool(&c->compression_bounce[WRITE],
1, c->opts.encoded_extent_max))
return bch_err_throw(c, ENOMEM_compression_bounce_write_init);
for (i = compression_types;
i < compression_types + ARRAY_SIZE(compression_types);
i++) {
if (!(features & (1 << i->feature)))
continue;
if (mempool_initialized(&c->compress_workspace[i->type]))
continue;
if (mempool_init_kvmalloc_pool(
&c->compress_workspace[i->type],
1, i->compress_workspace))
return bch_err_throw(c, ENOMEM_compression_workspace_init);
}
return 0;
}
static u64 compression_opt_to_feature(unsigned v)
{
unsigned type = bch2_compression_decode(v).type;
return BIT_ULL(bch2_compression_opt_to_feature[type]);
}
int bch2_fs_compress_init(struct bch_fs *c)
{
u64 f = c->sb.features;
f |= compression_opt_to_feature(c->opts.compression);
f |= compression_opt_to_feature(c->opts.background_compression);
return __bch2_fs_compress_init(c, f);
}
int bch2_opt_compression_parse(struct bch_fs *c, const char *_val, u64 *res,
			       struct printbuf *err)
{
	char *val = kstrdup(_val, GFP_KERNEL);
	char *p = val, *type_str, *level_str;
	struct bch_compression_opt opt = { 0 };
	int ret;

	if (!val)
		return -ENOMEM;

	/* Accepts "type" or "type:level" */
	type_str = strsep(&p, ":");
	level_str = p;

	ret = match_string(bch2_compression_opts, -1, type_str);
	if (ret < 0 && err)
		prt_printf(err, "invalid compression type\n");
	if (ret < 0)
		goto err;

	opt.type = ret;
	if (level_str) {
		unsigned level;

		ret = kstrtouint(level_str, 10, &level);
		if (!ret && !opt.type && level)
			ret = -EINVAL;
		if (!ret && level > 15)
			ret = -EINVAL;
		if (ret < 0 && err)
			prt_printf(err, "invalid compression level\n");
		if (ret < 0)
			goto err;

		opt.level = level;
	}

	*res = bch2_compression_encode(opt);
err:
	kfree(val);
	return ret;
}
void bch2_compression_opt_to_text(struct printbuf *out, u64 v)
{
struct bch_compression_opt opt = bch2_compression_decode(v);
if (opt.type < BCH_COMPRESSION_OPT_NR)
prt_str(out, bch2_compression_opts[opt.type]);
else
prt_printf(out, "(unknown compression opt %u)", opt.type);
if (opt.level)
prt_printf(out, ":%u", opt.level);
}
void bch2_opt_compression_to_text(struct printbuf *out,
struct bch_fs *c,
struct bch_sb *sb,
u64 v)
{
return bch2_compression_opt_to_text(out, v);
}
int bch2_opt_compression_validate(u64 v, struct printbuf *err)
{
if (!bch2_compression_opt_valid(v)) {
prt_printf(err, "invalid compression opt %llu", v);
return -BCH_ERR_invalid_sb_opt_compression;
}
return 0;
}
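
The zstd paths in this file store the exact compressed length in the first 4 bytes of the buffer, because the on-disk size is rounded up to a sector and the decompressor needs the real length. The sketch below illustrates only that framing, with no zstd dependency. The `demo_*` helpers are hypothetical and the payload is treated as opaque bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit length in little-endian byte order, independent of host endianness. */
static void demo_put_le32(uint8_t *p, uint32_t v)
{
	p[0] = v & 0xff;
	p[1] = (v >> 8) & 0xff;
	p[2] = (v >> 16) & 0xff;
	p[3] = (v >> 24) & 0xff;
}

static uint32_t demo_get_le32(const uint8_t *p)
{
	return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
	       ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/*
 * Write `len` payload bytes behind a 4-byte length prefix.
 * Returns the framed size, or 0 if it does not fit in dst.
 */
static size_t demo_frame_put(uint8_t *dst, size_t dst_len,
			     const uint8_t *payload, size_t len)
{
	if (dst_len < 4 || len > dst_len - 4 || len > UINT32_MAX)
		return 0;

	demo_put_le32(dst, (uint32_t) len);
	memcpy(dst + 4, payload, len);
	return len + 4;
}

/*
 * Recover the exact payload length from a sector-padded buffer.
 * Returns 0 and sets *payload and *len on success, -1 if the prefix is bogus.
 */
static int demo_frame_get(const uint8_t *src, size_t src_len,
			  const uint8_t **payload, size_t *len)
{
	uint32_t real_len;

	if (src_len < 4)
		return -1;

	real_len = demo_get_le32(src);
	if (real_len > src_len - 4)	/* mirrors the real_src_len > src_len - 4 check */
		return -1;

	*payload = src + 4;
	*len = real_len;
	return 0;
}
```

Validating the prefix against the buffer size before use is the important part: the length comes from disk, so a corrupt value must be rejected rather than trusted, as `__bio_uncompress()` does with `decompress_zstd_src_len_bad`.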

View File

@ -1,73 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_COMPRESS_H
#define _BCACHEFS_COMPRESS_H
#include "extents_types.h"
static const unsigned __bch2_compression_opt_to_type[] = {
#define x(t, n) [BCH_COMPRESSION_OPT_##t] = BCH_COMPRESSION_TYPE_##t,
BCH_COMPRESSION_OPTS()
#undef x
};
struct bch_compression_opt {
	u8 type:4,
	   level:4;
};

static inline struct bch_compression_opt __bch2_compression_decode(unsigned v)
{
	return (struct bch_compression_opt) {
		.type = v & 15,
		.level = v >> 4,
	};
}

static inline bool bch2_compression_opt_valid(unsigned v)
{
	struct bch_compression_opt opt = __bch2_compression_decode(v);

	return opt.type < ARRAY_SIZE(__bch2_compression_opt_to_type) && !(!opt.type && opt.level);
}

static inline struct bch_compression_opt bch2_compression_decode(unsigned v)
{
	return bch2_compression_opt_valid(v)
		? __bch2_compression_decode(v)
		: (struct bch_compression_opt) { 0 };
}

static inline unsigned bch2_compression_encode(struct bch_compression_opt opt)
{
	return opt.type|(opt.level << 4);
}

static inline enum bch_compression_type bch2_compression_opt_to_type(unsigned v)
{
	return __bch2_compression_opt_to_type[bch2_compression_decode(v).type];
}
struct bch_write_op;
int bch2_bio_uncompress_inplace(struct bch_write_op *, struct bio *);
int bch2_bio_uncompress(struct bch_fs *, struct bio *, struct bio *,
struct bvec_iter, struct bch_extent_crc_unpacked);
unsigned bch2_bio_compress(struct bch_fs *, struct bio *, size_t *,
struct bio *, size_t *, unsigned);
int bch2_check_set_has_compressed_data(struct bch_fs *, unsigned);
void bch2_fs_compress_exit(struct bch_fs *);
int bch2_fs_compress_init(struct bch_fs *);
void bch2_compression_opt_to_text(struct printbuf *, u64);
int bch2_opt_compression_parse(struct bch_fs *, const char *, u64 *, struct printbuf *);
void bch2_opt_compression_to_text(struct printbuf *, struct bch_fs *, struct bch_sb *, u64);
int bch2_opt_compression_validate(u64, struct printbuf *);
#define bch2_opt_compression (struct bch_opt_fn) { \
.parse = bch2_opt_compression_parse, \
.to_text = bch2_opt_compression_to_text, \
.validate = bch2_opt_compression_validate, \
}
#endif /* _BCACHEFS_COMPRESS_H */
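
The header above packs a compression option into one byte: the algorithm in the low 4 bits and the level in the high 4 bits, with any invalid value decoding to "none". A small userspace sketch of that round trip follows. It assumes four algorithm types (none, lz4, gzip, zstd) and masks the level explicitly rather than relying on bitfield truncation; the `demo_*` names are hypothetical.

```c
#include <stdbool.h>

#define DEMO_NR_TYPES 4u	/* assumed count: none, lz4, gzip, zstd */

struct demo_compression_opt {
	unsigned type:4,
		 level:4;
};

/* Pack type into bits 0-3 and level into bits 4-7. */
static unsigned demo_encode(struct demo_compression_opt opt)
{
	return opt.type | (opt.level << 4);
}

static struct demo_compression_opt demo_decode_raw(unsigned v)
{
	struct demo_compression_opt opt = {
		.type = v & 15,
		.level = (v >> 4) & 15,
	};

	return opt;
}

/* Valid if the type is known and "none" does not carry a level. */
static bool demo_valid(unsigned v)
{
	struct demo_compression_opt opt = demo_decode_raw(v);

	return opt.type < DEMO_NR_TYPES && !(!opt.type && opt.level);
}

/* Invalid encodings fall back to type 0, level 0. */
static struct demo_compression_opt demo_decode(unsigned v)
{
	struct demo_compression_opt none = { 0 };

	return demo_valid(v) ? demo_decode_raw(v) : none;
}
```

Under these assumptions, type 3 at level 9 encodes to `0x93`, while `0x10` (a level with no algorithm) is rejected and decodes to the all-zero option.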

View File

@ -1,38 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/log2.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include "darray.h"
int __bch2_darray_resize_noprof(darray_char *d, size_t element_size, size_t new_size, gfp_t gfp)
{
	if (new_size > d->size) {
		new_size = roundup_pow_of_two(new_size);

		/*
		 * This is a workaround: kvmalloc() doesn't support > INT_MAX
		 * allocations, but vmalloc() does.
		 * The limit needs to be lifted from kvmalloc(), and when it is
		 * we'll go back to just using that.
		 */
		size_t bytes;
		if (unlikely(check_mul_overflow(new_size, element_size, &bytes)))
			return -ENOMEM;

		void *data = likely(bytes < INT_MAX)
			? kvmalloc_noprof(bytes, gfp)
			: vmalloc_noprof(bytes);
		if (!data)
			return -ENOMEM;

		if (d->size)
			memcpy(data, d->data, d->size * element_size);
		if (d->data != d->preallocated)
			kvfree(d->data);
		d->data = data;
		d->size = new_size;
	}

	return 0;
}
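
The resize routine above rounds the requested capacity up to a power of two and checks the `capacity * element_size` multiplication for overflow before allocating. A minimal sketch of just that size computation, assuming a GCC or Clang toolchain (`__builtin_mul_overflow()` stands in for the kernel's `check_mul_overflow()`); the `demo_*` names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Round up to the next power of two; returns 0 if that would overflow size_t. */
static size_t demo_roundup_pow2(size_t n)
{
	size_t p = 1;

	while (p < n) {
		if (p > SIZE_MAX / 2)
			return 0;
		p *= 2;
	}
	return p;
}

/*
 * Compute the allocation size for growing to at least `want` elements.
 * Returns 0 with *capacity and *bytes valid, or -1 on arithmetic overflow
 * (in which case the outputs must not be used).
 */
static int demo_grow_size(size_t want, size_t element_size,
			  size_t *capacity, size_t *bytes)
{
	size_t cap = demo_roundup_pow2(want);

	if (!cap)
		return -1;
	if (__builtin_mul_overflow(cap, element_size, bytes))
		return -1;

	*capacity = cap;
	return 0;
}
```

Growing to powers of two keeps the number of reallocations logarithmic in the final size, and doing the overflow check before the allocation means a huge request fails cleanly instead of allocating a wrapped, too-small buffer.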

Some files were not shown because too many files have changed in this diff