Commit 7dbec0bb authored by Linus Torvalds

Merge tag 'for-6.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mikulas Patocka:

 - a new dm-pcache target for read/write caching on persistent memory

 - fix typos in docs

 - misc small refactoring

 - mark dm-error with DM_TARGET_PASSES_INTEGRITY

 - dm-request-based: fix NULL pointer dereference and quiesce_depth out of sync

 - dm-linear: optimize REQ_PREFLUSH

 - dm-vdo: return error on corrupted metadata

 - dm-integrity: support asynchronous hash interface

* tag 'for-6.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (27 commits)
  dm raid: use proper md_ro_state enumerators
  dm-integrity: prefer synchronous hash interface
  dm-integrity: enable asynchronous hash interface
  dm-integrity: rename internal_hash
  dm-integrity: add the "offset" argument
  dm-integrity: allocate the recalculate buffer with kmalloc
  dm-integrity: introduce integrity_kmap and integrity_kunmap
  dm-integrity: replace bvec_kmap_local with kmap_local_page
  dm-integrity: use internal variable for digestsize
  dm vdo: return error on corrupted metadata in start_restoring_volume functions
  dm vdo: Update code to use mem_is_zero
  dm: optimize REQ_PREFLUSH with data when using the linear target
  dm-pcache: use int type to store negative error codes
  dm: fix "writen"->"written"
  dm-pcache: cleanup: fix coding style report by checkpatch.pl
  dm-pcache: remove ctrl_lock for pcache_cache_segment
  dm: fix NULL pointer dereference in __dm_suspend()
  dm: fix queue start/stop imbalance under suspend/load/resume races
  dm-pcache: add persistent cache target in device-mapper
  dm error: mark as DM_TARGET_PASSES_INTEGRITY
  ...
parents 2ccb4d20 55dcfdf8
+4 −4
@@ -3,7 +3,7 @@ dm-delay
========

Device-Mapper's "delay" target delays reads and/or writes
and/or flushs and optionally maps them to different devices.
and/or flushes and optionally maps them to different devices.

Arguments::

@@ -18,7 +18,7 @@ Table line has to either have 3, 6 or 9 arguments:
   to write and flush operations on optionally different write_device with
   optionally different sector offset

9: same as 6 arguments plus define flush_offset and flush_delay explicitely
9: same as 6 arguments plus define flush_offset and flush_delay explicitly
   on/with optionally different flush_device/flush_offset.

Offsets are specified in sectors.
@@ -40,7 +40,7 @@ Example scripts
	#!/bin/sh
	#
	# Create mapped device delaying write and flush operations for 400ms and
	# splitting reads to device $1 but writes and flushs to different device $2
	# splitting reads to device $1 but writes and flushes to different device $2
	# to different offsets of 2048 and 4096 sectors respectively.
	#
	dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 2048 0 $2 4096 400"
@@ -48,7 +48,7 @@ Example scripts
::
	#!/bin/sh
	#
	# Create mapped device delaying reads for 50ms, writes for 100ms and flushs for 333ms
	# Create mapped device delaying reads for 50ms, writes for 100ms and flushes for 333ms
	# onto the same backing device at offset 0 sectors.
	#
	dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 0 50 $2 0 100 $1 0 333"
+202 −0
.. SPDX-License-Identifier: GPL-2.0

=================================
dm-pcache — Persistent Cache
=================================

*Author: Dongsheng Yang <dongsheng.yang@linux.dev>*

This document describes *dm-pcache*, a Device-Mapper target that lets a
byte-addressable *DAX* (persistent-memory, “pmem”) region act as a
high-performance, crash-persistent cache in front of a slower block
device.  The code lives in ``drivers/md/dm-pcache/``.

Quick feature summary
=====================

* *Write-back* caching (only mode currently supported).
* *16 MiB segments* allocated on the pmem device.
* *Data CRC32* verification (optional, per cache).
* Crash-safe: every metadata structure is duplicated
  (``PCACHE_META_INDEX_MAX == 2``) and protected with CRC+sequence numbers.
* *Multi-tree indexing* (indexing trees sharded by logical address) for high
  pmem parallelism.
* Pure *DAX path* I/O – no extra BIO round-trips.
* *Log-structured write-back* that preserves backend crash-consistency.


Constructor
===========

::

    pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>]

=========================  ====================================================
``cache_dev``               Any DAX-capable block device (``/dev/pmem0``…).
                            All metadata *and* cached blocks are stored here.

``backing_dev``             The slow block device to be cached.

``cache_mode``              Optional. Only ``writeback`` is accepted at the
                            moment.

``data_crc``                Optional; defaults to ``false``.

                            * ``true``  – store CRC32 for every cached entry
                              and verify on reads
                            * ``false`` – skip CRC (faster)
=========================  ====================================================

Example
-------

.. code-block:: shell

   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

The first time a pmem device is used, dm-pcache formats it automatically
(super-block, cache_info, etc.).
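
The ``<number_of_optional_arguments>`` field counts every word that follows
it, keys and values alike, which is why the example above passes ``4`` for two
key/value pairs.  The sketch below derives that count automatically.
``pcache_table`` is a hypothetical helper, not part of dmsetup or dm-pcache,
and it assumes that the optional group may be omitted entirely, as the square
brackets in the constructor syntax suggest.

```shell
#!/bin/sh
# Sketch only: pcache_table is a hypothetical helper, not part of dmsetup.
# Usage: pcache_table <sectors> <cache_dev> <backing_dev> [<key> <value>]...
pcache_table() {
    sectors=$1 cache_dev=$2 backing_dev=$3
    shift 3
    if [ "$#" -eq 0 ]; then
        # No optional arguments: emit only the mandatory part (assumed valid).
        printf '0 %s pcache %s %s\n' "$sectors" "$cache_dev" "$backing_dev"
    else
        # "$#" counts every remaining word, keys and values alike.
        printf '0 %s pcache %s %s %s %s\n' \
            "$sectors" "$cache_dev" "$backing_dev" "$#" "$*"
    fi
}
```

A possible invocation, reusing the devices from the example above, would be
``dmsetup create pcache_sdb --table "$(pcache_table "$(blockdev --getsz /dev/sdb)" /dev/pmem0 /dev/sdb cache_mode writeback data_crc true)"``.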


Status line
===========

``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints:

::

   <sb_flags> <seg_total> <cache_segs> <segs_used> \
   <gc_percent> <cache_flags> \
   <key_head_seg>:<key_head_off> \
   <dirty_tail_seg>:<dirty_tail_off> \
   <key_tail_seg>:<key_tail_off>

Field meanings
--------------

===============================  =============================================
``sb_flags``                     Super-block flags (e.g. endian marker).

``seg_total``                    Number of physical *pmem* segments.

``cache_segs``                   Number of segments used for cache.

``segs_used``                    Segments currently allocated (bitmap weight).

``gc_percent``                   Current GC high-water mark (0-90).

``cache_flags``                  Bit 0 – DATA_CRC enabled
                                 Bit 1 – INIT_DONE (cache initialised)
                                 Bits 2-5 – cache mode (0 == WB).

``key_head``                     Where new key-sets are being written.

``dirty_tail``                   First dirty key-set that still needs
                                 write-back to the backing device.

``key_tail``                     First key-set that may be reclaimed by GC.
===============================  =============================================
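
The field layout above can be decoded with a few lines of shell.  This is a
minimal sketch under stated assumptions: ``pcache_decode_status`` is a
hypothetical helper, it expects only the nine target-specific fields (the
``<start> <length> pcache`` prefix that ``dmsetup status`` prints must be
stripped first), and it assumes ``cache_flags`` is printed as a decimal or
``0x``-prefixed number.  If the driver prints bare hexadecimal, prefix the
field with ``0x`` before decoding.

```shell
#!/bin/sh
# Sketch only: pcache_decode_status is a hypothetical helper.
# Argument: one string holding the nine status fields described above.
pcache_decode_status() {
    # Unquoted on purpose: split the status string into positional fields.
    set -- $1
    [ "$#" -eq 9 ] || { echo "expected 9 fields, got $#" >&2; return 1; }
    flags=$6   # assumed decimal or 0x-prefixed
    echo "seg_total=$2 cache_segs=$3 segs_used=$4 gc_percent=$5"
    echo "data_crc=$(( flags & 1 ))"             # bit 0
    echo "init_done=$(( (flags >> 1) & 1 ))"     # bit 1
    echo "cache_mode=$(( (flags >> 2) & 15 ))"   # bits 2-5, 0 == write-back
    echo "key_head_seg=${7%%:*} key_head_off=${7#*:}"
}
```

One way to feed it live data would be
``pcache_decode_status "$(dmsetup status pcache_sdb | cut -d' ' -f4-)"``.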


Messages
========

*Change GC trigger*

::

   dmsetup message <dev> 0 gc_percent <0-90>
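
Because the message only accepts values from 0 to 90, a script may want to
check the value before sending it.  The wrapper below is a sketch:
``pcache_set_gc`` and the ``DRY_RUN`` variable are hypothetical names
introduced for illustration, and how the driver itself reports an
out-of-range value is not covered here.

```shell
#!/bin/sh
# Sketch only: pcache_set_gc is a hypothetical wrapper around the message
# shown above. With DRY_RUN=1 it prints the command instead of running it.
pcache_set_gc() {
    dev=$1 pct=$2
    case $pct in
        ''|*[!0-9]*) echo "gc_percent must be an integer" >&2; return 1 ;;
    esac
    if [ "$pct" -gt 90 ]; then
        echo "gc_percent must be in 0-90" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "dmsetup message $dev 0 gc_percent $pct"
    else
        dmsetup message "$dev" 0 gc_percent "$pct"
    fi
}
```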


Theory of operation
===================

Sub-devices
-----------

====================  =========================================================
backing_dev             Any block device (SSD/HDD/loop/LVM, etc.).
cache_dev               DAX device; must expose direct-access memory.
====================  =========================================================

Segments and key-sets
---------------------

* The pmem space is divided into *16 MiB segments*.
* Each write allocates space from a per-CPU *data_head* inside a segment.
* A *cache-key* records a logical range on the origin and where it lives
  inside pmem (segment + offset + generation).
* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem
  and are themselves crash-safe (CRC).
* The pair *(key_tail, dirty_tail)* delimits clean/dirty and live/dead ksets.
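
As a rough worked example of the segment arithmetic, the number of 16 MiB
segments that fit in a pmem device is bounded by its size divided by
16 MiB.  ``pmem_max_segments`` is a hypothetical helper, and the result is
only an upper bound: the space dm-pcache reserves for its own metadata is not
specified here and will lower the real figure.

```shell
#!/bin/sh
# Sketch only: upper bound on the number of 16 MiB segments in a device.
# Argument: device size in bytes. Metadata overhead is ignored (assumption).
pmem_max_segments() {
    seg_bytes=$(( 16 * 1024 * 1024 ))
    echo $(( $1 / seg_bytes ))
}
```

For a live device the size could be taken from
``blockdev --getsize64 /dev/pmem0``.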

Write-back
----------

Dirty keys are queued into a tree; a background worker copies data
back to the backing_dev and advances *dirty_tail*.  A FLUSH/FUA bio from the
upper layers forces an immediate metadata commit.

Garbage collection
------------------

GC starts when ``segs_used >= seg_total * gc_percent / 100``.  It walks
from *key_tail*, frees segments whose every key has been invalidated, and
advances *key_tail*.
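
The trigger condition can be checked from user space with the values
reported in the status line.  This is a sketch, not the driver's code:
``pcache_gc_due`` is a hypothetical helper, and it assumes the threshold is
computed with truncating integer division, which may differ from the kernel's
rounding.

```shell
#!/bin/sh
# Sketch only: evaluates segs_used >= seg_total * gc_percent / 100
# using integer arithmetic (rounding behaviour is an assumption).
# Usage: pcache_gc_due <segs_used> <seg_total> <gc_percent>
pcache_gc_due() {
    segs_used=$1 seg_total=$2 gc_percent=$3
    [ "$segs_used" -ge $(( seg_total * gc_percent / 100 )) ]
}
```

With 128 segments and ``gc_percent`` set to 70, the threshold evaluates to
89 segments, so ``pcache_gc_due 90 128 70`` succeeds while
``pcache_gc_due 50 128 70`` does not.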

CRC verification
----------------

If ``data_crc`` is enabled, dm-pcache computes a CRC32 over every cached data
range when it is inserted and stores it in the on-media key.  Reads
validate the CRC before copying to the caller.


Failure handling
================

* *pmem media errors* – all metadata copies are read with
  ``copy_mc_to_kernel``; an uncorrectable error is logged and initialisation
  is aborted.
* *Cache full* – if no free segment can be found, writes return ``-EBUSY``;
  dm-pcache retries internally (request deferral).
* *System crash* – on attach, the driver replays ksets from *key_tail* to
  rebuild the in-core trees; every segment’s generation guards against
  use-after-free keys.


Limitations & TODO
==================

* Only *write-back* mode is supported; other modes are planned.
* Only FIFO cache invalidation is supported; other policies (LRU, ARC...)
  are planned.
* Table reload is not currently supported.
* Discard support is planned.


Example workflow
================

.. code-block:: shell

   # 1.  Create devices
   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

   # 2.  Put a filesystem on top
   mkfs.ext4 /dev/mapper/pcache_sdb
   mount /dev/mapper/pcache_sdb /mnt

   # 3.  Tune GC threshold to 80 %
   dmsetup message pcache_sdb 0 gc_percent 80

   # 4.  Observe status
   watch -n1 'dmsetup status pcache_sdb'

   # 5.  Shutdown
   umount /mnt
   dmsetup remove pcache_sdb


``dm-pcache`` is under active development; feedback, bug reports and patches
are very welcome!
+1 −0
@@ -18,6 +18,7 @@ Device Mapper
    dm-integrity
    dm-io
    dm-log
    dm-pcache
    dm-queue-length
    dm-raid
    dm-service-time
+1 −0
.. SPDX-License-Identifier: GPL-2.0-only

======
dm-vdo
======

+8 −0
@@ -7133,6 +7133,14 @@ S: Maintained
F:	Documentation/admin-guide/device-mapper/vdo*.rst
F:	drivers/md/dm-vdo/

DEVICE-MAPPER PCACHE TARGET
M:	Dongsheng Yang <dongsheng.yang@linux.dev>
M:	Zheng Gu <cengku@gmail.com>
L:	dm-devel@lists.linux.dev
S:	Maintained
F:	Documentation/admin-guide/device-mapper/dm-pcache.rst
F:	drivers/md/dm-pcache/

DEVLINK
M:	Jiri Pirko <jiri@resnulli.us>
L:	netdev@vger.kernel.org