Commit 36e9f71b authored by Gregory Price's avatar Gregory Price Committed by Dave Jiang
Browse files

cxl: docs/linux/dax-driver documentation



Add documentation on how the CXL driver interacts with the DAX driver.

Signed-off-by: default avatarGregory Price <gourry@gourry.net>
Link: https://patch.msgid.link/20250512162134.3596150-12-gourry@gourry.net


Signed-off-by: default avatarDave Jiang <dave.jiang@intel.com>
parent ef3a43a6
Loading
Loading
Loading
Loading
+1 −0
Original line number Diff line number Diff line
@@ -36,6 +36,7 @@ that have impacts on each other. The docs here break up configurations steps.
   linux/overview
   linux/early-boot
   linux/cxl-driver
   linux/dax-driver
   linux/access-coordinates


+105 −10
Original line number Diff line number Diff line
@@ -34,6 +34,32 @@ into a single memory region. The memory region has been converted to dax. ::
    decoder1.0   decoder5.0  endpoint5   port1  region0
    decoder2.0   decoder5.1  endpoint6   port2  root0


.. kernel-render:: DOT
   :alt: Digraph of CXL fabric describing host-bridge interleaving
   :caption: Diagraph of CXL fabric with a host-bridge interleave memory region

   digraph foo {
     "root0" -> "port1";
     "root0" -> "port3";
     "root0" -> "decoder0.0";
     "port1" -> "endpoint5";
     "port3" -> "endpoint6";
     "port1" -> "decoder1.0";
     "port3" -> "decoder3.0";
     "endpoint5" -> "decoder5.0";
     "endpoint6" -> "decoder6.0";
     "decoder0.0" -> "region0";
     "decoder0.0" -> "decoder1.0";
     "decoder0.0" -> "decoder3.0";
     "decoder1.0" -> "decoder5.0";
     "decoder3.0" -> "decoder6.0";
     "decoder5.0" -> "region0";
     "decoder6.0" -> "region0";
     "region0" -> "dax_region0";
     "dax_region0" -> "dax0.0";
   }

For this section we'll explore the devices present in this configuration, but
we'll explore more configurations in-depth in example configurations below.

@@ -41,7 +67,7 @@ Base Devices
------------
Most devices in a CXL fabric are a `port` of some kind (because each
device mostly routes request from one device to the next, rather than
provide a manageable service).
provide a direct service).

Root
~~~~
@@ -53,6 +79,8 @@ The Root contains links to:

* `Host Bridge Ports` defined by ACPI CEDT CHBS.

* `Downstream Ports` typically connected to `Host Bridge Ports`.

* `Root Decoders` defined by ACPI CEDT CFMWS.

::
@@ -150,6 +178,27 @@ device configuration data. ::
    driver    label_storage_size  pmem         serial
    firmware  numa_node           ram          subsystem

A Memory Device is a discrete base object that is not a port.  While the
physical device it belongs to may also host an `endpoint`, the relationship
between an `endpoint` and a `memdev` is not captured in sysfs.

Port Relationships
~~~~~~~~~~~~~~~~~~
In our example described above, there are four host bridges attached to the
root, and two of the host bridges have one endpoint attached.

.. kernel-render:: DOT
   :alt: Digraph of CXL fabric describing host-bridge interleaving
   :caption: Diagraph of CXL fabric with a host-bridge interleave memory region

   digraph foo {
     "root0"    -> "port1";
     "root0"    -> "port2";
     "root0"    -> "port3";
     "root0"    -> "port4";
     "port1" -> "endpoint5";
     "port3" -> "endpoint6";
   }

Decoders
--------
@@ -322,6 +371,29 @@ settings (granularity and ways must be the same).
Endpoint decoders are created during :code:`cxl_endpoint_port_probe` in the
:code:`cxl_port` driver, and is created based on a PCI device's DVSEC registers.

Decoder Relationships
~~~~~~~~~~~~~~~~~~~~~
In our example described above, there is one root decoder which routes memory
accesses over two host bridges.  Each host bridge has a decoder which routes
access to their singular endpoint targets.  Each endpoint has a decoder which
translates HPA to DPA and services the memory request.

The driver validates relationships between ports by decoder programming, so
we can think of decoders being related in a similarly hierarchical fashion to
ports.

.. kernel-render:: DOT
   :alt: Digraph of hierarchical relationship between root, switch, and endpoint decoders.
   :caption: Diagraph of CXL root, switch, and endpoint decoders.

   digraph foo {
     "root0"    -> "decoder0.0";
     "decoder0.0" -> "decoder1.0";
     "decoder0.0" -> "decoder3.0";
     "decoder1.0" -> "decoder5.0";
     "decoder3.0" -> "decoder6.0";
   }

Regions
-------

@@ -348,6 +420,17 @@ The interleave settings in a `Memory Region` describe the configuration of the
`Interleave Set` - and are what can be expected to be seen in the endpoint
interleave settings.

.. kernel-render:: DOT
   :alt: Digraph of CXL memory region relationships between root and endpoint decoders.
   :caption: Regions are created based on root decoder configurations. Endpoint decoders
             must be programmed with the same interleave settings as the region.

   digraph foo {
     "root0"    -> "decoder0.0";
     "decoder0.0" -> "region0";
     "region0" -> "decoder5.0";
     "region0" -> "decoder6.0";
   }

DAX Region
~~~~~~~~~~
@@ -360,7 +443,6 @@ for more details. ::
    dax0.0      devtype  modalias   uevent
    dax_region  driver   subsystem


Mailbox Interfaces
------------------
A mailbox command interface for each device is exposed in ::
@@ -418,17 +500,30 @@ the relationships between a decoder and it's parent.

For example, in a `Cross-Link First` interleave setup with 16 endpoints
attached to 4 host bridges, linux expects the following ways/granularity
across the root, host bridge, and endpoints respectively. ::
across the root, host bridge, and endpoints respectively.

.. flat-table:: 4x4 cross-link first interleave settings

  * - decoder
    - ways
    - granularity

                   ways   granularity
  root              4        256
  host bridge       4       1024
  endpoint         16        256
  * - root
    - 4
    - 256

  * - host bridge
    - 4
    - 1024

  * - endpoint
    - 16
    - 256

At the root, every a given access will be routed to the
:code:`((HPA / 256) % 4)th` target host bridge. Within a host bridge, every
:code:`((HPA / 1024) % 4)th` target endpoint.  Each endpoint will translate
the access based on the entire 16 device interleave set.
:code:`((HPA / 1024) % 4)th` target endpoint.  Each endpoint translates based
on the entire 16 device interleave set.

Unbalanced interleave sets are not supported - decoders at a similar point
in the hierarchy (e.g. all host bridge decoders) must have the same ways and
@@ -467,7 +562,7 @@ In this example, the CFMWS defines two discrete non-interleaved 4GB regions
for each host bridge, and one interleaved 8GB region that targets both. This
would result in 3 root decoders presenting in the root. ::

  # ls /sys/bus/cxl/devices/root0
  # ls /sys/bus/cxl/devices/root0/decoder*
    decoder0.0  decoder0.1  decoder0.2

  # cat /sys/bus/cxl/devices/decoder0.0/target_list start size
+43 −0
Original line number Diff line number Diff line
.. SPDX-License-Identifier: GPL-2.0

====================
DAX Driver Operation
====================
The `Direct Access Device` driver was originally designed to provide a
memory-like access mechanism to memory-like block-devices.  It was
extended to support CXL Memory Devices, which provide user-configured
memory devices.

The CXL subsystem depends on the DAX subsystem to either:

- Generate a file-like interface to userland via :code:`/dev/daxN.Y`, or
- Engage the memory-hotplug interface to add CXL memory to page allocator.

The DAX subsystem exposes this ability through the `cxl_dax_region` driver.
A `dax_region` provides the translation between a CXL `memory_region` and
a `DAX Device`.

DAX Device
==========
A `DAX Device` is a file-like interface exposed in :code:`/dev/daxN.Y`. A
memory region exposed via dax device can be accessed via userland software
via the :code:`mmap()` system-call.  The result is direct mappings to the
CXL capacity in the task's page tables.

Users wishing to manually handle allocation of CXL memory should use this
interface.

kmem conversion
===============
The :code:`dax_kmem` driver converts a `DAX Device` into a series of `hotplug
memory blocks` managed by :code:`kernel/memory-hotplug.c`.  This capacity
will be exposed to the kernel page allocator in the user-selected memory
zone.

The :code:`memmap_on_memory` setting (both global and DAX device local)
dictates where the kernell will allocate the :code:`struct folio` descriptors
for this memory will come from.  If :code:`memmap_on_memory` is set, memory
hotplug will set aside a portion of the memory block capacity to allocate
folios. If unset, the memory is allocated via a normal :code:`GFP_KERNEL`
allocation - and as a result will most likely land on the local NUM node of the
CPU executing the hotplug operation.