Commit 58dfd959 authored by Dave Jiang

Merge branch 'for-6.16/cxl-docs' into cxl-for-next

Detailed documentation for the entire CXL sub-system from platform, BIOS,
to CXL driver, memory interface, memory hotplug, and others.
parents a223ce19 dba600d0
.. SPDX-License-Identifier: GPL-2.0

===========
DAX Devices
===========
CXL capacity exposed as a DAX device can be accessed directly via mmap.
Users may wish to use this interface mechanism to write their own userland
CXL allocator, or to manage shared or persistent memory regions across multiple
hosts.

If the capacity is shared across hosts or persistent, appropriate flushing
mechanisms must be employed unless the region supports Snoop Back-Invalidate.

Note that mappings must be aligned (size and base) to the dax device's base
alignment, which is typically 2MB - but may be configured larger.

::

  #include <stdio.h>
  #include <stdlib.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <fcntl.h>
  #include <unistd.h>

  #define DEVICE_PATH "/dev/dax0.0" /* Replace with your DAX device path */
  #define DEVICE_SIZE (4ULL * 1024 * 1024 * 1024) /* 4GB */

  int main(void) {
      int fd;
      void* mapped_addr;

      /* Open the DAX device */
      fd = open(DEVICE_PATH, O_RDWR);
      if (fd < 0) {
          perror("open");
          return -1;
      }

      /* Map the device into memory */
      mapped_addr = mmap(NULL, DEVICE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
      if (mapped_addr == MAP_FAILED) {
          perror("mmap");
          close(fd);
          return -1;
      }

      printf("Mapped address: %p\n", mapped_addr);

      /* You can now access the device through the mapped address */
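      uint64_t* ptr = (uint64_t*)mapped_addr;
      *ptr = 0x1234567890abcdefULL; /* Write a value to the device */
      /* Cast so the arguments match the %p and %llx conversions exactly */
      printf("Value at address %p: 0x%016llx\n", (void*)ptr,
             (unsigned long long)*ptr);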

      /* Clean up */
      munmap(mapped_addr, DEVICE_SIZE);
      close(fd);
      return 0;
  }
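
The alignment requirement noted above can be checked before calling
:code:`mmap`.  The following is a minimal sketch, assuming the typical 2MB
alignment; the helper names :code:`dax_is_aligned` and :code:`dax_align_down`
and the :code:`DAX_ALIGN_2M` constant are hypothetical and not part of any
kernel or library interface.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed typical device-dax alignment (2MB); a device may be configured larger. */
#define DAX_ALIGN_2M (2ULL * 1024 * 1024)

/* Return nonzero if value is a multiple of align (align must be a power of two). */
static int dax_is_aligned(uint64_t value, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value & (align - 1)) == 0;
}

/* Round len down to the nearest multiple of align (a power of two). */
static uint64_t dax_align_down(uint64_t len, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return len & ~(align - 1);
}
```

A caller could round the requested mapping length with
:code:`dax_align_down` and verify the file offset with :code:`dax_is_aligned`
before passing both to :code:`mmap`.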
+32 −0
Original line number Diff line number Diff line
.. SPDX-License-Identifier: GPL-2.0

==========
Huge Pages
==========

Contiguous Memory Allocator
===========================
CXL Memory onlined as SystemRAM during early boot is eligible for use by CMA,
as the NUMA node hosting that capacity will be `Online` at the time CMA
carves out contiguous capacity.

CXL Memory deferred to the CXL Driver for configuration cannot have its
capacity allocated by CMA - as the NUMA node hosting the capacity is `Offline`
at :code:`__init` time - when CMA carves out contiguous capacity.

HugeTLB
=======
Different huge page sizes allow different memory configurations.

2MB Huge Pages
--------------
All CXL capacity regardless of configuration time or memory zone is eligible
for use as 2MB huge pages.

1GB Huge Pages
--------------
CXL capacity onlined in :code:`ZONE_NORMAL` is eligible for 1GB Gigantic Page
allocation.

CXL capacity onlined in :code:`ZONE_MOVABLE` is not eligible for 1GB Gigantic
Page allocation.
.. SPDX-License-Identifier: GPL-2.0

==================
The Page Allocator
==================

The kernel page allocator services all general page allocation requests, such
as :code:`kmalloc`.  CXL configuration steps affect the behavior of the page
allocator based on the selected `Memory Zone` and `NUMA node` the capacity is
placed in.

This section mostly focuses on how these configurations affect the page
allocator (as of Linux v6.15) rather than the overall page allocator behavior.

NUMA nodes and mempolicy
========================
Unless a task explicitly registers a mempolicy, the default memory policy
of the Linux kernel is to allocate memory from the `local NUMA node` first,
and fall back to other nodes only if the local node is pressured.

Generally, we expect to see local DRAM and CXL memory on separate NUMA nodes,
with the CXL memory being non-local.  Technically, however, it is possible
for a compute node to have no local DRAM, and for CXL memory to be the
`local` capacity for that compute node.


Memory Zones
============
CXL capacity may be onlined in :code:`ZONE_NORMAL` or :code:`ZONE_MOVABLE`.

As of v6.15, the page allocator attempts to allocate from the highest
available and compatible ZONE for an allocation from the local node first.

An example of a `zone incompatibility` is attempting to service an allocation
marked :code:`GFP_KERNEL` from :code:`ZONE_MOVABLE`.  Kernel allocations are
typically not migratable, and as a result can only be serviced from
:code:`ZONE_NORMAL` or lower.

To simplify this, the page allocator will prefer :code:`ZONE_MOVABLE` over
:code:`ZONE_NORMAL` by default, but if :code:`ZONE_MOVABLE` is depleted, it
will fall back to allocating from :code:`ZONE_NORMAL`.


Zone and Node Quirks
====================
Let's consider a configuration where the local DRAM capacity is largely onlined
into :code:`ZONE_NORMAL`, with no :code:`ZONE_MOVABLE` capacity present. The
CXL capacity has the opposite configuration - all onlined in
:code:`ZONE_MOVABLE`.

Under the default allocation policy, the page allocator will completely skip
:code:`ZONE_MOVABLE` as a valid allocation target.  This is because, as of
Linux v6.15, the page allocator does (approximately) the following: ::

  for (each zone in local_node):

    for (each node in fallback_order):

      attempt_allocation(gfp_flags);

Because the local node does not have :code:`ZONE_MOVABLE`, the CXL node is
functionally unreachable for direct allocation.  As a result, the only way
for CXL capacity to be used is via `demotion` in the reclaim path.

This configuration also means that if the DRAM node has :code:`ZONE_MOVABLE`
capacity - when that capacity is depleted, the page allocator will actually
prefer CXL :code:`ZONE_MOVABLE` pages over DRAM :code:`ZONE_NORMAL` pages.

We may wish to invert this priority in future Linux versions.

If `demotion` and `swap` are disabled, Linux will begin to cause OOM crashes
when the DRAM nodes are depleted. See the reclaim section for more details.


CGroups and CPUSets
===================
Finally, assuming CXL memory is reachable via the page allocator (i.e. onlined
in :code:`ZONE_NORMAL`), the :code:`cpusets.mems_allowed` may be used by
containers to limit the accessibility of certain NUMA nodes for tasks in that
container.  Users may wish to utilize this in multi-tenant systems where some
tasks prefer not to use slower memory.

In the reclaim section we'll discuss some limitations of this interface to
prevent demotions of shared data to CXL memory (if demotions are enabled).
.. SPDX-License-Identifier: GPL-2.0

=======
Reclaim
=======
Another way CXL memory can be utilized *indirectly* is via the reclaim system
in :code:`mm/vmscan.c`.  Reclaim is engaged when memory capacity on the system
becomes pressured based on global and cgroup-local `watermark` settings.

In this section we won't discuss the `watermark` configurations, just how CXL
memory can be consumed by various pieces of the reclaim system.

Demotion
========
By default, the reclaim system will prefer swap (or zswap) when reclaiming
memory.  Enabling :code:`/sys/kernel/mm/numa/demotion_enabled` will cause vmscan
to opportunistically prefer distant NUMA nodes to swap or zswap, if capacity
is available.

Demotion engages the :code:`mm/memory-tiers.c` component to determine the
next demotion node.  The next demotion node is based on the :code:`HMAT`
or :code:`CDAT` performance data.
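
Whether demotion is enabled can be checked from userspace by reading the knob.
The sketch below assumes the knob is exposed at
:code:`/sys/kernel/mm/numa/demotion_enabled` and reports :code:`true` or
:code:`false` (it also accepts :code:`1` and :code:`0`); the helper name
:code:`read_demotion_enabled` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed sysfs location of the demotion knob. */
#define DEMOTION_KNOB "/sys/kernel/mm/numa/demotion_enabled"

/* Returns 1 if enabled, 0 if disabled, -1 if unreadable or unrecognized. */
static int read_demotion_enabled(const char *path)
{
    char buf[16] = { 0 };
    FILE *f;

    assert(path != NULL);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    buf[strcspn(buf, "\n")] = '\0'; /* strip trailing newline */
    if (!strcmp(buf, "true") || !strcmp(buf, "1"))
        return 1;
    if (!strcmp(buf, "false") || !strcmp(buf, "0"))
        return 0;
    return -1;
}
```

A caller would typically invoke :code:`read_demotion_enabled(DEMOTION_KNOB)`
and treat a negative result as "unknown" rather than "disabled".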

cpusets.mems_allowed quirk
--------------------------
In Linux v6.15 and below, demotion does not respect :code:`cpusets.mems_allowed`
when migrating pages.  As a result, if demotion is enabled, vmscan cannot
guarantee isolation of a container's memory from nodes not set in mems_allowed.

In Linux v6.XX and up, demotion does attempt to respect
:code:`cpusets.mems_allowed`; however, certain classes of shared memory
originally instantiated by another cgroup (such as common libraries - e.g.
libc) may still be demoted.  As a result, the mems_allowed interface still
cannot provide perfect isolation from the remote nodes.

ZSwap and Node Preference
=========================
In Linux v6.15 and below, ZSwap allocates memory from the local node of the
processor for the new pages being compressed.  Since pages being compressed
are typically cold, the result is that a cold page is effectively promoted -
only to be demoted again later as it ages off the LRU.

In Linux v6.XX, ZSwap tries to prefer the node of the page being compressed
as the allocation target for the compression page.  This helps prevent
thrashing.

Demotion with ZSwap
===================
When enabling both Demotion and ZSwap, you create a situation where ZSwap
will prefer the slowest form of CXL memory by default until that tier of
memory is exhausted.
.. SPDX-License-Identifier: GPL-2.0

=====================
Devices and Protocols
=====================

The type of CXL device (Memory, Accelerator, etc.) dictates many configuration
steps. This section covers some basic background on device types and on-device
resources used by the platform and OS which impact configuration.

Protocols
=========

There are three core protocols to CXL.  For the purpose of this documentation,
we will only discuss very high level definitions as the specific hardware
details are largely abstracted away from Linux.  See the CXL specification
for more details.

CXL.io
------
The basic interaction protocol, similar to PCIe configuration mechanisms.
Typically used for initialization, configuration, and I/O access for anything
other than memory (CXL.mem) or cache (CXL.cache) operations.

The Linux CXL driver exposes access to .io functionality via the various sysfs
interfaces and /dev/cxl/ devices (which expose direct access to device
mailboxes).

CXL.cache
---------
The mechanism by which a device may coherently access and cache host memory.

Largely transparent to Linux once configured.

CXL.mem
-------
The mechanism by which the CPU may coherently access and cache device memory.

Largely transparent to Linux once configured.


Device Types
============

Type-1
------

A Type-1 CXL device:

* Supports cxl.io and cxl.cache protocols
* Implements a fully coherent cache
* Allows Device-to-Host coherence and Host-to-Device snoops.
* Does NOT have host-managed device memory (HDM)

A typical example of a type-1 device is a Smart NIC - which may want to
directly operate on host-memory (DMA) to store incoming packets. These
devices largely rely on CPU-attached memory.

Type-2
------

A Type-2 CXL Device:

* Supports cxl.io, cxl.cache, and cxl.mem protocols
* Optionally implements coherent cache and Host-Managed Device Memory
* Is typically an accelerator device with high bandwidth memory.

The primary difference between a type-1 and type-2 device is the presence
of host-managed device memory, which allows the device to operate on a
local memory bank - while the CPU still has coherent DMA to the same memory.

This allows things like GPUs to expose their memory via DAX devices or file
descriptors, giving drivers and programs direct access to device memory
rather than requiring block-transfer semantics.

Type-3
------

A Type-3 CXL Device:

* Supports cxl.io and cxl.mem
* Implements Host-Managed Device Memory
* May provide either Volatile or Persistent memory capacity (or both).

A basic example of a type-3 device is a simple memory expander, whose
local memory capacity is exposed to the CPU for access directly via
basic coherent DMA.
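
The protocol combinations listed for each device type above can be expressed
as a simple lookup.  This is an illustrative sketch only; the
:code:`CXL_PROTO_*` flags, :code:`enum cxl_dev_type` and
:code:`cxl_classify` are hypothetical names and are not taken from the Linux
CXL driver.

```c
#include <assert.h>

/* Hypothetical protocol capability flags. */
#define CXL_PROTO_IO    (1u << 0)
#define CXL_PROTO_CACHE (1u << 1)
#define CXL_PROTO_MEM   (1u << 2)
#define CXL_PROTO_ALL   (CXL_PROTO_IO | CXL_PROTO_CACHE | CXL_PROTO_MEM)

enum cxl_dev_type {
    CXL_TYPE_UNKNOWN = 0,
    CXL_TYPE_1 = 1, /* cxl.io + cxl.cache */
    CXL_TYPE_2 = 2, /* cxl.io + cxl.cache + cxl.mem */
    CXL_TYPE_3 = 3, /* cxl.io + cxl.mem */
};

/* Map a set of supported protocols to the device type described above. */
static enum cxl_dev_type cxl_classify(unsigned int protocols)
{
    assert((protocols & ~CXL_PROTO_ALL) == 0); /* reject unknown bits */

    switch (protocols) {
    case CXL_PROTO_IO | CXL_PROTO_CACHE:
        return CXL_TYPE_1;
    case CXL_PROTO_IO | CXL_PROTO_CACHE | CXL_PROTO_MEM:
        return CXL_TYPE_2;
    case CXL_PROTO_IO | CXL_PROTO_MEM:
        return CXL_TYPE_3;
    default:
        return CXL_TYPE_UNKNOWN;
    }
}
```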

Switch
------

A CXL switch is a device capable of routing any CXL (and by extension, PCIe)
protocol between upstream, downstream, or peer devices.  Many devices, such
as Multi-Logical Devices, imply the presence of switching in some manner.

Logical Devices and Heads
-------------------------

A CXL device may present one or more "Logical Devices" to one or more hosts
(via physical "Heads").

A Single-Logical Device (SLD) is a device which presents a single device to
one or more heads.

A Multi-Logical Device (MLD) is a device which may present multiple devices
to one or more heads.

A Single-Headed Device exposes only a single physical connection.

A Multi-Headed Device exposes multiple physical connections.

MHSLD
~~~~~
A Multi-Headed Single-Logical Device (MHSLD) exposes a single logical
device to multiple heads which may be connected to one or more discrete
hosts.  An example of this would be a simple memory-pool which may be
statically configured (prior to boot) to expose portions of its memory
to Linux via :doc:`CEDT <../platform/acpi/cedt>`.

MHMLD
~~~~~
A Multi-Headed Multi-Logical Device (MHMLD) exposes multiple logical
devices to multiple heads which may be connected to one or more discrete
hosts.  An example of this would be a Dynamic Capacity Device which
may be configured at runtime to expose portions of its memory to Linux.

Example Devices
===============

Memory Expander
---------------
The simplest form of Type-3 device is a memory expander.  A memory expander
exposes Host-Managed Device Memory (HDM) to Linux.  This memory may be
Volatile or Non-Volatile (Persistent).

Memory Expanders will typically be considered a form of Single-Headed,
Single-Logical Device - as their form factor will typically be an add-in-card
(AIC) or some other similar form-factor.

The Linux CXL driver provides support for static or dynamic configuration of
basic memory expanders.  The platform may program decoders prior to OS init
(e.g. auto-decoders), or the user may program the fabric if the platform
defers these operations to the OS.

Multiple Memory Expanders may be added to an external chassis and exposed to
a host via a head attached to a CXL switch.  This is a "memory pool", and
would be considered an MHSLD or MHMLD depending on the management capabilities
provided by the switch platform.

As of v6.14, Linux does not provide a formalized interface to manage non-DCD
MHSLD or MHMLD devices.

Dynamic Capacity Device (DCD)
-----------------------------

A Dynamic Capacity Device is a Type-3 device which provides dynamic management
of memory capacity. The basic premise of a DCD is to provide an allocator-like
interface for physical memory capacity to a "Fabric Manager" (an external,
privileged host with privileges to change configurations for other hosts).

A DCD manages "Memory Extents", which may be volatile or persistent. Extents
may also be exclusive to a single host or shared across multiple hosts.

As of v6.14, Linux does not provide a formalized interface to manage DCD
devices; however, there is active work on LKML targeting a future release.