Commit 29e93590 authored by Linus Torvalds
Pull Compute Express Link (CXL) updates from Dave Jiang:

 - Remove always true condition in cxl features code

 - Add verification of CHBS length for CXL 2.0

 - Ignore interleave granularity when interleave ways is 1

 - Add update addressing missing MODULE_DESCRIPTION for cxl_test

 - A series of cleanups/refactor to prep for AMD Zen5 translate code

 - Clean %pa debug printk in core/hdm.c

 - Documentation updates:
     - Update to CXL Maturity Map
     - Fixes to source linking in CXL documentation
     - CXL documentation fixes, spelling corrections
     - A large collection of CXL documentation for the entire CXL
       subsystem, including documentation on CXL related platform and
       firmware notes

 - Remove redundant code of cxlctl_get_supported_features()

 - Series to support CXL RAS Features
     - Including "Patrol Scrub Control", "Error Check Scrub",
       "Performance Maintenance" and "Memory Sparing". The series
       connects CXL to EDAC.

* tag 'cxl-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (53 commits)
  cxl/edac: Add CXL memory device soft PPR control feature
  cxl/edac: Add CXL memory device memory sparing control feature
  cxl/edac: Support for finding memory operation attributes from the current boot
  cxl/edac: Add support for PERFORM_MAINTENANCE command
  cxl/edac: Add CXL memory device ECS control feature
  cxl/edac: Add CXL memory device patrol scrub control feature
  cxl: Update prototype of function get_support_feature_info()
  EDAC: Update documentation for the CXL memory patrol scrub control feature
  cxl/features: Remove the inline specifier from to_cxlfs()
  cxl/feature: Remove redundant code of get supported features
  docs: ABI: Fix "firwmare" to "firmware"
  cxl/Documentation: Fix typo in sysfs write_bandwidth attribute path
  cxl: doc/linux/access-coordinates Update access coordinates calculation methods
  cxl: docs/platform/acpi/srat Add generic target documentation
  cxl: docs/platform/cdat reference documentation
  Documentation: Update the CXL Maturity Map
  cxl: Sync up the driver-api/cxl documentation
  cxl: docs - add self-referencing cross-links
  cxl: docs/allocation/hugepages
  cxl: docs/allocation/reclaim
  ...
parents a9dfb7db 9f153b7f
+2 −2
@@ -242,7 +242,7 @@ Description:
		decoding a Host Physical Address range. Note that this number
		may be elevated without any regionX objects active or even
		enumerated, as this may be due to decoders established by
-		platform firwmare or a previous kernel (kexec).
+		platform firmware or a previous kernel (kexec).


What:		/sys/bus/cxl/devices/decoderX.Y
@@ -572,7 +572,7 @@ Description:


What:		/sys/bus/cxl/devices/regionZ/accessY/read_bandwidth
-		/sys/bus/cxl/devices/regionZ/accessY/write_banwidth
+		/sys/bus/cxl/devices/regionZ/accessY/write_bandwidth
Date:		Jan, 2024
KernelVersion:	v6.9
Contact:	linux-cxl@vger.kernel.org
+60 −0
.. SPDX-License-Identifier: GPL-2.0

===========
DAX Devices
===========
CXL capacity exposed as a DAX device can be accessed directly via mmap.
Users may wish to use this interface to write their own userland CXL
allocator, or to manage shared or persistent memory regions across multiple
hosts.

If the capacity is shared across hosts or persistent, appropriate flushing
mechanisms must be employed unless the region supports Snoop Back-Invalidate.

Note that mappings must be aligned (size and base) to the dax device's base
alignment, which is typically 2MB, but may be configured larger.

::

  #include <stdio.h>
  #include <stdlib.h>
  #include <stdint.h>
  #include <inttypes.h>
  #include <sys/mman.h>
  #include <fcntl.h>
  #include <unistd.h>

  #define DEVICE_PATH "/dev/dax0.0" /* Replace with your DAX device path */
  #define DEVICE_SIZE (4ULL * 1024 * 1024 * 1024) /* 4GB */

  int main(void) {
      int fd;
      void *mapped_addr;

      /* Open the DAX device */
      fd = open(DEVICE_PATH, O_RDWR);
      if (fd < 0) {
          perror("open");
          return EXIT_FAILURE;
      }

      /* Map the device into memory; size and offset must be aligned */
      mapped_addr = mmap(NULL, DEVICE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
      if (mapped_addr == MAP_FAILED) {
          perror("mmap");
          close(fd);
          return EXIT_FAILURE;
      }

      printf("Mapped address: %p\n", mapped_addr);

      /* You can now access the device through the mapped address */
      uint64_t *ptr = (uint64_t *)mapped_addr;
      *ptr = 0x1234567890abcdefULL; /* Write a value to the device */
      printf("Value at address %p: 0x%016" PRIx64 "\n", (void *)ptr, *ptr);

      /* Clean up */
      munmap(mapped_addr, DEVICE_SIZE);
      close(fd);
      return EXIT_SUCCESS;
  }
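
The alignment requirement noted above can be checked before calling mmap.
The following is a minimal sketch, not part of any kernel or library API:
the helper names and the :code:`DAX_ALIGN_2M` constant are hypothetical, and
the 2MB value is only the typical alignment mentioned above. A real program
should obtain the alignment from the device rather than assume it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Typical device-dax alignment; an assumption, real devices may use more. */
#define DAX_ALIGN_2M (2ULL * 1024 * 1024)

/* Round len down to a multiple of align (align must be a power of two). */
static inline uint64_t dax_align_down(uint64_t len, uint64_t align)
{
    return len & ~(align - 1);
}

/* True when offset and length are both non-trivial multiples of align. */
static inline bool dax_range_is_aligned(uint64_t offset, uint64_t len,
                                        uint64_t align)
{
    return len != 0 && ((offset | len) & (align - 1)) == 0;
}
```

A caller could reject or trim a requested mapping length with these helpers
before passing it to mmap, instead of discovering the misalignment from a
failed call.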
+32 −0
.. SPDX-License-Identifier: GPL-2.0

==========
Huge Pages
==========

Contiguous Memory Allocator
===========================
CXL Memory onlined as SystemRAM during early boot is eligible for use by CMA,
as the NUMA node hosting that capacity will be `Online` at the time CMA
carves out contiguous capacity.

CXL Memory deferred to the CXL Driver for configuration cannot have its
capacity allocated by CMA, because the NUMA node hosting the capacity is
`Offline` at :code:`__init` time, when CMA carves out contiguous capacity.

HugeTLB
=======
Different huge page sizes allow different memory configurations.

2MB Huge Pages
--------------
All CXL capacity, regardless of configuration time or memory zone, is eligible
for use as 2MB huge pages.

1GB Huge Pages
--------------
CXL capacity onlined in :code:`ZONE_NORMAL` is eligible for 1GB Gigantic Page
allocation.

CXL capacity onlined in :code:`ZONE_MOVABLE` is not eligible for 1GB Gigantic
Page allocation.
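
The HugeTLB eligibility rules above fit in a small lookup. This is only an
illustrative sketch of the rules as stated in this section; the enum and
function names are hypothetical and do not correspond to kernel symbols.

```c
#include <stdbool.h>

/* Hypothetical labels for the zone the CXL capacity was onlined into. */
enum cxl_zone { CXL_ZONE_NORMAL, CXL_ZONE_MOVABLE };

/* Hypothetical labels for the two huge page sizes discussed above. */
enum hpage_size { HPAGE_2MB, HPAGE_1GB };

/* Encodes the eligibility rules listed in this section. */
static bool cxl_hugetlb_eligible(enum cxl_zone zone, enum hpage_size size)
{
    if (size == HPAGE_2MB)
        return true;                /* 2MB pages: any zone is eligible */
    return zone == CXL_ZONE_NORMAL; /* 1GB pages: ZONE_NORMAL only */
}
```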
+85 −0
.. SPDX-License-Identifier: GPL-2.0

==================
The Page Allocator
==================

The kernel page allocator services all general page allocation requests, such
as :code:`kmalloc`.  CXL configuration steps affect the behavior of the page
allocator based on the selected `Memory Zone` and `NUMA node` the capacity is
placed in.

This section mostly focuses on how these configurations affect the page
allocator (as of Linux v6.15) rather than the overall page allocator behavior.

NUMA nodes and mempolicy
========================
Unless a task explicitly registers a mempolicy, the default memory policy
of the Linux kernel is to allocate memory from the `local NUMA node` first,
and fall back to other nodes only if the local node is under memory pressure.

Generally, we expect to see local DRAM and CXL memory on separate NUMA nodes,
with the CXL memory being non-local.  Technically, however, it is possible
for a compute node to have no local DRAM, and for CXL memory to be the
`local` capacity for that compute node.


Memory Zones
============
CXL capacity may be onlined in :code:`ZONE_NORMAL` or :code:`ZONE_MOVABLE`.

As of v6.15, the page allocator attempts to allocate from the highest
available and compatible ZONE for an allocation from the local node first.

An example of a `zone incompatibility` is attempting to service an allocation
marked :code:`GFP_KERNEL` from :code:`ZONE_MOVABLE`.  Kernel allocations are
typically not migratable, and as a result can only be serviced from
:code:`ZONE_NORMAL` or lower.

To simplify this, the page allocator will prefer :code:`ZONE_MOVABLE` over
:code:`ZONE_NORMAL` by default, but if :code:`ZONE_MOVABLE` is depleted, it
will fall back to allocating from :code:`ZONE_NORMAL`.


Zone and Node Quirks
====================
Let's consider a configuration where the local DRAM capacity is largely onlined
into :code:`ZONE_NORMAL`, with no :code:`ZONE_MOVABLE` capacity present. The
CXL capacity has the opposite configuration - all onlined in
:code:`ZONE_MOVABLE`.

Under the default allocation policy, the page allocator will completely skip
:code:`ZONE_MOVABLE` as a valid allocation target.  This is because, as of
Linux v6.15, the page allocator does (approximately) the following: ::

  for (each zone in local_node):

    for (each node in fallback_order):

      attempt_allocation(gfp_flags);

Because the local node does not have :code:`ZONE_MOVABLE`, the CXL node is
functionally unreachable for direct allocation.  As a result, the only way
for CXL capacity to be used is via `demotion` in the reclaim path.

This configuration also means that if the DRAM node has :code:`ZONE_MOVABLE`
capacity, then once that capacity is depleted, the page allocator will actually
prefer CXL :code:`ZONE_MOVABLE` pages over DRAM :code:`ZONE_NORMAL` pages.

We may wish to invert this priority in future Linux versions.

If `demotion` and `swap` are disabled, Linux will begin to cause OOM crashes
when the DRAM nodes are depleted. See the reclaim section for more details.
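
The quirks in this section can be reproduced with a toy model of the loop
described above. This is a sketch under stated assumptions, not kernel code:
every name is hypothetical, each node tracks only a free-page counter per
zone, and the outer loop walks only the zones present on the local node, as
the approximate loop above suggests.

```c
#include <stddef.h>

enum toy_zone { TOY_ZONE_NORMAL, TOY_ZONE_MOVABLE, TOY_NR_ZONES };

struct toy_node {
    int present[TOY_NR_ZONES];  /* 1 if the node has this zone at all */
    long free[TOY_NR_ZONES];    /* free pages remaining in each zone */
};

/*
 * Outer loop: zones that exist on the local node, highest first.
 * Inner loop: nodes in fallback order (fallback[0] is the local node).
 * highest_zone caps the search, e.g. TOY_ZONE_NORMAL for kernel allocations.
 * Returns the node that served the page, or -1 if no node could.
 */
static int toy_alloc_page(struct toy_node *nodes, const int *fallback,
                          size_t nr_nodes, enum toy_zone highest_zone,
                          enum toy_zone *zone_out)
{
    const struct toy_node *local = &nodes[fallback[0]];

    for (int z = highest_zone; z >= 0; z--) {
        if (!local->present[z])
            continue;   /* zone missing locally: never tried on any node */
        for (size_t i = 0; i < nr_nodes; i++) {
            struct toy_node *n = &nodes[fallback[i]];

            if (n->present[z] && n->free[z] > 0) {
                n->free[z]--;
                if (zone_out)
                    *zone_out = (enum toy_zone)z;
                return fallback[i];
            }
        }
    }
    return -1;
}
```

With a DRAM node that has only :code:`ZONE_NORMAL` and a CXL node that has
only :code:`ZONE_MOVABLE`, this model never returns the CXL node and fails
once DRAM is empty. Giving the DRAM node a small :code:`ZONE_MOVABLE` makes
the model serve movable requests from CXL before touching DRAM
:code:`ZONE_NORMAL`, matching the preference described above.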


CGroups and CPUSets
===================
Finally, assuming CXL memory is reachable via the page allocator (i.e. onlined
in :code:`ZONE_NORMAL`), the :code:`cpusets.mems_allowed` may be used by
containers to limit the accessibility of certain NUMA nodes for tasks in that
container.  Users may wish to utilize this in multi-tenant systems where some
tasks prefer not to use slower memory.

In the reclaim section we'll discuss some limitations of this interface to
prevent demotions of shared data to CXL memory (if demotions are enabled).
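
A tool that inspects which NUMA nodes a container may use needs to parse the
allowed-node list. The sketch below assumes the list is given in the usual
kernel list format (for example :code:`0-1,3`) and supports node ids 0 to 63
only. The function names are hypothetical; this is not a kernel or libnuma
interface.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a node list such as "0-1,3" into a bitmask (bit N set = node N).
 * Returns 0 on success, -1 on malformed input or node ids above 63.
 */
static int parse_node_list(const char *s, uint64_t *mask)
{
    *mask = 0;
    while (*s && *s != '\n') {
        char *end;
        unsigned long lo = strtoul(s, &end, 10);
        unsigned long hi = lo;

        if (end == s)
            return -1;          /* no digits where a node id was expected */
        if (*end == '-') {      /* range "lo-hi" */
            s = end + 1;
            hi = strtoul(s, &end, 10);
            if (end == s)
                return -1;
        }
        if (lo > hi || hi > 63)
            return -1;
        for (unsigned long n = lo; n <= hi; n++)
            *mask |= UINT64_C(1) << n;
        if (*end == ',')
            end++;
        else if (*end && *end != '\n')
            return -1;
        s = end;
    }
    return 0;
}

/* True when the given node id is set in the parsed mask. */
static bool node_allowed(uint64_t mask, unsigned int node)
{
    return node < 64 && (mask & (UINT64_C(1) << node));
}
```

For example, if the CXL capacity sits on node 2, a container whose parsed
list is :code:`0-1` would not be allowed to allocate from it directly.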
+51 −0
.. SPDX-License-Identifier: GPL-2.0

=======
Reclaim
=======
Another way CXL memory can be utilized *indirectly* is via the reclaim system
in :code:`mm/vmscan.c`.  Reclaim is engaged when memory capacity on the system
becomes pressured based on global and cgroup-local `watermark` settings.

In this section we won't discuss the `watermark` configurations, just how CXL
memory can be consumed by various pieces of the reclaim system.

Demotion
========
By default, the reclaim system will prefer swap (or zswap) when reclaiming
memory.  Enabling :code:`kernel/mm/numa/demotion_enabled` will cause vmscan
to opportunistically prefer distant NUMA nodes to swap or zswap, if capacity
is available.

Demotion engages the :code:`mm/memory_tier.c` component to determine the
next demotion node.  The next demotion node is based on the :code:`HMAT`
or :code:`CDAT` performance data.
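
Whether demotion is enabled can be checked from userspace by reading the
attribute named above. The helper below is a minimal sketch: the function
name is hypothetical, the accepted spellings of the boolean are an
assumption, and the path is passed in by the caller.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read a boolean attribute from a sysfs-style file.
 * Returns 1 for "true" or "1", 0 for "false" or "0",
 * and -1 on error or unrecognized content.
 */
static int read_bool_attr(const char *path)
{
    char buf[16] = "";
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */

    if (!strcmp(buf, "true") || !strcmp(buf, "1"))
        return 1;
    if (!strcmp(buf, "false") || !strcmp(buf, "0"))
        return 0;
    return -1;
}
```

Assuming sysfs is mounted at :code:`/sys`, a caller would pass
:code:`/sys/kernel/mm/numa/demotion_enabled` and treat a return value of -1
as "unknown" rather than "disabled".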

cpusets.mems_allowed quirk
--------------------------
In Linux v6.15 and below, demotion does not respect :code:`cpusets.mems_allowed`
when migrating pages.  As a result, if demotion is enabled, vmscan cannot
guarantee isolation of a container's memory from nodes not set in mems_allowed.

In Linux v6.XX and up, demotion does attempt to respect
:code:`cpusets.mems_allowed`; however, certain classes of shared memory
originally instantiated by another cgroup (such as common libraries - e.g.
libc) may still be demoted.  As a result, the mems_allowed interface still
cannot provide perfect isolation from the remote nodes.

ZSwap and Node Preference
=========================
In Linux v6.15 and below, ZSwap allocates memory from the local node of the
processor for the new pages being compressed.  Since pages being compressed
are typically cold, the result is that a cold page is promoted, only to
be demoted later as it ages off the LRU.

In Linux v6.XX, ZSwap tries to prefer the node of the page being compressed
as the allocation target for the compression page.  This helps prevent
thrashing.

Demotion with ZSwap
===================
When enabling both Demotion and ZSwap, you create a situation where ZSwap
will prefer the slowest form of CXL memory by default until that tier of
memory is exhausted.