Merge branch 'for-6.16/cxl-docs' into cxl-for-next (58dfd959) · Commits · git / linux-net

Documentation/driver-api/cxl/allocation/dax.rst

0 → 100644

+60 −0

Original line number	Diff line number	Diff line
		.. SPDX-License-Identifier: GPL-2.0

		===========
		DAX Devices
		===========
		CXL capacity exposed as a DAX device can be accessed directly via mmap.
		Users may wish to use this interface mechanism to write their own userland
		CXL allocator, or to manage shared or persistent memory regions across multiple
		hosts.

		If the capacity is shared across hosts or persistent, appropriate flushing
		mechanisms must be employed unless the region supports Snoop Back-Invalidate.

		Note that mappings must be aligned (size and base) to the dax device's base
		alignment, which is typically 2MB - but may be configured larger.

		::

		  #include <fcntl.h>
		  #include <inttypes.h>
		  #include <stdint.h>
		  #include <stdio.h>
		  #include <stdlib.h>
		  #include <sys/mman.h>
		  #include <unistd.h>

		  #define DEVICE_PATH "/dev/dax0.0" /* Replace with your DAX device path */
		  #define DEVICE_SIZE (4ULL * 1024 * 1024 * 1024) /* 4GB */

		  int main(void)
		  {
		      int fd;
		      void *mapped_addr;

		      /* Open the DAX device */
		      fd = open(DEVICE_PATH, O_RDWR);
		      if (fd < 0) {
		          perror("open");
		          return EXIT_FAILURE;
		      }

		      /* Map the device into memory (size and offset must be aligned) */
		      mapped_addr = mmap(NULL, DEVICE_SIZE, PROT_READ | PROT_WRITE,
		                         MAP_SHARED, fd, 0);
		      if (mapped_addr == MAP_FAILED) {
		          perror("mmap");
		          close(fd);
		          return EXIT_FAILURE;
		      }

		      printf("Mapped address: %p\n", mapped_addr);

		      /* You can now access the device through the mapped address */
		      uint64_t *ptr = mapped_addr;
		      *ptr = 0x1234567890abcdefULL; /* Write a value to the device */
		      printf("Value at address %p: 0x%016" PRIx64 "\n", (void *)ptr, *ptr);

		      /* Clean up */
		      munmap(mapped_addr, DEVICE_SIZE);
		      close(fd);
		      return 0;
		  }
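
The alignment requirement described above can be handled with a few rounding helpers applied before calling mmap. The following is a minimal sketch, assuming a 2MB alignment; ``DAX_ALIGN``, ``dax_align_down``, ``dax_align_up`` and ``dax_is_aligned`` are hypothetical names chosen for illustration, and the real alignment should be queried from the device rather than hard-coded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed alignment: 2MB is typical, but a device may be configured larger. */
#define DAX_ALIGN (2ULL * 1024 * 1024)

/* Round len down to a multiple of align (align must be a power of two). */
static inline uint64_t dax_align_down(uint64_t len, uint64_t align)
{
	return len & ~(align - 1);
}

/* Round len up to a multiple of align (align must be a power of two). */
static inline uint64_t dax_align_up(uint64_t len, uint64_t align)
{
	return (len + align - 1) & ~(align - 1);
}

/* True if both the mapping offset and the length satisfy the alignment. */
static inline bool dax_is_aligned(uint64_t offset, uint64_t len, uint64_t align)
{
	return ((offset | len) & (align - 1)) == 0;
}
```

A caller would typically round the requested length with ``dax_align_up`` and check the result with ``dax_is_aligned`` before passing it to mmap.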

Documentation/driver-api/cxl/allocation/hugepages.rst

0 → 100644

+32 −0

		.. SPDX-License-Identifier: GPL-2.0

		==========
		Huge Pages
		==========

		Contiguous Memory Allocator
		===========================
		CXL Memory onlined as SystemRAM during early boot is eligible for use by CMA,
		as the NUMA node hosting that capacity will be `Online` at the time CMA
		carves out contiguous capacity.

		CXL Memory deferred to the CXL Driver for configuration cannot have its
		capacity allocated by CMA - as the NUMA node hosting the capacity is `Offline`
		at :code:`__init` time - when CMA carves out contiguous capacity.

		HugeTLB
		=======
		Different huge page sizes allow different memory configurations.

		2MB Huge Pages
		--------------
		All CXL capacity regardless of configuration time or memory zone is eligible
		for use as 2MB huge pages.

		1GB Huge Pages
		--------------
		CXL capacity onlined in :code:`ZONE_NORMAL` is eligible for 1GB Gigantic Page
		allocation.

		CXL capacity onlined in :code:`ZONE_MOVABLE` is not eligible for 1GB Gigantic
		Page allocation.
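
The HugeTLB eligibility rules above can be summarized as a small lookup. This is only an illustrative model of the text, not a kernel interface; the enum and function names are hypothetical.

```c
#include <stdbool.h>

/* Zone the CXL capacity was onlined into (model only). */
enum cxl_zone { CXL_ZONE_NORMAL, CXL_ZONE_MOVABLE };

/* Huge page sizes discussed in this section. */
enum huge_size { HUGE_2MB, HUGE_1GB };

/* Returns true if capacity in the given zone may back the given page size. */
static bool hugetlb_eligible(enum cxl_zone zone, enum huge_size size)
{
	if (size == HUGE_2MB)
		return true;                /* 2MB pages: any zone is eligible */
	return zone == CXL_ZONE_NORMAL;     /* 1GB pages: ZONE_NORMAL only */
}
```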

Documentation/driver-api/cxl/allocation/page-allocator.rst

0 → 100644

+85 −0

		.. SPDX-License-Identifier: GPL-2.0

		==================
		The Page Allocator
		==================

		The kernel page allocator services all general page allocation requests, such
		as :code:`kmalloc`. CXL configuration steps affect the behavior of the page
		allocator based on the selected `Memory Zone` and `NUMA node` the capacity is
		placed in.

		This section mostly focuses on how these configurations affect the page
		allocator (as of Linux v6.15) rather than the overall page allocator behavior.

		NUMA nodes and mempolicy
		========================
		Unless a task explicitly registers a mempolicy, the default memory policy
		of the Linux kernel is to allocate memory from the `local NUMA node` first,
		and fall back to other nodes only if the local node is pressured.

		Generally, we expect to see local DRAM and CXL memory on separate NUMA nodes,
		with the CXL memory being non-local. Technically, however, it is possible
		for a compute node to have no local DRAM, and for CXL memory to be the
		`local` capacity for that compute node.


		Memory Zones
		============
		CXL capacity may be onlined in :code:`ZONE_NORMAL` or :code:`ZONE_MOVABLE`.

		As of v6.15, the page allocator attempts to allocate from the highest
		available and compatible ZONE for an allocation from the local node first.

		An example of a `zone incompatibility` is attempting to service an allocation
		marked :code:`GFP_KERNEL` from :code:`ZONE_MOVABLE`. Kernel allocations are
		typically not migratable, and as a result can only be serviced from
		:code:`ZONE_NORMAL` or lower.

		To simplify this, the page allocator will prefer :code:`ZONE_MOVABLE` over
		:code:`ZONE_NORMAL` by default, but if :code:`ZONE_MOVABLE` is depleted, it
		will fall back to allocating from :code:`ZONE_NORMAL`.


		Zone and Node Quirks
		====================
		Let's consider a configuration where the local DRAM capacity is largely onlined
		into :code:`ZONE_NORMAL`, with no :code:`ZONE_MOVABLE` capacity present. The
		CXL capacity has the opposite configuration - all onlined in
		:code:`ZONE_MOVABLE`.

		Under the default allocation policy, the page allocator will completely skip
		:code:`ZONE_MOVABLE` as a valid allocation target. This is because, as of
		Linux v6.15, the page allocator does (approximately) the following: ::

		  for (each zone in local_node):

		    for (each node in fallback_order):

		      attempt_allocation(gfp_flags);

		Because the local node does not have :code:`ZONE_MOVABLE`, the CXL node is
		functionally unreachable for direct allocation. As a result, the only way
		for CXL capacity to be used is via `demotion` in the reclaim path.

		This configuration also means that if the DRAM node has :code:`ZONE_MOVABLE`
		capacity - when that capacity is depleted, the page allocator will actually
		prefer CXL :code:`ZONE_MOVABLE` pages over DRAM :code:`ZONE_NORMAL` pages.

		We may wish to invert this priority in future Linux versions.

		If `demotion` and `swap` are disabled, Linux will begin to cause OOM crashes
		when the DRAM nodes are depleted. See the reclaim section for more details.
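
The zone-then-node iteration described in this section can be modeled in a few lines of C. This is a simplified sketch of the approximate behavior stated above, not the kernel's actual allocator; ``struct node_model`` and ``alloc_page_model`` are hypothetical names, and the request is assumed to be one that may be served from :code:`ZONE_MOVABLE`.

```c
#include <stddef.h>

enum zone_model { Z_NORMAL = 0, Z_MOVABLE = 1, Z_COUNT = 2 };

struct node_model {
	int  present[Z_COUNT]; /* 1 if the node has any capacity in this zone */
	long free[Z_COUNT];    /* free pages remaining in this zone */
};

/*
 * Allocate one page. fallback[] lists node ids in fallback order, with the
 * local node first. Returns the node id that served the request, or -1.
 */
static int alloc_page_model(struct node_model *nodes, const int *fallback,
			    size_t nr)
{
	if (nr == 0)
		return -1;

	const struct node_model *local = &nodes[fallback[0]];

	/* Outer loop: zones of the local node, highest zone first. */
	for (int z = Z_COUNT - 1; z >= 0; z--) {
		if (!local->present[z])
			continue; /* zone absent locally: never tried anywhere */

		/* Inner loop: every node in fallback order, same zone only. */
		for (size_t i = 0; i < nr; i++) {
			struct node_model *n = &nodes[fallback[i]];

			if (n->free[z] > 0) {
				n->free[z]--;
				return fallback[i];
			}
		}
	}
	return -1;
}
```

With a DRAM node that only has :code:`ZONE_NORMAL` and a CXL node that only has :code:`ZONE_MOVABLE`, this model never returns the CXL node. If the DRAM node also has :code:`ZONE_MOVABLE`, the model serves requests from DRAM movable, then CXL movable, then DRAM normal.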


		CGroups and CPUSets
		===================
		Finally, assuming CXL memory is reachable via the page allocator (i.e. onlined
		in :code:`ZONE_NORMAL`), the :code:`cpusets.mems_allowed` may be used by
		containers to limit the accessibility of certain NUMA nodes for tasks in that
		container. Users may wish to utilize this in multi-tenant systems where some
		tasks prefer not to use slower memory.

		In the reclaim section we'll discuss some limitations of this interface to
		prevent demotions of shared data to CXL memory (if demotions are enabled).

Documentation/driver-api/cxl/allocation/reclaim.rst

0 → 100644

+51 −0

		.. SPDX-License-Identifier: GPL-2.0

		=======
		Reclaim
		=======
		Another way CXL memory can be utilized indirectly is via the reclaim system
		in :code:`mm/vmscan.c`. Reclaim is engaged when memory capacity on the system
		becomes pressured based on global and cgroup-local `watermark` settings.

		In this section we won't discuss the `watermark` configurations, just how CXL
		memory can be consumed by various pieces of the reclaim system.

		Demotion
		========
		By default, the reclaim system will prefer swap (or zswap) when reclaiming
		memory. Enabling :code:`/sys/kernel/mm/numa/demotion_enabled` will cause vmscan
		to opportunistically prefer distant NUMA nodes to swap or zswap, if capacity
		is available.

		Demotion engages the :code:`mm/memory_tier.c` component to determine the
		next demotion node. The next demotion node is based on the :code:`HMAT`
		or :code:`CDAT` performance data.
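
A program that depends on demotion may want to check whether it is enabled. The sketch below assumes the attribute is exposed under sysfs at ``/sys/kernel/mm/numa/demotion_enabled`` and reads back as ``true`` or ``false``; ``parse_demotion_flag`` and ``read_demotion_enabled`` are hypothetical helper names.

```c
#include <stdio.h>
#include <string.h>

/*
 * Interpret the text read from the demotion_enabled attribute.
 * Returns 1 for enabled, 0 for disabled, -1 if the text is not recognized.
 */
static int parse_demotion_flag(const char *text)
{
	if (strncmp(text, "true", 4) == 0 || text[0] == '1')
		return 1;
	if (strncmp(text, "false", 5) == 0 || text[0] == '0')
		return 0;
	return -1;
}

/* Read the attribute at path. Returns -1 if it cannot be opened or read. */
static int read_demotion_enabled(const char *path)
{
	char buf[16] = "";
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);
	return parse_demotion_flag(buf);
}
```

Calling ``read_demotion_enabled("/sys/kernel/mm/numa/demotion_enabled")`` returns -1 on kernels or systems where the attribute is absent, which callers should treat as "unknown" rather than "disabled".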

		cpusets.mems_allowed quirk
		--------------------------
		In Linux v6.15 and below, demotion does not respect :code:`cpusets.mems_allowed`
		when migrating pages. As a result, if demotion is enabled, vmscan cannot
		guarantee isolation of a container's memory from nodes not set in mems_allowed.

		In Linux v6.XX and up, demotion does attempt to respect
		:code:`cpusets.mems_allowed`; however, certain classes of shared memory
		originally instantiated by another cgroup (such as common libraries - e.g.
		libc) may still be demoted. As a result, the mems_allowed interface still
		cannot provide perfect isolation from the remote nodes.

		ZSwap and Node Preference
		=========================
		In Linux v6.15 and below, ZSwap allocates memory from the local node of the
		processor for the new pages being compressed. Since pages being compressed
		are typically cold, the result is that a cold page is promoted - only to
		be demoted again later as it ages off the LRU.

		In Linux v6.XX, ZSwap tries to prefer the node of the page being compressed
		as the allocation target for the compression page. This helps prevent
		thrashing.

		Demotion with ZSwap
		===================
		When enabling both Demotion and ZSwap, you create a situation where ZSwap
		will prefer the slowest form of CXL memory by default until that tier of
		memory is exhausted.

Documentation/driver-api/cxl/devices/device-types.rst

0 → 100644

+165 −0

		.. SPDX-License-Identifier: GPL-2.0

		=====================
		Devices and Protocols
		=====================

		The type of CXL device (Memory, Accelerator, etc) dictates many configuration steps. This section
		covers some basic background on device types and on-device resources used by the platform and OS
		which impact configuration.

		Protocols
		=========

		There are three core protocols to CXL. For the purpose of this documentation,
		we will only discuss very high level definitions as the specific hardware
		details are largely abstracted away from Linux. See the CXL specification
		for more details.

		CXL.io
		------
		The basic interaction protocol, similar to PCIe configuration mechanisms.
		Typically used for initialization, configuration, and I/O access for anything
		other than memory (CXL.mem) or cache (CXL.cache) operations.

		The Linux CXL driver exposes access to .io functionality via the various sysfs
		interfaces and /dev/cxl/ devices (which expose direct access to device
		mailboxes).

		CXL.cache
		---------
		The mechanism by which a device may coherently access and cache host memory.

		Largely transparent to Linux once configured.

		CXL.mem
		---------
		The mechanism by which the CPU may coherently access and cache device memory.

		Largely transparent to Linux once configured.


		Device Types
		============

		Type-1
		------

		A Type-1 CXL device:

		* Supports cxl.io and cxl.cache protocols
		* Implements a fully coherent cache
		* Allows Device-to-Host coherence and Host-to-Device snoops.
		* Does NOT have host-managed device memory (HDM)

		A typical example of a type-1 device is a Smart NIC - which may want to
		directly operate on host-memory (DMA) to store incoming packets. These
		devices largely rely on CPU-attached memory.

		Type-2
		------

		A Type-2 CXL Device:

		* Supports cxl.io, cxl.cache, and cxl.mem protocols
		* Optionally implements coherent cache and Host-Managed Device Memory
		* Is typically an accelerator device with high bandwidth memory.

		The primary difference between a type-1 and type-2 device is the presence
		of host-managed device memory, which allows the device to operate on a
		local memory bank - while the CPU still has coherent DMA to the same memory.

		This allows things like GPUs to expose their memory via DAX devices or file
		descriptors, allowing drivers and programs direct access to device memory
		rather than using block-transfer semantics.

		Type-3
		------

		A Type-3 CXL Device:

		* Supports cxl.io and cxl.mem
		* Implements Host-Managed Device Memory
		* May provide either Volatile or Persistent memory capacity (or both).

		A basic example of a type-3 device is a simple memory expander, whose
		local memory capacity is exposed to the CPU for access directly via
		basic coherent DMA.
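
The protocol sets of the three device types above can be expressed as a bitmask table. This is an illustrative sketch only; the enum and function names are hypothetical and do not correspond to the Linux CXL driver's data structures.

```c
#include <stdbool.h>

/* One bit per CXL protocol. */
enum cxl_proto {
	CXL_PROTO_IO    = 1 << 0,
	CXL_PROTO_CACHE = 1 << 1,
	CXL_PROTO_MEM   = 1 << 2,
};

/* Protocols supported by each device type. Returns 0 for an unknown type. */
static unsigned int cxl_type_protocols(int type)
{
	switch (type) {
	case 1:
		return CXL_PROTO_IO | CXL_PROTO_CACHE;
	case 2:
		return CXL_PROTO_IO | CXL_PROTO_CACHE | CXL_PROTO_MEM;
	case 3:
		return CXL_PROTO_IO | CXL_PROTO_MEM;
	default:
		return 0;
	}
}

/* Only devices speaking CXL.mem can expose host-managed device memory. */
static bool cxl_type_may_have_hdm(int type)
{
	return (cxl_type_protocols(type) & CXL_PROTO_MEM) != 0;
}
```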

		Switch
		------

		A CXL switch is a device capable of routing any CXL (and by extension, PCIe)
		protocol between upstream, downstream, or peer devices. Many devices, such
		as Multi-Logical Devices, imply the presence of switching in some manner.

		Logical Devices and Heads
		-------------------------

		A CXL device may present one or more "Logical Devices" to one or more hosts
		(via physical "Heads").

		A Single-Logical Device (SLD) is a device which presents a single device to
		one or more heads.

		A Multi-Logical Device (MLD) is a device which may present multiple devices
		to one or more heads.

		A Single-Headed Device exposes only a single physical connection.

		A Multi-Headed Device exposes multiple physical connections.

		MHSLD
		~~~~~
		A Multi-Headed Single-Logical Device (MHSLD) exposes a single logical
		device to multiple heads which may be connected to one or more discrete
		hosts. An example of this would be a simple memory-pool which may be
		statically configured (prior to boot) to expose portions of its memory
		to Linux via :doc:`CEDT <../platform/acpi/cedt>`.

		MHMLD
		~~~~~
		A Multi-Headed Multi-Logical Device (MHMLD) exposes multiple logical
		devices to multiple heads which may be connected to one or more discrete
		hosts. An example of this would be a Dynamic Capacity Device which
		may be configured at runtime to expose portions of its memory to Linux.
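
The naming scheme above reduces to two questions: does the device have more than one head, and does it present more than one logical device? A minimal sketch of that classification follows; the enum and function names are hypothetical and used only for illustration.

```c
/* Device classes named in this section. */
enum cxl_class { CXL_SLD, CXL_MLD, CXL_MHSLD, CXL_MHMLD };

/*
 * Classify a device by its number of physical heads and the number of
 * logical devices it presents.
 */
static enum cxl_class cxl_classify(unsigned int heads,
				   unsigned int logical_devices)
{
	int multi_head = heads > 1;
	int multi_ld = logical_devices > 1;

	if (multi_head)
		return multi_ld ? CXL_MHMLD : CXL_MHSLD;
	return multi_ld ? CXL_MLD : CXL_SLD;
}
```

For example, a memory pool with four heads that presents one logical device would be classified as ``CXL_MHSLD`` under this model.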

		Example Devices
		===============

		Memory Expander
		---------------
		The simplest form of Type-3 device is a memory expander. A memory expander
		exposes Host-Managed Device Memory (HDM) to Linux. This memory may be
		Volatile or Non-Volatile (Persistent).

		A memory expander will typically be considered a form of Single-Headed,
		Single-Logical Device - as its form factor will typically be an add-in-card
		(AIC) or some other similar form-factor.

		The Linux CXL driver provides support for static or dynamic configuration of
		basic memory expanders. The platform may program decoders prior to OS init
		(e.g. auto-decoders), or the user may program the fabric if the platform
		defers these operations to the OS.

		Multiple Memory Expanders may be added to an external chassis and exposed to
		a host via a head attached to a CXL switch. This is a "memory pool", and
		would be considered an MHSLD or MHMLD depending on the management capabilities
		provided by the switch platform.

		As of v6.14, Linux does not provide a formalized interface to manage non-DCD
		MHSLD or MHMLD devices.

		Dynamic Capacity Device (DCD)
		-----------------------------

		A Dynamic Capacity Device is a Type-3 device which provides dynamic management
		of memory capacity. The basic premise of a DCD is to provide an allocator-like
		interface for physical memory capacity to a "Fabric Manager" (an external,
		privileged host with privileges to change configurations for other hosts).

		A DCD manages "Memory Extents", which may be volatile or persistent. Extents
		may also be exclusive to a single host or shared across multiple hosts.

		As of v6.14, Linux does not provide a formalized interface to manage DCD
		devices; however, there is active work on LKML targeting a future release.