Merge tag 'cxl-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl (29e93590) · Commits · git / linux-nf

Documentation/ABI/testing/sysfs-bus-cxl

+2 −2

Original line number	Diff line number	Diff line
		@@ -242,7 +242,7 @@ Description:
		decoding a Host Physical Address range. Note that this number
		may be elevated without any regionX objects active or even
		enumerated, as this may be due to decoders established by
		-platform firwmare or a previous kernel (kexec).
		+platform firmware or a previous kernel (kexec).


		What: /sys/bus/cxl/devices/decoderX.Y
		@@ -572,7 +572,7 @@ Description:


		What: /sys/bus/cxl/devices/regionZ/accessY/read_bandwidth
		-/sys/bus/cxl/devices/regionZ/accessY/write_banwidth
		+/sys/bus/cxl/devices/regionZ/accessY/write_bandwidth
		Date: Jan, 2024
		KernelVersion: v6.9
		Contact: linux-cxl@vger.kernel.org

Documentation/driver-api/cxl/allocation/dax.rst

0 → 100644

+60 −0

		.. SPDX-License-Identifier: GPL-2.0

		===========
		DAX Devices
		===========
		CXL capacity exposed as a DAX device can be accessed directly via mmap.
		Users may wish to use this interface to write their own userland CXL
		allocator, or to manage shared or persistent memory regions across multiple
		hosts.

		If the capacity is shared across hosts or persistent, appropriate flushing
		mechanisms must be employed unless the region supports Snoop Back-Invalidate.

		Note that mappings must be aligned (size and base) to the dax device's base
		alignment, which is typically 2MB but may be configured larger.

		::

		  #include <stdio.h>
		  #include <stdlib.h>
		  #include <stdint.h>
		  #include <sys/mman.h>
		  #include <fcntl.h>
		  #include <unistd.h>

		  #define DEVICE_PATH "/dev/dax0.0" /* Replace with your DAX device path */
		  #define DEVICE_SIZE (4ULL * 1024 * 1024 * 1024) /* 4GB */

		  int main(void)
		  {
		      int fd;
		      void *mapped_addr;

		      /* Open the DAX device */
		      fd = open(DEVICE_PATH, O_RDWR);
		      if (fd < 0) {
		          perror("open");
		          return EXIT_FAILURE;
		      }

		      /* Map the device into memory */
		      mapped_addr = mmap(NULL, DEVICE_SIZE, PROT_READ | PROT_WRITE,
		                         MAP_SHARED, fd, 0);
		      if (mapped_addr == MAP_FAILED) {
		          perror("mmap");
		          close(fd);
		          return EXIT_FAILURE;
		      }

		      printf("Mapped address: %p\n", mapped_addr);

		      /* You can now access the device through the mapped address */
		      uint64_t *ptr = (uint64_t *)mapped_addr;
		      *ptr = 0x1234567890abcdefULL; /* Write a value to the device */
		      /* Cast so the arguments match the %p and %llx conversions */
		      printf("Value at address %p: 0x%016llx\n", (void *)ptr,
		             (unsigned long long)*ptr);

		      /* Clean up */
		      munmap(mapped_addr, DEVICE_SIZE);
		      close(fd);
		      return EXIT_SUCCESS;
		  }
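
The alignment requirement described above can be checked before calling mmap. The following is a minimal sketch and not part of the patch: the helper names (`dax_is_aligned`, `dax_align_down`) and the `DAX_ALIGN_2MB` constant are hypothetical, and the 2MB value is only the typical default, so a real program should use the alignment the device was actually configured with.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed typical default; the device may be configured with a larger value. */
#define DAX_ALIGN_2MB (2ULL * 1024 * 1024)

/* Hypothetical helper: true if align is a nonzero power of two. */
static bool is_pow2(uint64_t align)
{
    return align != 0 && (align & (align - 1)) == 0;
}

/* True if value (a mapping offset or length) is a multiple of align. */
static bool dax_is_aligned(uint64_t value, uint64_t align)
{
    assert(is_pow2(align));
    return (value & (align - 1)) == 0;
}

/* Round len down to the nearest multiple of align. */
static uint64_t dax_align_down(uint64_t len, uint64_t align)
{
    assert(is_pow2(align));
    return len & ~(align - 1);
}
```

A caller might reject a requested offset when `dax_is_aligned()` is false and pass `dax_align_down(requested_len, align)` as the mmap length, which avoids a failed mapping caused by a partial trailing chunk.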

Documentation/driver-api/cxl/allocation/hugepages.rst

0 → 100644

+32 −0

		.. SPDX-License-Identifier: GPL-2.0

		==========
		Huge Pages
		==========

		Contiguous Memory Allocator
		===========================
		CXL Memory onlined as SystemRAM during early boot is eligible for use by CMA,
		as the NUMA node hosting that capacity will be `Online` at the time CMA
		carves out contiguous capacity.

		CXL Memory deferred to the CXL Driver for configuration cannot have its
		capacity allocated by CMA - as the NUMA node hosting the capacity is `Offline`
		at :code:`__init` time - when CMA carves out contiguous capacity.

		HugeTLB
		=======
		Different huge page sizes allow different memory configurations.

		2MB Huge Pages
		--------------
		All CXL capacity regardless of configuration time or memory zone is eligible
		for use as 2MB huge pages.

		1GB Huge Pages
		--------------
		CXL capacity onlined in :code:`ZONE_NORMAL` is eligible for 1GB Gigantic Page
		allocation.

		CXL capacity onlined in :code:`ZONE_MOVABLE` is not eligible for 1GB Gigantic
		Page allocation.
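
The HugeTLB eligibility rules above can be summarized as a small lookup. This is an illustrative sketch only: the enum and function names are hypothetical and do not correspond to kernel symbols, and the logic simply encodes the statements made in this section.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not kernel identifiers. */
enum cxl_zone { CXL_ZONE_NORMAL, CXL_ZONE_MOVABLE };
enum hpage_size { HPAGE_2MB, HPAGE_1GB };

/*
 * Encodes the rules stated in this section:
 *  - 2MB huge pages: eligible regardless of zone.
 *  - 1GB gigantic pages: eligible only when onlined in ZONE_NORMAL.
 */
static bool cxl_hugetlb_eligible(enum cxl_zone zone, enum hpage_size size)
{
    switch (size) {
    case HPAGE_2MB:
        return true;
    case HPAGE_1GB:
        return zone == CXL_ZONE_NORMAL;
    }
    assert(!"unknown huge page size");
    return false;
}
```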

Documentation/driver-api/cxl/allocation/page-allocator.rst

0 → 100644

+85 −0

		.. SPDX-License-Identifier: GPL-2.0

		==================
		The Page Allocator
		==================

		The kernel page allocator services all general page allocation requests, such
		as :code:`kmalloc`. CXL configuration steps affect the behavior of the page
		allocator based on the selected `Memory Zone` and `NUMA node` the capacity is
		placed in.

		This section mostly focuses on how these configurations affect the page
		allocator (as of Linux v6.15) rather than the overall page allocator behavior.

		NUMA nodes and mempolicy
		========================
		Unless a task explicitly registers a mempolicy, the default memory policy
		of the Linux kernel is to allocate memory from the `local NUMA node` first,
		and fall back to other nodes only if the local node is pressured.

		Generally, we expect to see local DRAM and CXL memory on separate NUMA nodes,
		with the CXL memory being non-local. Technically, however, it is possible
		for a compute node to have no local DRAM, and for CXL memory to be the
		`local` capacity for that compute node.


		Memory Zones
		============
		CXL capacity may be onlined in :code:`ZONE_NORMAL` or :code:`ZONE_MOVABLE`.

		As of v6.15, the page allocator attempts to allocate from the highest
		available and compatible ZONE for an allocation from the local node first.

		An example of a `zone incompatibility` is attempting to service an allocation
		marked :code:`GFP_KERNEL` from :code:`ZONE_MOVABLE`. Kernel allocations are
		typically not migratable, and as a result can only be serviced from
		:code:`ZONE_NORMAL` or lower.

		In short, for allocations that either zone can service, the page allocator
		prefers :code:`ZONE_MOVABLE` over :code:`ZONE_NORMAL` by default; if
		:code:`ZONE_MOVABLE` is depleted, it falls back to :code:`ZONE_NORMAL`.


		Zone and Node Quirks
		====================
		Let's consider a configuration where the local DRAM capacity is largely onlined
		into :code:`ZONE_NORMAL`, with no :code:`ZONE_MOVABLE` capacity present. The
		CXL capacity has the opposite configuration - all onlined in
		:code:`ZONE_MOVABLE`.

		Under the default allocation policy, the page allocator will completely skip
		:code:`ZONE_MOVABLE` as a valid allocation target. This is because, as of
		Linux v6.15, the page allocator does (approximately) the following: ::

		  for (each zone in local_node):
		    for (each node in fallback_order):
		      attempt_allocation(gfp_flags);

		Because the local node does not have :code:`ZONE_MOVABLE`, the CXL node is
		functionally unreachable for direct allocation. As a result, the only way
		for CXL capacity to be used is via `demotion` in the reclaim path.
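
To make the loop order above concrete, here is a toy user-space model. It is a sketch of the approximate behavior described in this section, not the kernel's implementation: `struct node`, `toy_alloc_page()`, and the two-zone layout are hypothetical simplifications. The outer loop walks only the zones present on the local node (highest compatible zone first), and the inner loop walks nodes in fallback order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified two-zone model; real systems have more zones. */
enum zone { ZONE_NORMAL = 0, ZONE_MOVABLE = 1, NR_ZONES = 2 };

struct node {
    bool present[NR_ZONES];    /* does this node have the zone at all? */
    long free_pages[NR_ZONES]; /* free pages remaining in each zone */
};

struct alloc_result {
    int node; /* -1 when the allocation fails */
    int zone;
};

/*
 * fallback_order[0] is the local node. highest_zone is the highest zone
 * compatible with the request (ZONE_MOVABLE for movable allocations,
 * ZONE_NORMAL for kernel allocations in this model).
 */
static struct alloc_result toy_alloc_page(struct node *nodes, size_t nr_nodes,
                                          const int *fallback_order,
                                          enum zone highest_zone)
{
    struct alloc_result fail = { -1, -1 };
    struct node *local;

    assert(nr_nodes > 0);
    local = &nodes[fallback_order[0]];

    for (int z = (int)highest_zone; z >= 0; z--) {
        if (!local->present[z])
            continue; /* zones missing on the local node are skipped */
        for (size_t i = 0; i < nr_nodes; i++) {
            struct node *n = &nodes[fallback_order[i]];

            if (n->present[z] && n->free_pages[z] > 0) {
                struct alloc_result ok = { fallback_order[i], z };

                n->free_pages[z]--;
                return ok;
            }
        }
    }
    return fail;
}
```

With node 0 modeled as DRAM holding only `ZONE_NORMAL` and node 1 as CXL holding only `ZONE_MOVABLE`, a movable request is served from node 0 until it is empty and then fails, even though node 1 still has free pages. In this model the CXL node is never reached by direct allocation.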

		This behavior also means that if the DRAM node has :code:`ZONE_MOVABLE`
		capacity, then once that capacity is depleted the page allocator will actually
		prefer CXL :code:`ZONE_MOVABLE` pages over DRAM :code:`ZONE_NORMAL` pages.

		We may wish to invert this priority in future Linux versions.

		If `demotion` and `swap` are disabled, Linux will begin to cause OOM crashes
		when the DRAM nodes are depleted. See the reclaim section for more details.


		CGroups and CPUSets
		===================
		Finally, assuming CXL memory is reachable via the page allocator (i.e. onlined
		in :code:`ZONE_NORMAL`), the :code:`cpusets.mems_allowed` may be used by
		containers to limit the accessibility of certain NUMA nodes for tasks in that
		container. Users may wish to utilize this in multi-tenant systems where some
		tasks prefer not to use slower memory.

		In the reclaim section we'll discuss some limitations of this interface to
		prevent demotions of shared data to CXL memory (if demotions are enabled).

Documentation/driver-api/cxl/allocation/reclaim.rst

0 → 100644

+51 −0

		.. SPDX-License-Identifier: GPL-2.0

		=======
		Reclaim
		=======
		Another way CXL memory can be utilized indirectly is via the reclaim system
		in :code:`mm/vmscan.c`. Reclaim is engaged when memory capacity on the system
		becomes pressured based on global and cgroup-local `watermark` settings.

		In this section we won't discuss the `watermark` configurations, just how CXL
		memory can be consumed by various pieces of the reclaim system.

		Demotion
		========
		By default, the reclaim system will prefer swap (or zswap) when reclaiming
		memory. Enabling :code:`/sys/kernel/mm/numa/demotion_enabled` will cause vmscan
		to opportunistically prefer distant NUMA nodes to swap or zswap, if capacity
		is available.
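
A program that depends on demotion may want to check whether it is enabled. The sketch below is not part of the patch: the helper names are hypothetical, and it assumes the knob is exposed at `/sys/kernel/mm/numa/demotion_enabled` and reads back as `true` or `false` (it also accepts `1`/`0` to be tolerant).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed sysfs location of the demotion knob. */
#define DEMOTION_KNOB "/sys/kernel/mm/numa/demotion_enabled"

/* Returns 1 if enabled, 0 if disabled, -1 if the text is not recognized. */
static int parse_bool_knob(const char *text)
{
    assert(text != NULL);
    if (strncmp(text, "true", 4) == 0 || text[0] == '1')
        return 1;
    if (strncmp(text, "false", 5) == 0 || text[0] == '0')
        return 0;
    return -1;
}

/* Returns 1/0 as above, or -1 if the file cannot be read or parsed. */
static int read_demotion_enabled(const char *path)
{
    char buf[16] = "";
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_bool_knob(buf);
}
```

Calling `read_demotion_enabled(DEMOTION_KNOB)` returns -1 on kernels or configurations where the file is absent, which a caller can treat as "demotion unavailable".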

		Demotion engages the :code:`mm/memory_tier.c` component to determine the
		next demotion node. The next demotion node is based on the :code:`HMAT`
		or :code:`CDAT` performance data.

		cpusets.mems_allowed quirk
		--------------------------
		In Linux v6.15 and below, demotion does not respect :code:`cpusets.mems_allowed`
		when migrating pages. As a result, if demotion is enabled, vmscan cannot
		guarantee isolation of a container's memory from nodes not set in mems_allowed.

		In Linux v6.XX and up, demotion does attempt to respect
		:code:`cpusets.mems_allowed`; however, certain classes of shared memory
		originally instantiated by another cgroup (such as common libraries - e.g.
		libc) may still be demoted. As a result, the mems_allowed interface still
		cannot provide perfect isolation from the remote nodes.

		ZSwap and Node Preference
		=========================
		In Linux v6.15 and below, ZSwap allocates memory from the local node of the
		processor for the new pages being compressed. Since pages being compressed
		are typically cold, the result is that a cold page is effectively promoted,
		only to be demoted later as it ages off the LRU.

		In Linux v6.XX, ZSwap tries to prefer the node of the page being compressed
		as the allocation target for the compression page. This helps prevent
		thrashing.

		Demotion with ZSwap
		===================
		When enabling both Demotion and ZSwap, you create a situation where ZSwap
		will prefer the slowest form of CXL memory by default until that tier of
		memory is exhausted.