Commit 0ccff074 authored by Linus Torvalds
Pull fwctl subsystem from Jason Gunthorpe:
 "fwctl is a new subsystem intended to bring some common rules and order
  to the growing pattern of exposing a secure FW interface directly to
  userspace.

  Unlike existing places like RDMA/DRM/VFIO/uacce that are exposing a
  device for datapath operations fwctl is focused on debugging,
  configuration and provisioning of the device. It will not have the
  necessary features like interrupt delivery to support a datapath.

  This concept is similar to the long standing practice in the "HW" RAID
  space of having a device specific misc device to manage the RAID
  controller FW. fwctl generalizes this notion of a companion debug and
  management interface that goes along with a dataplane implemented in
  an appropriate subsystem.

  There have been three LWN articles written discussing various aspects
  of this:

	https://lwn.net/Articles/955001/
	https://lwn.net/Articles/969383/
	https://lwn.net/Articles/990802/

  This includes three drivers to launch the subsystem:

   - CXL provides a vendor scheme for executing commands and a way to
     learn the 'command effects' (ie the security properties) of such
     commands. The fwctl driver allows access to these mechanisms within
     the fwctl security model

   - mlx5 is a family of networking products, the driver supports all
     current Mellanox HW still receiving FW feature updates. This
     includes RDMA multiprotocol NICs like ConnectX and the Bluefield
     family of Smart NICs.

   - AMD/Pensando Distributed Services card is a multi protocol Smart
     NIC with a multi PCI function design. fwctl works on the management
     PCI function following a 'command effects' model similar to CXL"

* tag 'for-linus-fwctl' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (30 commits)
  pds_fwctl: add Documentation entries
  pds_fwctl: add rpc and query support
  pds_fwctl: initial driver framework
  pds_core: add new fwctl auxiliary_device
  pds_core: specify auxiliary_device to be created
  pds_core: make pdsc_auxbus_dev_del() void
  cxl: Fixup kdoc issues for include/cxl/features.h
  fwctl/cxl: Add documentation to FWCTL CXL
  cxl/test: Add Set Feature support to cxl_test
  cxl/test: Add Get Feature support to cxl_test
  cxl: Add support to handle user feature commands for set feature
  cxl: Add support to handle user feature commands for get feature
  cxl: Add support for fwctl RPC command to enable CXL feature commands
  cxl: Move cxl feature command structs to user header
  cxl: Add FWCTL support to CXL
  mlx5: Create an auxiliary device for fwctl_mlx5
  fwctl/mlx5: Support for communicating with mlx5 fw
  fwctl: Add documentation
  fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware
  taint: Add TAINT_FWCTL
  ...
parents e5e0e6be 40325707
+5 −0
@@ -101,6 +101,7 @@ Bit Log Number Reason that got the kernel tainted
 16  _/X   65536  auxiliary taint, defined for and used by distros
 17  _/T  131072  kernel was built with the struct randomization plugin
 18  _/N  262144  an in-kernel test has been run
 19  _/J  524288  userspace used a mutating debug operation in fwctl
===  ===  ======  ========================================================

Note: The character ``_`` is representing a blank in this table to make reading
@@ -184,3 +185,7 @@ More detailed explanation for tainting
     build time.

 18) ``N`` if an in-kernel test, such as a KUnit test, has been run.

 19) ``J`` if userspace opened /dev/fwctl/* and performed a FWCTL_RPC_DEBUG_WRITE
     to use the device's debugging features. Device debugging features could
     cause the device to malfunction in undefined ways.
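
A monitoring tool can test for this flag by reading the taint bitmask from
``/proc/sys/kernel/tainted`` and checking bit 19 (value 524288 in the table
above). The following is a minimal sketch; the helper and macro names are
illustrative and are not part of any kernel API.

```c
#include <assert.h>
#include <stdio.h>

/* Bit 19 ('J') per the taint table: userspace used a mutating fwctl debug op. */
#define EX_TAINT_FWCTL_BIT 19

static_assert((1ULL << EX_TAINT_FWCTL_BIT) == 524288ULL,
              "bit 19 must match the value listed in the taint table");

/* Return 1 if the fwctl taint bit is set in a taint bitmask, else 0. */
static int ex_taint_mask_has_fwctl(unsigned long long mask)
{
        return (int)((mask >> EX_TAINT_FWCTL_BIT) & 1ULL);
}

/*
 * Read the current bitmask from /proc/sys/kernel/tainted.
 * Returns 1 or 0 for the 'J' flag, or -1 if the file cannot be read.
 */
static int ex_kernel_tainted_by_fwctl(void)
{
        unsigned long long mask = 0;
        FILE *f = fopen("/proc/sys/kernel/tainted", "r");
        int ok;

        if (!f)
                return -1;
        ok = (fscanf(f, "%llu", &mask) == 1);
        fclose(f);
        return ok ? ex_taint_mask_has_fwctl(mask) : -1;
}
```

Calling ``ex_kernel_tainted_by_fwctl()`` before and after running a debugging
tool shows whether that tool caused the taint.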
+142 −0
.. SPDX-License-Identifier: GPL-2.0

================
fwctl cxl driver
================

:Author: Dave Jiang

Overview
========

The CXL spec defines a set of commands that can be issued to the mailbox of a
CXL device or switch. It also left room for vendor specific commands to be
issued to the mailbox as well. fwctl provides a path to issue a set of allowed
mailbox commands from user space to the device moderated by the kernel driver.

The following 3 commands will be used to support CXL Features:

- CXL spec r3.1 8.2.9.6.1 Get Supported Features (Opcode 0500h)
- CXL spec r3.1 8.2.9.6.2 Get Feature (Opcode 0501h)
- CXL spec r3.1 8.2.9.6.3 Set Feature (Opcode 0502h)

The "Get Supported Features" return data may be filtered by the kernel driver to
drop any features that are forbidden by the kernel or being exclusively used by
the kernel. The driver will set the "Set Feature Size" of the "Get Supported
Features Supported Feature Entry" to 0 to indicate that the Feature cannot be
modified. The "Get Supported Features" and "Get Feature" commands fall under
the fwctl policy of FWCTL_RPC_CONFIGURATION.

For the "Set Feature" command, the access policy is currently broken down into
two categories depending on the Set Feature effects reported by the device. If
the Set Feature will cause an immediate change to the device, the fwctl access policy
must be FWCTL_RPC_DEBUG_WRITE_FULL. The effects for this level are
"immediate config change", "immediate data change", "immediate policy change",
or "immediate log change" for the set effects mask. If the effects are "config
change with cold reset" or "config change with conventional reset", then the
fwctl access policy must be FWCTL_RPC_DEBUG_WRITE or higher.
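
The two categories can be modelled as a small lookup from the effects mask to
the minimum scope. This is a sketch of the rule as described here, not the
driver's code: the ``ex_`` names are hypothetical, the scope enum is a local
stand-in for ``enum fwctl_rpc_scope``, and the bit positions are assumptions
that should be checked against the CXL spec.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the fwctl scopes, ordered least to most privileged. */
enum ex_scope {
        EX_SCOPE_CONFIGURATION,
        EX_SCOPE_DEBUG_READ_ONLY,
        EX_SCOPE_DEBUG_WRITE,
        EX_SCOPE_DEBUG_WRITE_FULL,
};

static_assert(EX_SCOPE_DEBUG_WRITE < EX_SCOPE_DEBUG_WRITE_FULL,
              "scope comparison below relies on this ordering");

/* Assumed effect bit positions; verify against the CXL spec before use. */
#define EX_EFF_CONFIG_COLD_RESET  (1u << 0)
#define EX_EFF_IMMEDIATE_CONFIG   (1u << 1)
#define EX_EFF_IMMEDIATE_DATA     (1u << 2)
#define EX_EFF_IMMEDIATE_POLICY   (1u << 3)
#define EX_EFF_IMMEDIATE_LOG      (1u << 4)
#define EX_EFF_CONFIG_CONV_RESET  (1u << 10)

#define EX_EFF_IMMEDIATE_MASK (EX_EFF_IMMEDIATE_CONFIG | EX_EFF_IMMEDIATE_DATA | \
                               EX_EFF_IMMEDIATE_POLICY | EX_EFF_IMMEDIATE_LOG)
#define EX_EFF_RESET_MASK     (EX_EFF_CONFIG_COLD_RESET | EX_EFF_CONFIG_CONV_RESET)

/*
 * Minimum scope for a Set Feature with the given effects, or -1 when the
 * effects match neither category described in the text.
 */
static int ex_set_feature_min_scope(uint16_t effects)
{
        if (effects & EX_EFF_IMMEDIATE_MASK)
                return EX_SCOPE_DEBUG_WRITE_FULL;
        if (effects & EX_EFF_RESET_MASK)
                return EX_SCOPE_DEBUG_WRITE;
        return -1;
}

/* Return 1 if an RPC issued at 'requested' scope may perform the Set Feature. */
static int ex_set_feature_allowed(uint16_t effects, enum ex_scope requested)
{
        int min_scope = ex_set_feature_min_scope(effects);

        return min_scope >= 0 && (int)requested >= min_scope;
}
```

An immediate effect takes precedence here, so a mask that combines an
immediate change with a reset-time change still requires the full debug write
scope.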

fwctl cxl User API
==================

.. kernel-doc:: include/uapi/fwctl/cxl.h

1. Driver info query
--------------------

The first step for the application is to issue ioctl(FWCTL_CMD_INFO) with a
``struct fwctl_info`` filled out and the ``fwctl_info.out_device_type`` set to
``FWCTL_DEVICE_TYPE_CXL``. Successful invocation of the ioctl implies the
Features capability is operational. The return data should be a
``struct fwctl_info_cxl``, which contains a single reserved 32-bit field that
should be all zeros.
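
A probe along these lines might look like the sketch below. Several things in
it are assumptions: the request macro is taken to be ``FWCTL_INFO``, the
fallback definitions are stand-ins believed to mirror
``include/uapi/fwctl/fwctl.h`` and are used only when that header is not
installed, ``out_device_type`` is treated as a value reported back by the
kernel, and ``ex_info_cxl`` and ``ex_cxl_fwctl_probe`` are hypothetical names.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Prefer the real uAPI header when the toolchain can find it. */
#if defined(__has_include)
#if __has_include(<fwctl/fwctl.h>)
#include <fwctl/fwctl.h>
#define EX_HAVE_FWCTL_UAPI 1
#endif
#endif

#ifndef EX_HAVE_FWCTL_UAPI
/* Stand-ins, assumed to match the uAPI header; verify before relying on them. */
struct fwctl_info {
        uint32_t size;
        uint32_t flags;
        uint32_t out_device_type;
        uint32_t device_data_len;
        uint64_t out_device_data;
};
#define FWCTL_DEVICE_TYPE_CXL 2
#define FWCTL_INFO _IO(0x9A, 0)
#endif

/* Device-specific payload as described above: one reserved 32-bit field. */
struct ex_info_cxl {
        uint32_t reserved;
};

/*
 * Return 0 if 'fd' refers to a CXL fwctl device whose info query succeeds,
 * otherwise a negative errno value.
 */
static int ex_cxl_fwctl_probe(int fd)
{
        struct ex_info_cxl cxl_info = { 0 };
        struct fwctl_info info;

        memset(&info, 0, sizeof(info));
        info.size = sizeof(info);
        info.device_data_len = sizeof(cxl_info);
        info.out_device_data = (uint64_t)(uintptr_t)&cxl_info;

        if (ioctl(fd, FWCTL_INFO, &info) == -1)
                return -errno;
        if (info.out_device_type != FWCTL_DEVICE_TYPE_CXL)
                return -ENODEV;
        if (cxl_info.reserved != 0)
                return -EIO;
        return 0;
}
```

The file descriptor would come from opening a node such as
``/dev/fwctl/fwctl0``; an invalid descriptor makes the helper return
``-EBADF``.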

2. Send hardware commands
-------------------------

Next step is to send the 'Get Supported Features' command to the driver from
user space via ioctl(FWCTL_RPC). A ``struct fwctl_rpc_cxl`` is pointed to
by ``fwctl_rpc.in``. ``struct fwctl_rpc_cxl.in_payload`` points to
the hardware input structure that is defined by the CXL spec. ``fwctl_rpc.out``
points to the buffer that contains a ``struct fwctl_rpc_cxl_out`` that includes
the hardware output data inlined as ``fwctl_rpc_cxl_out.payload``. This command
is called twice: the first time to retrieve the number of features supported,
and the second time to retrieve the specific feature details as the output data.
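
The sizing arithmetic behind the two calls can be sketched as follows. The
layout constants are assumptions taken from a reading of the CXL spec (an
8-byte output header carrying the device's total feature count as a
little-endian 16-bit value at offset 2, followed by 48-byte feature entries);
the ``ex_`` helpers are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed "Get Supported Features" output layout; check against the spec. */
#define EX_SUP_FEATS_HDR_SIZE     8u
#define EX_SUP_FEATS_TOTAL_OFFSET 2u
#define EX_FEAT_ENTRY_SIZE        48u

/* Decode a little-endian 16-bit value from a byte buffer. */
static uint16_t ex_get_le16(const uint8_t *p)
{
        return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

/* First call: read the total number of supported features from the header. */
static uint16_t ex_sup_feats_total(const uint8_t *hdr)
{
        assert(hdr != NULL);
        return ex_get_le16(hdr + EX_SUP_FEATS_TOTAL_OFFSET);
}

/* Second call: payload bytes needed to receive 'num_entries' feature entries. */
static size_t ex_sup_feats_payload_size(uint16_t num_entries)
{
        return EX_SUP_FEATS_HDR_SIZE + (size_t)num_entries * EX_FEAT_ENTRY_SIZE;
}
```

The output buffer passed in ``fwctl_rpc.out`` additionally has to hold the
``struct fwctl_rpc_cxl_out`` fields that precede the payload.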

After getting the specific feature details, a Get/Set Feature command can be
appropriately programmed and sent. For a "Set Feature" command, the retrieved
feature info contains an effects field that details the effects the resulting
"Set Feature" command will trigger. That informs the user whether the system
is configured to allow the "Set Feature" command or not.

Code example of a Get Feature
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

        static int cxl_fwctl_rpc_get_test_feature(int fd, struct test_feature *feat_ctx,
                                                  const uint32_t expected_data)
        {
                struct cxl_mbox_get_feat_in *feat_in;
                struct fwctl_rpc_cxl_out *out;
                struct fwctl_rpc rpc = {0};
                struct fwctl_rpc_cxl *in;
                size_t out_size, in_size;
                uint32_t val;
                void *data;
                int rc;

                in_size = sizeof(*in) + sizeof(*feat_in);
                rc = posix_memalign((void **)&in, 16, in_size);
                if (rc)
                        return -ENOMEM;
                memset(in, 0, in_size);
                feat_in = &in->get_feat_in;

                uuid_copy(feat_in->uuid, feat_ctx->uuid);
                feat_in->count = feat_ctx->get_size;

                out_size = sizeof(*out) + feat_ctx->get_size;
                rc = posix_memalign((void **)&out, 16, out_size);
                if (rc) {
                        /* Match the first allocation: report a negative errno */
                        rc = -ENOMEM;
                        goto free_in;
                }
                memset(out, 0, out_size);

                in->opcode = CXL_MBOX_OPCODE_GET_FEATURE;
                in->op_size = sizeof(*feat_in);

                rpc.size = sizeof(rpc);
                rpc.scope = FWCTL_RPC_CONFIGURATION;
                rpc.in_len = in_size;
                rpc.out_len = out_size;
                rpc.in = (uint64_t)(uint64_t *)in;
                rpc.out = (uint64_t)(uint64_t *)out;

                rc = send_command(fd, &rpc, out);
                if (rc)
                        goto free_all;

                data = out->payload;
                val = le32toh(*(__le32 *)data);
                if (memcmp(&val, &expected_data, sizeof(val)) != 0) {
                        rc = -ENXIO;
                        goto free_all;
                }

        free_all:
                free(out);
        free_in:
                free(in);
                return rc;
        }

See the CXL CLI test directory
<https://github.com/pmem/ndctl/tree/main/test/fwctl.c> for detailed user code
examples of how to exercise this path.
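
The example above calls a ``send_command()`` helper that is not shown. One
plausible shape for it is sketched below as ``ex_send_command()``. The
fallback ``struct fwctl_rpc`` and ``FWCTL_RPC`` definitions are stand-ins
assumed to mirror ``include/uapi/fwctl/fwctl.h``, and the output header with a
``retval`` field is an assumption about the start of
``struct fwctl_rpc_cxl_out``.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Prefer the real uAPI header when the toolchain can find it. */
#if defined(__has_include)
#if __has_include(<fwctl/fwctl.h>)
#include <fwctl/fwctl.h>
#define EX_HAVE_FWCTL_UAPI 1
#endif
#endif

#ifndef EX_HAVE_FWCTL_UAPI
/* Stand-ins, assumed to match the uAPI header; verify before relying on them. */
struct fwctl_rpc {
        uint32_t size;
        uint32_t scope;
        uint32_t in_len;
        uint32_t out_len;
        uint64_t in;
        uint64_t out;
};
#define FWCTL_RPC _IO(0x9A, 1)
#endif

/* Assumed leading fields of the CXL RPC output buffer. */
struct ex_rpc_cxl_out_hdr {
        uint32_t size;
        uint32_t retval;
};

/*
 * Issue the RPC and fold the result into one return code: a negative errno
 * if the ioctl fails, -ENXIO if the device reports a non-zero return value,
 * 0 otherwise.
 */
static int ex_send_command(int fd, struct fwctl_rpc *rpc, const void *out)
{
        const struct ex_rpc_cxl_out_hdr *hdr = out;

        if (ioctl(fd, FWCTL_RPC, rpc) == -1)
                return -errno;
        if (hdr->retval)
                return -ENXIO;
        return 0;
}
```

Checking the device return value separately from the ioctl result matters
because the ioctl can succeed while the mailbox command itself is rejected.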


fwctl cxl Kernel API
====================

.. kernel-doc:: drivers/cxl/core/features.c
   :export:
.. kernel-doc:: include/cxl/features.h
+286 −0
.. SPDX-License-Identifier: GPL-2.0

===============
fwctl subsystem
===============

:Author: Jason Gunthorpe

Overview
========

Modern devices contain extensive amounts of FW, and in many cases, are largely
software-defined pieces of hardware. The evolution of this approach is largely a
reaction to Moore's Law where a chip tape out is now highly expensive, and the
chip design is extremely large. Replacing fixed HW logic with a flexible and
tightly coupled FW/HW combination is an effective risk mitigation against chip
respin. Problems in the HW design can be counteracted in device FW. This is
especially true for devices which present a stable and backwards compatible
interface to the operating system driver (such as NVMe).

The FW layer in devices has grown to incredible size and devices frequently
integrate clusters of fast processors to run it. For example, mlx5 devices have
over 30MB of FW code, and big configurations operate with over 1GB of FW managed
runtime state.

The availability of such a flexible layer has created quite a variety in the
industry where single pieces of silicon are now configurable software-defined
devices and can operate in substantially different ways depending on the need.
Further, we often see cases where specific sites wish to operate devices in ways
that are highly specialized and require applications that have been tailored to
their unique configuration.

Further, devices have become multi-functional and integrated to the point they
no longer fit neatly into the kernel's division of subsystems. Modern
multi-functional devices have drivers, such as bnxt/ice/mlx5/pds, that span many
subsystems while sharing the underlying hardware using the auxiliary device
system.

All together this creates a challenge for the operating system, where devices
have an expansive FW environment that needs robust device-specific debugging
support, and FW-driven functionality that is not well suited to “generic”
interfaces. fwctl seeks to allow access to the full device functionality from
user space in the areas of debuggability, management, and first-boot/nth-boot
provisioning.

fwctl is aimed at the common device design pattern where the OS and FW
communicate via an RPC message layer constructed with a queue or mailbox scheme.
In this case the driver will typically have some layer to deliver RPC messages
and collect RPC responses from device FW. The in-kernel subsystem drivers that
operate the device for its primary purposes will use these RPCs to build their
drivers, but devices also usually have a set of ancillary RPCs that don't really
fit into any specific subsystem. For example, a HW RAID controller is primarily
operated by the block layer but also comes with a set of RPCs to administer the
construction of drives within the HW RAID.

In the past when devices were more single function, individual subsystems would
grow different approaches to solving some of these common problems. For instance
monitoring device health, manipulating its FLASH, debugging the FW,
provisioning, all have various unique interfaces across the kernel.

fwctl's purpose is to define a common set of limited rules, described below,
that allow user space to securely construct and execute RPCs inside device FW.
The rules serve as an agreement between the operating system and FW on how to
correctly design the RPC interface. As a uAPI the subsystem provides a thin
layer of discovery and a generic uAPI to deliver the RPCs and collect the
response. It supports a system of user space libraries and tools which will
use this interface to control the device using the device native protocols.

Scope of Action
---------------

fwctl drivers are strictly restricted to being a way to operate the device FW.
It is not an avenue to access random kernel internals, or other operating system
SW states.

fwctl instances must operate on a well-defined device function, and the device
should have a well-defined security model for what scope within the physical
device the function is permitted to access. For instance, the most complex PCIe
device today may broadly have several function-level scopes:

 1. A privileged function with full access to the on-device global state and
    configuration

 2. Multiple hypervisor functions, each with control over itself and the child
    functions used with VMs

 3. Multiple VM functions tightly scoped within the VM

The device may create a logical parent/child relationship between these scopes.
For instance a child VM's FW may be within the scope of the hypervisor FW. It is
quite common in the VFIO world that the hypervisor environment has a complex
provisioning/profiling/configuration responsibility for the function VFIO
assigns to the VM.

Further, within the function, devices often have RPC commands that fall within
some general scopes of action (see enum fwctl_rpc_scope):

 1. Access to function & child configuration, FLASH, etc. that becomes live at a
    function reset. Access to function & child runtime configuration that is
    transparent or non-disruptive to any driver or VM.

 2. Read-only access to function debug information that may report on FW objects
    in the function & child, including FW objects owned by other kernel
    subsystems.

 3. Write access to function & child debug information strictly compatible with
    the principles of kernel lockdown and kernel integrity protection. Triggers
    a kernel Taint.

 4. Full debug device access. Triggers a kernel Taint, requires CAP_SYS_RAWIO.

User space will provide a scope label on each RPC and the kernel must enforce the
above CAPs and taints based on that scope. A combination of kernel and FW can
enforce that RPCs are placed in the correct scope by user space.
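
The taint and capability consequences of the four scopes can be summarised in
a small user space model. It only restates the list above and is not kernel
code: the enum is a local stand-in for ``enum fwctl_rpc_scope`` and every
``ex_`` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the four scopes, in the order listed above. */
enum ex_scope {
        EX_SCOPE_CONFIGURATION,
        EX_SCOPE_DEBUG_READ_ONLY,
        EX_SCOPE_DEBUG_WRITE,
        EX_SCOPE_DEBUG_WRITE_FULL,
};

static_assert(EX_SCOPE_DEBUG_WRITE_FULL == 3, "four scopes are modelled");

struct ex_scope_policy {
        bool taints_kernel;        /* executing the RPC sets a kernel taint */
        bool needs_cap_sys_rawio;  /* caller must hold CAP_SYS_RAWIO */
};

/* Map a scope label to the consequences the kernel must enforce. */
static struct ex_scope_policy ex_policy_for(enum ex_scope scope)
{
        struct ex_scope_policy p;

        p.taints_kernel = (scope >= EX_SCOPE_DEBUG_WRITE);
        p.needs_cap_sys_rawio = (scope == EX_SCOPE_DEBUG_WRITE_FULL);
        return p;
}

/* Return true if a caller with the given capability may use the scope. */
static bool ex_rpc_permitted(enum ex_scope scope, bool caller_has_rawio)
{
        struct ex_scope_policy p = ex_policy_for(scope);

        return !p.needs_cap_sys_rawio || caller_has_rawio;
}
```

Only the two write scopes taint the kernel, and only the full debug scope adds
the capability check.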

Denied behavior
---------------

There are many things this interface must not allow user space to do (without a
Taint or CAP), broadly derived from the principles of kernel lockdown. Some
examples:

 1. DMA to/from arbitrary memory, hang the system, compromise FW integrity with
    untrusted code, or otherwise compromise device or system security and
    integrity.

 2. Provide an abnormal “back door” to kernel drivers. No manipulation of kernel
    objects owned by kernel drivers.

 3. Directly configure or otherwise control kernel drivers. A subsystem kernel
    driver can react to the device configuration at function reset/driver load
    time, but otherwise must not be coupled to fwctl.

 4. Operate the HW in a way that overlaps with the core purpose of another
    primary kernel subsystem, such as read/write to LBAs, send/receive of
    network packets, or operate an accelerator's data plane.

fwctl is not a replacement for device direct access subsystems like uacce or
VFIO.

Operations exposed through fwctl's non-tainting interfaces should be fully
sharable with other users of the device. For instance exposing an RPC through
fwctl should never prevent a kernel subsystem from also concurrently using that
same RPC or hardware unit down the road. In such cases fwctl will be less
important than proper kernel subsystems that eventually emerge. Mistakes in this
area resulting in clashes will be resolved in favour of a kernel implementation.

fwctl User API
==============

.. kernel-doc:: include/uapi/fwctl/fwctl.h
.. kernel-doc:: include/uapi/fwctl/mlx5.h
.. kernel-doc:: include/uapi/fwctl/pds.h

sysfs Class
-----------

fwctl has a sysfs class (/sys/class/fwctl/fwctlNN/) and character devices
(/dev/fwctl/fwctlNN) with a simple numbered scheme. The character device
operates the ioctl uAPI described above.

fwctl devices can be related to driver components in other subsystems through
sysfs::

    $ ls /sys/class/fwctl/fwctl0/device/infiniband/
    ibp0s10f0

    $ ls /sys/class/infiniband/ibp0s10f0/device/fwctl/
    fwctl0/

    $ ls /sys/devices/pci0000:00/0000:00:0a.0/fwctl/fwctl0
    dev  device  power  subsystem  uevent
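
The numbered naming scheme makes it straightforward to go from a class entry
to its character device. The sketch below assumes the ``/sys/class/fwctl`` and
``/dev/fwctl`` layout described in this section; the ``ex_`` helpers are
illustrative names.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "/dev/fwctl/<name>" for a class entry such as "fwctl0".
 * Returns 'buf' on success or NULL if the path does not fit.
 */
static const char *ex_fwctl_dev_path(const char *name, char *buf, size_t len)
{
        int n = snprintf(buf, len, "/dev/fwctl/%s", name);

        return (n < 0 || (size_t)n >= len) ? NULL : buf;
}

/*
 * Print every fwctl class device together with its character device path.
 * Returns the number of devices found, or -1 if the class directory is absent.
 */
static int ex_list_fwctl_devices(void)
{
        DIR *dir = opendir("/sys/class/fwctl");
        struct dirent *ent;
        char path[300];
        int count = 0;

        if (!dir)
                return -1;
        while ((ent = readdir(dir)) != NULL) {
                /* Skip ".", ".." and anything not following the fwctlNN scheme */
                if (strncmp(ent->d_name, "fwctl", 5) != 0)
                        continue;
                if (ex_fwctl_dev_path(ent->d_name, path, sizeof(path))) {
                        printf("%s -> %s\n", ent->d_name, path);
                        count++;
                }
        }
        closedir(dir);
        return count;
}
```

On a system without any fwctl driver loaded, ``ex_list_fwctl_devices()`` is
expected to return -1 because the class directory does not exist.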

User space Community
--------------------

Drawing inspiration from nvme-cli, participating in the kernel side must come
with a user space in a common TBD git tree, at a minimum to usefully operate the
kernel driver. Providing such an implementation is a pre-condition to merging a
kernel driver.

The goal is to build a user space community around some of the shared problems
we all have, and ideally develop some common user space programs with some
starting themes of:

 - Device in-field debugging

 - HW provisioning

 - VFIO child device profiling before VM boot

 - Confidential Compute topics (attestation, secure provisioning)

that stretch across all subsystems in the kernel. fwupd is a great example of
how an excellent user space experience can emerge out of kernel-side diversity.

fwctl Kernel API
================

.. kernel-doc:: drivers/fwctl/main.c
   :export:
.. kernel-doc:: include/linux/fwctl.h

fwctl Driver design
-------------------

In many cases a fwctl driver is going to be part of a larger cross-subsystem
device possibly using the auxiliary_device mechanism. In that case several
subsystems are going to be sharing the same device and FW interface layer so the
device design must already provide for isolation and cooperation between kernel
subsystems. fwctl should fit into that same model.

Part of the driver should include a description of how its scope restrictions
and security model work. The driver and FW together must ensure that RPCs
provided by user space are mapped to the appropriate scope. If the validation is
done in the driver then the validation can read a 'command effects' report from
the device, or hardwire the enforcement. If the validation is done in the FW,
then the driver should pass the fwctl_rpc_scope to the FW along with the command.

The driver and FW must cooperate to ensure that either fwctl cannot allocate
any FW resources, or any resources it does allocate are freed on FD closure.  A
driver primarily constructed around FW RPCs may find that its core PCI function
and RPC layer belongs under fwctl with auxiliary devices connecting to other
subsystems.

Each device type must be mindful of Linux's philosophy for stable ABI. The FW
RPC interface does not have to meet a strictly stable ABI, but it does need to
meet an expectation that userspace tools that are deployed and in significant
use don't needlessly break. FW upgrade and kernel upgrade should keep widely
deployed tooling working.

Development and debugging focused RPCs under more permissive scopes can have
less stability if the tools using them are only run under exceptional
circumstances and not for every day use of the device. Debugging tools may even
require exact version matching as they may require something similar to DWARF
debug information from the FW binary.

Security Response
=================

The kernel remains the gatekeeper for this interface. If violations of the
scopes, security or isolation principles are found, we have options to let
devices fix them with a FW update, push a kernel patch to parse and block RPC
commands or push a kernel patch to block entire firmware versions/devices.

While the kernel can always directly parse and restrict RPCs, it is expected
that the existing kernel pattern of allowing drivers to delegate validation to
FW will be a useful design.

Existing Similar Examples
=========================

The approach described in this document is not a new idea. Direct, or near
direct device access has been offered by the kernel in different areas for
decades. With more devices wanting to follow this design pattern it is becoming
clear that it is not entirely well understood and, more importantly, the
security considerations are not well defined or agreed upon.

Some examples:

 - HW RAID controllers. This includes RPCs to do things like compose drives into
   a RAID volume, configure RAID parameters, monitor the HW and more.

 - Baseboard managers. RPCs for configuring settings in the device and more

 - NVMe vendor command capsules. nvme-cli provides access to some monitoring
   functions that different products have defined, but more exist.

 - CXL also has an NVMe-like vendor command system.

 - DRM allows user space drivers to send commands to the device via kernel
   mediation

 - RDMA allows user space drivers to directly push commands to the device
   without kernel involvement

 - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc.

The first 4 are examples of areas that fwctl intends to cover. The latter three
are examples of denied behavior as they fully overlap with the primary purpose
of a kernel subsystem.

A key lesson learned from these past efforts is the importance of having a
common user space project to use as a pre-condition for obtaining a kernel
driver. Developing good community around useful software in user space is key to
getting companies to fund participation to enable their products.
+14 −0
.. SPDX-License-Identifier: GPL-2.0

Firmware Control (FWCTL) Userspace API
======================================

A framework that defines a common set of limited rules that allows user space
to securely construct and execute RPCs inside device firmware.

.. toctree::
   :maxdepth: 1

   fwctl
   fwctl-cxl
   pds_fwctl
+46 −0
.. SPDX-License-Identifier: GPL-2.0

================
fwctl pds driver
================

:Author: Shannon Nelson

Overview
========

The PDS Core device makes a fwctl service available through an
auxiliary_device named pds_core.fwctl.N.  The pds_fwctl driver binds to
this device and registers itself with the fwctl subsystem.  The resulting
userspace interface is used by an application that is a part of the
AMD Pensando software package for the Distributed Service Card (DSC).

The pds_fwctl driver has little knowledge of the firmware's internals.
It only knows how to send commands through pds_core's message queue to the
firmware for fwctl requests.  The set of fwctl operations available
depends on the firmware in the DSC, and the userspace application
version must match the firmware so that they can talk to each other.

When a connection is created the pds_fwctl driver requests from the
firmware a list of firmware object endpoints, and for each endpoint the
driver requests a list of operations for that endpoint.

Each operation description includes a firmware defined command attribute
that maps to the FWCTL scope levels.  The driver translates those firmware
values into the FWCTL scope values which can then be used for filtering the
scoped user requests.

pds_fwctl User API
==================

Each RPC request includes the target endpoint and the operation id, and in
and out buffer lengths and pointers.  The driver verifies the existence
of the requested endpoint and operations, then checks the request scope
against the required scope of the operation.  The request is then put
together with the request data and sent through pds_core's message queue
to the firmware, and the results are returned to the caller.
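
The validation order described above can be modelled in user space as shown
below. Everything in this sketch is hypothetical: the table stands in for the
endpoint and operation lists the driver obtains from firmware, the error codes
are arbitrary choices, and a request is assumed to pass when its scope is at
least the scope the operation requires.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the fwctl scopes, ordered least to most privileged. */
enum ex_scope {
        EX_SCOPE_CONFIGURATION,
        EX_SCOPE_DEBUG_READ_ONLY,
        EX_SCOPE_DEBUG_WRITE,
        EX_SCOPE_DEBUG_WRITE_FULL,
};

struct ex_pds_op {
        uint32_t endpoint;       /* firmware object endpoint */
        uint32_t op_id;          /* operation id within that endpoint */
        enum ex_scope required;  /* scope translated from the firmware attribute */
};

/* Made-up operations; a real list comes from the firmware at connect time. */
static const struct ex_pds_op ex_ops[] = {
        { 1, 10, EX_SCOPE_CONFIGURATION },
        { 1, 11, EX_SCOPE_DEBUG_WRITE },
        { 2, 20, EX_SCOPE_DEBUG_READ_ONLY },
};

/*
 * Check endpoint, then operation, then scope.
 * Returns 0 if the request may be forwarded to firmware.
 */
static int ex_pds_validate(uint32_t endpoint, uint32_t op_id,
                           enum ex_scope requested)
{
        int endpoint_found = 0;
        size_t i;

        assert(sizeof(ex_ops) > 0);
        for (i = 0; i < sizeof(ex_ops) / sizeof(ex_ops[0]); i++) {
                if (ex_ops[i].endpoint != endpoint)
                        continue;
                endpoint_found = 1;
                if (ex_ops[i].op_id != op_id)
                        continue;
                return (int)requested >= (int)ex_ops[i].required ? 0 : -EPERM;
        }
        return endpoint_found ? -EINVAL : -ENOENT;
}
```

Rejecting unknown endpoints and operations before looking at the scope keeps
the scope check limited to operations the firmware has actually advertised.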

The RPC endpoints, operations, and buffer contents are defined by the
particular firmware package in the device, which varies across the
available product configurations.  The details are available in the
specific product SDK documentation.