Commit 9fc75b71 authored May 19, 2026 by Ilya Dryomov

rbd: eliminate a race in lock_dwork draining on unmap

Given how rbd_lock_add_request() and rbd_img_exclusive_lock() are
written, lock_dwork may be (re)queued more often than is actually
needed: for example, when a new I/O request comes in while we are in
the middle of rbd_acquire_lock() on behalf of another I/O request.
This is expected and, with rbd_release_lock() preemptively canceling
lock_dwork, benign under normal operation.

A more problematic example is maybe_kick_acquire():

    if (have_requests || delayed_work_pending(&rbd_dev->lock_dwork)) {
            dout("%s rbd_dev %p kicking lock_dwork\n", __func__, rbd_dev);
            mod_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0);
    }

It's not unrealistic for lock_dwork to get canceled right after
delayed_work_pending() returns true and for mod_delayed_work() to
requeue it right there anyway.  This is a classic TOCTOU race.

When it comes to unmapping the image, there is an implicit assumption
of no self-initiated exclusive lock activity past the point of return
from rbd_dev_image_unlock(), which releases the lock if it happens to
be held.  This unlock is assumed to be final: lock_dwork (as well as
all other exclusive lock tasks, really) isn't expected to get queued
again.  However, lock_dwork is canceled only in cancel_tasks_sync()
(i.e. later in the unmap sequence), and on top of that the cancellation
can in effect get nullified by maybe_kick_acquire().  This may result
in rbd_acquire_lock() executing after rbd_dev_device_release() and
rbd_dev_image_release() have run and freed and/or reset a bunch of
things.
One of the possible failure modes then is a violated

    rbd_assert(rbd_image_format_valid(rbd_dev->image_format));

in rbd_dev_header_info() which is called via rbd_dev_refresh() from
rbd_post_acquire_action().

Redo exclusive lock task draining to provide saner semantics and try
to meet the assumptions around rbd_dev_image_unlock().

Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>

parent 5200f5f4

drivers/block/rbd.c

+8 −12

    @@ -4565,24 +4565,12 @@ static int rbd_register_watch(struct rbd_device *rbd_dev)
     	return ret;
     }
     
    -static void cancel_tasks_sync(struct rbd_device *rbd_dev)
    -{
    -	dout("%s rbd_dev %p\n", __func__, rbd_dev);
    -
    -	cancel_work_sync(&rbd_dev->acquired_lock_work);
    -	cancel_work_sync(&rbd_dev->released_lock_work);
    -	cancel_delayed_work_sync(&rbd_dev->lock_dwork);
    -	cancel_work_sync(&rbd_dev->unlock_work);
    -}
    -
     /*
      * header_rwsem must not be held to avoid a deadlock with
      * rbd_dev_refresh() when flushing notifies.
      */
     static void rbd_unregister_watch(struct rbd_device *rbd_dev)
     {
    -	cancel_tasks_sync(rbd_dev);
    -
     	mutex_lock(&rbd_dev->watch_mutex);
     	if (rbd_dev->watch_state == RBD_WATCH_STATE_REGISTERED)
     		__rbd_unregister_watch(rbd_dev);
    @@ -6548,10 +6536,18 @@ static int rbd_add_parse_args(const char *buf,
     
     static void rbd_dev_image_unlock(struct rbd_device *rbd_dev)
     {
    +	dout("%s rbd_dev %p\n", __func__, rbd_dev);
    +
    +	disable_delayed_work_sync(&rbd_dev->lock_dwork);
    +	disable_work_sync(&rbd_dev->unlock_work);
    +
     	down_write(&rbd_dev->lock_rwsem);
     	if (__rbd_is_lock_owner(rbd_dev))
     		__rbd_release_lock(rbd_dev);
     	up_write(&rbd_dev->lock_rwsem);
    +
    +	flush_work(&rbd_dev->acquired_lock_work);
    +	flush_work(&rbd_dev->released_lock_work);
     }
     
     /*