Commit 9164e4a5 authored by Song Liu

Merge branch 'md-suspend-rewrite' into md-next

From Yu Kuai, written by Song Liu

Recent tests with raid10 revealed many issues with the following scenarios:

- add disks to or remove disks from the array
- issue io to the array

At first, we fixed each problem independently, on the assumption that io
can run concurrently with array reconfiguration. However, as more issues
keep being reported, I am hoping to fix these problems thoroughly.

Refer to how the block layer protects io against queue reconfiguration
(for example, changing the elevator):

blk_mq_freeze_queue
-> wait for all io to be done, and prevent new io from being dispatched
// reconfiguration
blk_mq_unfreeze_queue

I think we can do something similar to synchronize io with array
reconfiguration.

The current synchronization works as follows. For the reconfiguration
operation:

1. Hold 'reconfig_mutex';
2. Check that rdev can be added/removed; one condition is that there is no
   IO (for example, check nr_pending).
3. Do the actual operations to add/remove a rdev; one step is to
   set/clear a pointer to rdev.
4. Check that there is still no IO on this rdev; if there is, revert the
   change.

The IO path uses rcu_read_lock/unlock() to access rdev. This scheme has
several problems:

- rcu is used wrongly;
- There are lots of places where an old rdev can be read; however,
  many of them don't handle the old value correctly;
- Between step 3 and 4, if new io is dispatched, NULL will be read for
  the rdev, and data will be lost if step 4 fails.
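
The last problem can be replayed with a small single-threaded model. This
is only a sketch of one possible interleaving, not the md code: toy_rdev,
toy_slot, toy_io_get(), toy_io_put() and toy_remove_race() are hypothetical
names, and the assumption that step 4 fails because an earlier IO still
holds a reference is one reading of the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a member disk and the slot that points to it. */
struct toy_rdev {
	int nr_pending;   /* IO currently using this disk */
	int writes_seen;  /* writes that actually reached this disk */
};

struct toy_slot {
	struct toy_rdev *rdev;
};

/* IO path: take a reference on whatever rdev the slot points to, if any. */
static struct toy_rdev *toy_io_get(struct toy_slot *slot)
{
	struct toy_rdev *rdev = slot->rdev;

	if (rdev)
		rdev->nr_pending++;
	return rdev;
}

/* IO path: finish a write and drop the reference taken by toy_io_get(). */
static void toy_io_put(struct toy_rdev *rdev)
{
	if (!rdev)
		return;   /* the slot was NULL: this write skipped the disk */
	rdev->writes_seen++;
	rdev->nr_pending--;
}

/*
 * Replay one bad interleaving of the old removal scheme. Returns the number
 * of writes that never reached the disk although the disk is still a member
 * of the array at the end.
 */
int toy_remove_race(void)
{
	struct toy_rdev disk = { 0, 0 };
	struct toy_slot slot = { &disk };
	struct toy_rdev *io_a, *io_b;
	int issued = 0;

	/* Step 2: no IO is pending, so removal looks safe. */
	assert(disk.nr_pending == 0);

	/* IO A slips in after the check and pins the disk. */
	io_a = toy_io_get(&slot);
	issued++;

	/* Step 3: clear the pointer. */
	slot.rdev = NULL;

	/* IO B is dispatched now and reads NULL. */
	io_b = toy_io_get(&slot);
	issued++;

	/* Step 4: IO A is still pending, so the removal is reverted. */
	if (disk.nr_pending != 0)
		slot.rdev = &disk;

	toy_io_put(io_a);
	toy_io_put(io_b);

	assert(slot.rdev == &disk);
	return issued - disk.writes_seen;
}
```

In this model the disk stays in the array, yet one of the two writes never
reached it, which is the kind of loss the list above describes.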

The new synchronization is similar to blk_mq_freeze_queue(). To add or
remove a disk:

1. Suspend the array, that is, stop new IO from being dispatched
   and wait for inflight IO to finish.
2. Add rdevs to or remove rdevs from the array;
3. Resume the array.

The IO path doesn't need to change for now, and all of the rcu handling
can be removed.
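
The three steps can be sketched with a toy, single-threaded model. The
names here (toy_array, toy_io_start(), toy_suspend(), toy_add_disk(), and
so on) are hypothetical and are not the md API. One deliberate
simplification, assumed for the sake of a runnable example: where a real
suspend would sleep until inflight IO drains, this model returns -EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the array state relevant to suspension. */
struct toy_array {
	int suspended;   /* nesting count of suspend calls */
	int active_io;   /* IO currently in flight */
	int nr_disks;
};

/* IO path: refuse new IO while the array is suspended. */
bool toy_io_start(struct toy_array *a)
{
	if (a->suspended)
		return false;
	a->active_io++;
	return true;
}

void toy_io_end(struct toy_array *a)
{
	assert(a->active_io > 0);
	a->active_io--;
}

/*
 * Step 1. A real implementation blocks new IO and then sleeps until the
 * inflight IO is done; this model reports -EBUSY instead of sleeping.
 * A nested suspend only bumps the count.
 */
int toy_suspend(struct toy_array *a)
{
	if (!a->suspended && a->active_io)
		return -EBUSY;
	a->suspended++;
	return 0;
}

/* Step 3. IO may flow again once the last nested suspend is undone. */
void toy_resume(struct toy_array *a)
{
	assert(a->suspended > 0);
	a->suspended--;
}

/* Steps 1-3 wrapped around a disk addition. */
int toy_add_disk(struct toy_array *a)
{
	int err = toy_suspend(a);

	if (err)
		return err;
	a->nr_disks++;   /* step 2: no IO can observe the change in progress */
	toy_resume(a);
	return 0;
}
```

Because no IO can start between toy_suspend() and toy_resume(), the
reconfiguration in step 2 needs no revert path, unlike the old scheme.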

The main work is divided into 3 steps:

First, make sure the new apis to suspend the array are general:

- make sure suspending the array waits for io to be done (done by [1]);
- make sure suspend can be called for all personalities (done by [2]);
- make sure suspend can be called at any time (done by [3]);
- make sure suspend doesn't rely on 'reconfig_mutex' (PATCH 3-5);

Second, replace the old apis with the new apis (PATCH 6-16). Specifically, the
synchronization is changed from:

  lock reconfig_mutex
  suspend array
  make changes
  resume array
  unlock reconfig_mutex

to:

  suspend array
  lock reconfig_mutex
  make changes
  unlock reconfig_mutex
  resume array
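
The new ordering can be illustrated with a pair of combined helpers. This
is a sketch under stated assumptions rather than the md implementation:
toy_mddev and the toy_md_*() functions are hypothetical, the one-letter log
exists only to make the ordering observable, and the fail_lock flag stands
in for an interrupted lock attempt.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical array state; 'log' records the order of operations. */
struct toy_mddev {
	int suspended;
	int locked;
	int fail_lock;   /* when set, the next lock attempt is "interrupted" */
	char log[16];
};

static void toy_md_log(struct toy_mddev *m, char c)
{
	size_t n = strlen(m->log);

	if (n + 1 < sizeof(m->log)) {
		m->log[n] = c;
		m->log[n + 1] = '\0';
	}
}

int toy_md_suspend(struct toy_mddev *m)
{
	assert(!m->locked);   /* new rule: never suspend while holding the lock */
	m->suspended++;
	toy_md_log(m, 'S');
	return 0;
}

void toy_md_resume(struct toy_mddev *m)
{
	assert(!m->locked);   /* resume also runs outside the lock */
	m->suspended--;
	toy_md_log(m, 'R');
}

int toy_md_lock(struct toy_mddev *m)
{
	if (m->fail_lock)
		return -EINTR;
	m->locked = 1;
	toy_md_log(m, 'L');
	return 0;
}

void toy_md_unlock(struct toy_mddev *m)
{
	m->locked = 0;
	toy_md_log(m, 'U');
}

/* Suspend first, then lock; undo the suspend if the lock is not taken. */
int toy_md_suspend_and_lock(struct toy_mddev *m)
{
	int ret = toy_md_suspend(m);

	if (ret)
		return ret;
	ret = toy_md_lock(m);
	if (ret)
		toy_md_resume(m);
	return ret;
}

/* Mirror image: drop the lock before resuming. */
void toy_md_unlock_and_resume(struct toy_mddev *m)
{
	toy_md_unlock(m);
	toy_md_resume(m);
}
```

Pairing the two helpers keeps every caller on the same order and makes the
failure path leave the array neither suspended nor locked.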

Finally, for the remaining paths that involve reconfiguration, suspend the
array first (PATCH 11, 12, [4] and PATCH 17).

Preparatory work:
[1] https://lore.kernel.org/all/20230621165110.1498313-1-yukuai1@huaweicloud.com/
[2] https://lore.kernel.org/all/20230628012931.88911-2-yukuai1@huaweicloud.com/
[3] https://lore.kernel.org/all/20230825030956.1527023-1-yukuai1@huaweicloud.com/
[4] https://lore.kernel.org/all/20230825031622.1530464-1-yukuai1@huaweicloud.com/

* md-suspend-rewrite:
  md: rename __mddev_suspend/resume() back to mddev_suspend/resume()
  md: remove old apis to suspend the array
  md: suspend array in md_start_sync() if array need reconfiguration
  md/raid5: replace suspend with quiesce() callback
  md/md-linear: cleanup linear_add()
  md: cleanup mddev_create/destroy_serial_pool()
  md: use new apis to suspend array before mddev_create/destroy_serial_pool
  md: use new apis to suspend array for ioctls involed array reconfiguration
  md: use new apis to suspend array for adding/removing rdev from state_store()
  md: use new apis to suspend array for sysfs apis
  md/raid5: use new apis to suspend array
  md/raid5-cache: use new apis to suspend array
  md/md-bitmap: use new apis to suspend array for location_store()
  md/dm-raid: use new apis to suspend array
  md: add new helpers to suspend/resume and lock/unlock array
  md: add new helpers to suspend/resume array
  md: replace is_md_suspended() with 'mddev->suspended' in md_check_recovery()
  md/raid5-cache: use READ_ONCE/WRITE_ONCE for 'conf->log'
  md: use READ_ONCE/WRITE_ONCE for 'suspend_lo' and 'suspend_hi'
parents 9e55a22f 2b16a525
+3 −7
@@ -3244,7 +3244,7 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
	set_bit(MD_RECOVERY_FROZEN, &rs->md.recovery);

	/* Has to be held on running the array */
	mddev_lock_nointr(&rs->md);
	mddev_suspend_and_lock_nointr(&rs->md);
	r = md_run(&rs->md);
	rs->md.in_sync = 0; /* Assume already marked dirty */
	if (r) {
@@ -3268,7 +3268,6 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
		}
	}

	mddev_suspend(&rs->md);
	set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags);

	/* Try to adjust the raid4/5/6 stripe cache size to the stripe size */
@@ -3798,9 +3797,7 @@ static void raid_postsuspend(struct dm_target *ti)
		if (!test_bit(MD_RECOVERY_FROZEN, &rs->md.recovery))
			md_stop_writes(&rs->md);

		mddev_lock_nointr(&rs->md);
		mddev_suspend(&rs->md);
		mddev_unlock(&rs->md);
		mddev_suspend(&rs->md, false);
	}
}

@@ -4059,8 +4056,7 @@ static void raid_resume(struct dm_target *ti)
		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
		mddev->ro = 0;
		mddev->in_sync = 0;
		mddev_resume(mddev);
		mddev_unlock(mddev);
		mddev_unlock_and_resume(mddev);
	}
}

+2 −2
@@ -175,7 +175,7 @@ static void __init md_setup_drive(struct md_setup_args *args)
		return;
	}

	err = mddev_lock(mddev);
	err = mddev_suspend_and_lock(mddev);
	if (err) {
		pr_err("md: failed to lock array %s\n", name);
		goto out_mddev_put;
@@ -221,7 +221,7 @@ static void __init md_setup_drive(struct md_setup_args *args)
	if (err)
		pr_warn("md: starting %s failed\n", name);
out_unlock:
	mddev_unlock(mddev);
	mddev_unlock_and_resume(mddev);
out_mddev_put:
	mddev_put(mddev);
}
+8 −10
@@ -1861,7 +1861,7 @@ void md_bitmap_destroy(struct mddev *mddev)

	md_bitmap_wait_behind_writes(mddev);
	if (!mddev->serialize_policy)
		mddev_destroy_serial_pool(mddev, NULL, true);
		mddev_destroy_serial_pool(mddev, NULL);

	mutex_lock(&mddev->bitmap_info.mutex);
	spin_lock(&mddev->lock);
@@ -1977,7 +1977,7 @@ int md_bitmap_load(struct mddev *mddev)
		goto out;

	rdev_for_each(rdev, mddev)
		mddev_create_serial_pool(mddev, rdev, true);
		mddev_create_serial_pool(mddev, rdev);

	if (mddev_is_clustered(mddev))
		md_cluster_ops->load_bitmaps(mddev, mddev->bitmap_info.nodes);
@@ -2348,11 +2348,10 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
{
	int rv;

	rv = mddev_lock(mddev);
	rv = mddev_suspend_and_lock(mddev);
	if (rv)
		return rv;

	mddev_suspend(mddev);
	if (mddev->pers) {
		if (mddev->recovery || mddev->sync_thread) {
			rv = -EBUSY;
@@ -2429,8 +2428,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
	}
	rv = 0;
out:
	mddev_resume(mddev);
	mddev_unlock(mddev);
	mddev_unlock_and_resume(mddev);
	if (rv)
		return rv;
	return len;
@@ -2539,7 +2537,7 @@ backlog_store(struct mddev *mddev, const char *buf, size_t len)
	if (backlog > COUNTER_MAX)
		return -EINVAL;

	rv = mddev_lock(mddev);
	rv = mddev_suspend_and_lock(mddev);
	if (rv)
		return rv;

@@ -2564,16 +2562,16 @@ backlog_store(struct mddev *mddev, const char *buf, size_t len)
	if (!backlog && mddev->serial_info_pool) {
		/* serial_info_pool is not needed if backlog is zero */
		if (!mddev->serialize_policy)
			mddev_destroy_serial_pool(mddev, NULL, false);
			mddev_destroy_serial_pool(mddev, NULL);
	} else if (backlog && !mddev->serial_info_pool) {
		/* serial_info_pool is needed since backlog is not zero */
		rdev_for_each(rdev, mddev)
			mddev_create_serial_pool(mddev, rdev, false);
			mddev_create_serial_pool(mddev, rdev);
	}
	if (old_mwb != backlog)
		md_bitmap_update_sb(mddev->bitmap);

	mddev_unlock(mddev);
	mddev_unlock_and_resume(mddev);
	return len;
}

+0 −2
@@ -183,7 +183,6 @@ static int linear_add(struct mddev *mddev, struct md_rdev *rdev)
	 * in linear_congested(), therefore kfree_rcu() is used to free
	 * oldconf until no one uses it anymore.
	 */
	mddev_suspend(mddev);
	oldconf = rcu_dereference_protected(mddev->private,
			lockdep_is_held(&mddev->reconfig_mutex));
	mddev->raid_disks++;
@@ -192,7 +191,6 @@ static int linear_add(struct mddev *mddev, struct md_rdev *rdev)
	rcu_assign_pointer(mddev->private, newconf);
	md_set_array_sectors(mddev, linear_size(mddev, 0, 0));
	set_capacity_and_notify(mddev->gendisk, mddev->array_sectors);
	mddev_resume(mddev);
	kfree_rcu(oldconf, rcu);
	return 0;
}
+128 −105
@@ -206,8 +206,7 @@ static int rdev_need_serial(struct md_rdev *rdev)
 * 1. rdev is the first device which return true from rdev_enable_serial.
 * 2. rdev is NULL, means we want to enable serialization for all rdevs.
 */
void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
			      bool is_suspend)
void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev)
{
	int ret = 0;

@@ -215,15 +214,12 @@ void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
	    !test_bit(CollisionCheck, &rdev->flags))
		return;

	if (!is_suspend)
		mddev_suspend(mddev);

	if (!rdev)
		ret = rdevs_init_serial(mddev);
	else
		ret = rdev_init_serial(rdev);
	if (ret)
		goto abort;
		return;

	if (mddev->serial_info_pool == NULL) {
		/*
@@ -238,10 +234,6 @@ void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
			pr_err("can't alloc memory pool for serialization\n");
		}
	}

abort:
	if (!is_suspend)
		mddev_resume(mddev);
}

/*
@@ -250,8 +242,7 @@ void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
 * 2. when bitmap is destroyed while policy is not enabled.
 * 3. for disable policy, the pool is destroyed only when no rdev needs it.
 */
void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
			       bool is_suspend)
void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev)
{
	if (rdev && !test_bit(CollisionCheck, &rdev->flags))
		return;
@@ -260,8 +251,6 @@ void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
		struct md_rdev *temp;
		int num = 0; /* used to track if other rdevs need the pool */

		if (!is_suspend)
			mddev_suspend(mddev);
		rdev_for_each(temp, mddev) {
			if (!rdev) {
				if (!mddev->serialize_policy ||
@@ -283,8 +272,6 @@ void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
			mempool_destroy(mddev->serial_info_pool);
			mddev->serial_info_pool = NULL;
		}
		if (!is_suspend)
			mddev_resume(mddev);
	}
}

@@ -359,11 +346,11 @@ static bool is_suspended(struct mddev *mddev, struct bio *bio)
		return true;
	if (bio_data_dir(bio) != WRITE)
		return false;
	if (mddev->suspend_lo >= mddev->suspend_hi)
	if (READ_ONCE(mddev->suspend_lo) >= READ_ONCE(mddev->suspend_hi))
		return false;
	if (bio->bi_iter.bi_sector >= mddev->suspend_hi)
	if (bio->bi_iter.bi_sector >= READ_ONCE(mddev->suspend_hi))
		return false;
	if (bio_end_sector(bio) < mddev->suspend_lo)
	if (bio_end_sector(bio) < READ_ONCE(mddev->suspend_lo))
		return false;
	return true;
}
@@ -431,42 +418,73 @@ static void md_submit_bio(struct bio *bio)
	md_handle_request(mddev, bio);
}

/* mddev_suspend makes sure no new requests are submitted
 * to the device, and that any requests that have been submitted
 * are completely handled.
 * Once mddev_detach() is called and completes, the module will be
 * completely unused.
/*
 * Make sure no new requests are submitted to the device, and any requests that
 * have been submitted are completely handled.
 */
void mddev_suspend(struct mddev *mddev)
int mddev_suspend(struct mddev *mddev, bool interruptible)
{
	struct md_thread *thread = rcu_dereference_protected(mddev->thread,
			lockdep_is_held(&mddev->reconfig_mutex));
	int err = 0;

	WARN_ON_ONCE(thread && current == thread->tsk);
	if (mddev->suspended++)
		return;
	wake_up(&mddev->sb_wait);
	set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
	percpu_ref_kill(&mddev->active_io);
	/*
	 * hold reconfig_mutex to wait for normal io will deadlock, because
	 * other context can't update super_block, and normal io can rely on
	 * updating super_block.
	 */
	lockdep_assert_not_held(&mddev->reconfig_mutex);

	if (interruptible)
		err = mutex_lock_interruptible(&mddev->suspend_mutex);
	else
		mutex_lock(&mddev->suspend_mutex);
	if (err)
		return err;

	if (mddev->suspended) {
		WRITE_ONCE(mddev->suspended, mddev->suspended + 1);
		mutex_unlock(&mddev->suspend_mutex);
		return 0;
	}

	if (mddev->pers && mddev->pers->prepare_suspend)
		mddev->pers->prepare_suspend(mddev);
	percpu_ref_kill(&mddev->active_io);
	if (interruptible)
		err = wait_event_interruptible(mddev->sb_wait,
				percpu_ref_is_zero(&mddev->active_io));
	else
		wait_event(mddev->sb_wait,
				percpu_ref_is_zero(&mddev->active_io));
	if (err) {
		percpu_ref_resurrect(&mddev->active_io);
		mutex_unlock(&mddev->suspend_mutex);
		return err;
	}

	wait_event(mddev->sb_wait, percpu_ref_is_zero(&mddev->active_io));
	clear_bit_unlock(MD_ALLOW_SB_UPDATE, &mddev->flags);
	wait_event(mddev->sb_wait, !test_bit(MD_UPDATING_SB, &mddev->flags));
	/*
	 * For raid456, io might be waiting for reshape to make progress,
	 * allow new reshape to start while waiting for io to be done to
	 * prevent deadlock.
	 */
	WRITE_ONCE(mddev->suspended, mddev->suspended + 1);

	del_timer_sync(&mddev->safemode_timer);
	/* restrict memory reclaim I/O during raid array is suspend */
	mddev->noio_flag = memalloc_noio_save();

	mutex_unlock(&mddev->suspend_mutex);
	return 0;
}
EXPORT_SYMBOL_GPL(mddev_suspend);

void mddev_resume(struct mddev *mddev)
{
	lockdep_assert_held(&mddev->reconfig_mutex);
	if (--mddev->suspended)
	lockdep_assert_not_held(&mddev->reconfig_mutex);

	mutex_lock(&mddev->suspend_mutex);
	WRITE_ONCE(mddev->suspended, mddev->suspended - 1);
	if (mddev->suspended) {
		mutex_unlock(&mddev->suspend_mutex);
		return;
	}

	/* entred the memalloc scope from mddev_suspend() */
	memalloc_noio_restore(mddev->noio_flag);
@@ -477,6 +495,8 @@ void mddev_resume(struct mddev *mddev)
	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
	md_wakeup_thread(mddev->thread);
	md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */

	mutex_unlock(&mddev->suspend_mutex);
}
EXPORT_SYMBOL_GPL(mddev_resume);

@@ -672,6 +692,7 @@ int mddev_init(struct mddev *mddev)
	mutex_init(&mddev->open_mutex);
	mutex_init(&mddev->reconfig_mutex);
	mutex_init(&mddev->sync_mutex);
	mutex_init(&mddev->suspend_mutex);
	mutex_init(&mddev->bitmap_info.mutex);
	INIT_LIST_HEAD(&mddev->disks);
	INIT_LIST_HEAD(&mddev->all_mddevs);
@@ -2459,7 +2480,7 @@ static int bind_rdev_to_array(struct md_rdev *rdev, struct mddev *mddev)
	pr_debug("md: bind<%s>\n", b);

	if (mddev->raid_disks)
		mddev_create_serial_pool(mddev, rdev, false);
		mddev_create_serial_pool(mddev, rdev);

	if ((err = kobject_add(&rdev->kobj, &mddev->kobj, "dev-%s", b)))
		goto fail;
@@ -2512,7 +2533,7 @@ static void md_kick_rdev_from_array(struct md_rdev *rdev)
	bd_unlink_disk_holder(rdev->bdev, rdev->mddev->gendisk);
	list_del_rcu(&rdev->same_set);
	pr_debug("md: unbind<%pg>\n", rdev->bdev);
	mddev_destroy_serial_pool(rdev->mddev, rdev, false);
	mddev_destroy_serial_pool(rdev->mddev, rdev);
	rdev->mddev = NULL;
	sysfs_remove_link(&rdev->kobj, "block");
	sysfs_put(rdev->sysfs_state);
@@ -2842,11 +2863,7 @@ static int add_bound_rdev(struct md_rdev *rdev)
		 */
		super_types[mddev->major_version].
			validate_super(mddev, rdev);
		if (add_journal)
			mddev_suspend(mddev);
		err = mddev->pers->hot_add_disk(mddev, rdev);
		if (add_journal)
			mddev_resume(mddev);
		if (err) {
			md_kick_rdev_from_array(rdev);
			return err;
@@ -2983,11 +3000,11 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
		}
	} else if (cmd_match(buf, "writemostly")) {
		set_bit(WriteMostly, &rdev->flags);
		mddev_create_serial_pool(rdev->mddev, rdev, false);
		mddev_create_serial_pool(rdev->mddev, rdev);
		need_update_sb = true;
		err = 0;
	} else if (cmd_match(buf, "-writemostly")) {
		mddev_destroy_serial_pool(rdev->mddev, rdev, false);
		mddev_destroy_serial_pool(rdev->mddev, rdev);
		clear_bit(WriteMostly, &rdev->flags);
		need_update_sb = true;
		err = 0;
@@ -3599,6 +3616,7 @@ rdev_attr_store(struct kobject *kobj, struct attribute *attr,
	struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
	struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj);
	struct kernfs_node *kn = NULL;
	bool suspend = false;
	ssize_t rv;
	struct mddev *mddev = rdev->mddev;

@@ -3606,17 +3624,25 @@ rdev_attr_store(struct kobject *kobj, struct attribute *attr,
		return -EIO;
	if (!capable(CAP_SYS_ADMIN))
		return -EACCES;
	if (!mddev)
		return -ENODEV;

	if (entry->store == state_store && cmd_match(page, "remove"))
	if (entry->store == state_store) {
		if (cmd_match(page, "remove"))
			kn = sysfs_break_active_protection(kobj, attr);
		if (cmd_match(page, "remove") || cmd_match(page, "re-add") ||
		    cmd_match(page, "writemostly") ||
		    cmd_match(page, "-writemostly"))
			suspend = true;
	}

	rv = mddev ? mddev_lock(mddev) : -ENODEV;
	rv = suspend ? mddev_suspend_and_lock(mddev) : mddev_lock(mddev);
	if (!rv) {
		if (rdev->mddev == NULL)
			rv = -ENODEV;
		else
			rv = entry->store(rdev, page, length);
		mddev_unlock(mddev);
		suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev);
	}

	if (kn)
@@ -3921,7 +3947,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
	if (slen == 0 || slen >= sizeof(clevel))
		return -EINVAL;

	rv = mddev_lock(mddev);
	rv = mddev_suspend_and_lock(mddev);
	if (rv)
		return rv;

@@ -4014,7 +4040,6 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
	}

	/* Looks like we have a winner */
	mddev_suspend(mddev);
	mddev_detach(mddev);

	spin_lock(&mddev->lock);
@@ -4100,14 +4125,13 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
	blk_set_stacking_limits(&mddev->queue->limits);
	pers->run(mddev);
	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
	mddev_resume(mddev);
	if (!mddev->thread)
		md_update_sb(mddev, 1);
	sysfs_notify_dirent_safe(mddev->sysfs_level);
	md_new_event();
	rv = len;
out_unlock:
	mddev_unlock(mddev);
	mddev_unlock_and_resume(mddev);
	return rv;
}

@@ -4585,7 +4609,7 @@ new_dev_store(struct mddev *mddev, const char *buf, size_t len)
	    minor != MINOR(dev))
		return -EOVERFLOW;

	err = mddev_lock(mddev);
	err = mddev_suspend_and_lock(mddev);
	if (err)
		return err;
	if (mddev->persistent) {
@@ -4606,14 +4630,14 @@ new_dev_store(struct mddev *mddev, const char *buf, size_t len)
		rdev = md_import_device(dev, -1, -1);

	if (IS_ERR(rdev)) {
		mddev_unlock(mddev);
		mddev_unlock_and_resume(mddev);
		return PTR_ERR(rdev);
	}
	err = bind_rdev_to_array(rdev, mddev);
 out:
	if (err)
		export_rdev(rdev, mddev);
	mddev_unlock(mddev);
	mddev_unlock_and_resume(mddev);
	if (!err)
		md_new_event();
	return err ? err : len;
@@ -5179,7 +5203,8 @@ __ATTR(sync_max, S_IRUGO|S_IWUSR, max_sync_show, max_sync_store);
static ssize_t
suspend_lo_show(struct mddev *mddev, char *page)
{
	return sprintf(page, "%llu\n", (unsigned long long)mddev->suspend_lo);
	return sprintf(page, "%llu\n",
		       (unsigned long long)READ_ONCE(mddev->suspend_lo));
}

static ssize_t
@@ -5194,15 +5219,13 @@ suspend_lo_store(struct mddev *mddev, const char *buf, size_t len)
	if (new != (sector_t)new)
		return -EINVAL;

	err = mddev_lock(mddev);
	err = mddev_suspend(mddev, true);
	if (err)
		return err;

	mddev_suspend(mddev);
	mddev->suspend_lo = new;
	WRITE_ONCE(mddev->suspend_lo, new);
	mddev_resume(mddev);

	mddev_unlock(mddev);
	return len;
}
static struct md_sysfs_entry md_suspend_lo =
@@ -5211,7 +5234,8 @@ __ATTR(suspend_lo, S_IRUGO|S_IWUSR, suspend_lo_show, suspend_lo_store);
static ssize_t
suspend_hi_show(struct mddev *mddev, char *page)
{
	return sprintf(page, "%llu\n", (unsigned long long)mddev->suspend_hi);
	return sprintf(page, "%llu\n",
		       (unsigned long long)READ_ONCE(mddev->suspend_hi));
}

static ssize_t
@@ -5226,15 +5250,13 @@ suspend_hi_store(struct mddev *mddev, const char *buf, size_t len)
	if (new != (sector_t)new)
		return -EINVAL;

	err = mddev_lock(mddev);
	err = mddev_suspend(mddev, true);
	if (err)
		return err;

	mddev_suspend(mddev);
	mddev->suspend_hi = new;
	WRITE_ONCE(mddev->suspend_hi, new);
	mddev_resume(mddev);

	mddev_unlock(mddev);
	return len;
}
static struct md_sysfs_entry md_suspend_hi =
@@ -5482,7 +5504,7 @@ serialize_policy_store(struct mddev *mddev, const char *buf, size_t len)
	if (value == mddev->serialize_policy)
		return len;

	err = mddev_lock(mddev);
	err = mddev_suspend_and_lock(mddev);
	if (err)
		return err;
	if (mddev->pers == NULL || (mddev->pers->level != 1)) {
@@ -5491,15 +5513,13 @@ serialize_policy_store(struct mddev *mddev, const char *buf, size_t len)
		goto unlock;
	}

	mddev_suspend(mddev);
	if (value)
		mddev_create_serial_pool(mddev, NULL, true);
		mddev_create_serial_pool(mddev, NULL);
	else
		mddev_destroy_serial_pool(mddev, NULL, true);
		mddev_destroy_serial_pool(mddev, NULL);
	mddev->serialize_policy = value;
	mddev_resume(mddev);
unlock:
	mddev_unlock(mddev);
	mddev_unlock_and_resume(mddev);
	return err ?: len;
}

@@ -6262,7 +6282,7 @@ static void __md_stop_writes(struct mddev *mddev)
	}
	/* disable policy to guarantee rdevs free resources for serialization */
	mddev->serialize_policy = 0;
	mddev_destroy_serial_pool(mddev, NULL, true);
	mddev_destroy_serial_pool(mddev, NULL);
}

void md_stop_writes(struct mddev *mddev)
@@ -6554,13 +6574,13 @@ static void autorun_devices(int part)
		if (IS_ERR(mddev))
			break;

		if (mddev_lock(mddev))
		if (mddev_suspend_and_lock(mddev))
			pr_warn("md: %s locked, cannot run\n", mdname(mddev));
		else if (mddev->raid_disks || mddev->major_version
			 || !list_empty(&mddev->disks)) {
			pr_warn("md: %s already running, cannot run %pg\n",
				mdname(mddev), rdev0->bdev);
			mddev_unlock(mddev);
			mddev_unlock_and_resume(mddev);
		} else {
			pr_debug("md: created %s\n", mdname(mddev));
			mddev->persistent = 1;
@@ -6570,7 +6590,7 @@ static void autorun_devices(int part)
					export_rdev(rdev, mddev);
			}
			autorun_array(mddev);
			mddev_unlock(mddev);
			mddev_unlock_and_resume(mddev);
		}
		/* on success, candidates will be empty, on error
		 * it won't...
@@ -7120,7 +7140,6 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
			struct bitmap *bitmap;

			bitmap = md_bitmap_create(mddev, -1);
			mddev_suspend(mddev);
			if (!IS_ERR(bitmap)) {
				mddev->bitmap = bitmap;
				err = md_bitmap_load(mddev);
@@ -7130,11 +7149,8 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
				md_bitmap_destroy(mddev);
				fd = -1;
			}
			mddev_resume(mddev);
		} else if (fd < 0) {
			mddev_suspend(mddev);
			md_bitmap_destroy(mddev);
			mddev_resume(mddev);
		}
	}
	if (fd < 0) {
@@ -7423,7 +7439,6 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
			mddev->bitmap_info.space =
				mddev->bitmap_info.default_space;
			bitmap = md_bitmap_create(mddev, -1);
			mddev_suspend(mddev);
			if (!IS_ERR(bitmap)) {
				mddev->bitmap = bitmap;
				rv = md_bitmap_load(mddev);
@@ -7431,7 +7446,6 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
				rv = PTR_ERR(bitmap);
			if (rv)
				md_bitmap_destroy(mddev);
			mddev_resume(mddev);
		} else {
			/* remove the bitmap */
			if (!mddev->bitmap) {
@@ -7456,9 +7470,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
				module_put(md_cluster_mod);
				mddev->safemode_delay = DEFAULT_SAFEMODE_DELAY;
			}
			mddev_suspend(mddev);
			md_bitmap_destroy(mddev);
			mddev_resume(mddev);
			mddev->bitmap_info.offset = 0;
		}
	}
@@ -7529,6 +7541,20 @@ static inline bool md_ioctl_valid(unsigned int cmd)
	}
}

static bool md_ioctl_need_suspend(unsigned int cmd)
{
	switch (cmd) {
	case ADD_NEW_DISK:
	case HOT_ADD_DISK:
	case HOT_REMOVE_DISK:
	case SET_BITMAP_FILE:
	case SET_ARRAY_INFO:
		return true;
	default:
		return false;
	}
}

static int __md_set_array_info(struct mddev *mddev, void __user *argp)
{
	mdu_array_info_t info;
@@ -7661,7 +7687,8 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
	if (!md_is_rdwr(mddev))
		flush_work(&mddev->sync_work);

	err = mddev_lock(mddev);
	err = md_ioctl_need_suspend(cmd) ? mddev_suspend_and_lock(mddev) :
					   mddev_lock(mddev);
	if (err) {
		pr_debug("md: ioctl lock interrupted, reason %d, cmd %d\n",
			 err, cmd);
@@ -7789,7 +7816,10 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
	if (mddev->hold_active == UNTIL_IOCTL &&
	    err != -EINVAL)
		mddev->hold_active = 0;

	md_ioctl_need_suspend(cmd) ? mddev_unlock_and_resume(mddev) :
				     mddev_unlock(mddev);

out:
	if(did_set_md_closing)
		clear_bit(MD_CLOSING, &mddev->flags);
@@ -9323,7 +9353,12 @@ static void md_start_sync(struct work_struct *ws)
{
	struct mddev *mddev = container_of(ws, struct mddev, sync_work);
	int spares = 0;
	bool suspend = false;

	if (md_spares_need_change(mddev))
		suspend = true;

	suspend ? mddev_suspend_and_lock_nointr(mddev) :
		  mddev_lock_nointr(mddev);

	if (!md_is_rdwr(mddev)) {
@@ -9360,7 +9395,7 @@ static void md_start_sync(struct work_struct *ws)
		goto not_running;
	}

	mddev_unlock(mddev);
	suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev);
	md_wakeup_thread(mddev->sync_thread);
	sysfs_notify_dirent_safe(mddev->sysfs_action);
	md_new_event();
@@ -9372,7 +9407,7 @@ static void md_start_sync(struct work_struct *ws)
	clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
	clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
	clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
	mddev_unlock(mddev);
	suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev);

	wake_up(&resync_wait);
	if (test_and_clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery) &&
@@ -9404,19 +9439,7 @@ static void md_start_sync(struct work_struct *ws)
 */
void md_check_recovery(struct mddev *mddev)
{
	if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags) && mddev->sb_flags) {
		/* Write superblock - thread that called mddev_suspend()
		 * holds reconfig_mutex for us.
		 */
		set_bit(MD_UPDATING_SB, &mddev->flags);
		smp_mb__after_atomic();
		if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags))
			md_update_sb(mddev, 0);
		clear_bit_unlock(MD_UPDATING_SB, &mddev->flags);
		wake_up(&mddev->sb_wait);
	}

	if (is_md_suspended(mddev))
	if (READ_ONCE(mddev->suspended))
		return;

	if (mddev->bitmap)