OS-7982: disabled resilver_defer feature leads to looping resilvers


Issue Type:Bug
Priority:2 - Critical
Created at:2019-09-07T00:13:49.722Z
Updated at:2020-01-01T23:03:13.215Z


Created by:Former user
Reported by:Former user
Assigned to:Former user


Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-09-09T23:43:37.626Z)


In production we had a machine reboot. When it rebooted, ZFS wasn't able to open one of the disks, so a spare was used. We later replaced the faulty drive.

Operators noticed that the resilver wasn't making any progress. In fact, the resilver was restarting every few seconds. A user on the mailing list noted this as well: https://smartos.topicbox.com/groups/smartos-discuss/Tfd0fc3c68cacc765-M6cfc1a2787c5cc774ba92f76/strange-zfs-resilvering-never-stops-and-seems-to-restart

It appears that the new deferred resilver code introduced a bug where a resilver is requested to be started (or restarted) even though a resilver is currently running.


Comment by Former user
Created at 2019-09-07T00:17:18.733Z

I believe this is a simple fix. All we need to do is make sure the resilver_defer feature is enabled before we request another resilver.

This is the code in question that is kicking off new resilvers: https://github.com/joyent/illumos-joyent/blob/1cc204b97b9317e681958da0e91abeb27bcc6f82/usr/src/uts/common/fs/zfs/dsl_scan.c#L950-L968
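
To illustrate the idea, here is a minimal sketch of the guard in C. It is a simplified model, not the code from dsl_scan.c: pool_state_t, its fields, and both function names are hypothetical and exist only for this example. It assumes the restart request is issued whenever a resilver is still wanted, which is how the looping behaviour reads from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified pool state; this is not the real spa_t. */
typedef struct pool_state {
    bool resilver_defer_enabled;    /* feature enabled, i.e. pool upgraded */
    bool resilver_wanted;           /* some vdev still needs resilvering */
    unsigned resilver_requests;     /* number of (re)start requests issued */
} pool_state_t;

/*
 * Behaviour as described in this report: a resilver is requested whenever
 * one is wanted, without checking whether the feature is enabled.
 */
bool
request_resilver_unguarded(pool_state_t *ps)
{
    assert(ps != NULL);
    if (!ps->resilver_wanted)
        return (false);
    ps->resilver_requests++;
    return (true);
}

/*
 * Proposed behaviour: only take this path when the resilver_defer feature
 * is enabled on the pool.
 */
bool
request_resilver_guarded(pool_state_t *ps)
{
    assert(ps != NULL);
    if (!ps->resilver_defer_enabled || !ps->resilver_wanted)
        return (false);
    ps->resilver_requests++;
    return (true);
}
```

With resilver_defer_enabled set to false, repeated calls to request_resilver_unguarded() keep incrementing the request count, while request_resilver_guarded() leaves it at zero. The sketch does not model how a pool without the feature starts its resilvers in the first place; presumably that still happens through the regular, non-deferred path.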

Comment by Former user
Created at 2019-09-07T00:30:52.211Z

I should add that the feature has to be supported by the platform but not enabled on the pool. That is, the PI has to be at least 20190605T231205Z, and the pool must not have been upgraded.
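
The condition above can be written as a small predicate. This is only a sketch: pool_may_loop_resilver() and FIRST_AFFECTED_PI are hypothetical names introduced here, and the check has no upper bound because the first PI containing the fix is not named in this issue. It relies on PI stamps having a fixed-width YYYYMMDDTHHMMSSZ form, so that string comparison matches chronological order.

```c
#include <stdbool.h>
#include <string.h>

/* First PI named in this issue as able to trigger the bug. */
#define FIRST_AFFECTED_PI "20190605T231205Z"

/*
 * Hypothetical helper: returns true when a pool could hit the looping
 * resilver, i.e. the PI knows about the feature but the pool has not
 * enabled it. Stamps of the wrong length are rejected rather than compared.
 */
bool
pool_may_loop_resilver(const char *pi_stamp, bool resilver_defer_enabled)
{
    if (pi_stamp == NULL ||
        strlen(pi_stamp) != strlen(FIRST_AFFECTED_PI))
        return (false);

    /* Fixed-width timestamps sort lexicographically in time order. */
    return (strcmp(pi_stamp, FIRST_AFFECTED_PI) >= 0 &&
        !resilver_defer_enabled);
}
```

A pool on a PI at or after the stamp above with the feature not enabled returns true. An upgraded pool, or any pool on an earlier PI, returns false.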

Comment by Former user
Created at 2019-09-09T16:47:54.871Z

Here are some preliminary testing notes.

I created a testpool with a mirror pair and a hot spare. I populated the pool with 350 filesystems with 12.5M of data each, ~5.5G total.

I later also upgraded testpool to verify the problem doesn't exist when the pool has the resilver_defer feature enabled on a bugged PI.

    scan: resilver in progress since Mon Sep  9 16:29:39 2019                           ### start
    scan: resilvered 5.54G in 0 days 00:01:15 with 0 errors on Mon Sep  9 16:30:54 2019 ### end

So upgrading the pool is a possible workaround if booting back to an older PI to avoid the issue isn't feasible.

Comment by Former user
Created at 2019-09-09T19:09:56.662Z

I just finished a run of the ZTS on my SmartOS VM. Two tests are failing.

    root@coke:/opt/zfs-tests# grep '\[FAIL\]' /var/tmp/test_results/20190909T172817/log
    Test: /opt/zfs-tests/tests/functional/cli_root/zpool_import/zpool_import_missing_003_pos (run as root) [01:16] [FAIL]
    Test: /opt/zfs-tests/tests/functional/mmp/mmp_on_zdb (run as root) [00:14] [FAIL]

Both of these tests also fail upstream, in the same way, prior to this change.

Comment by Jira Bot
Created at 2019-09-09T19:20:16.832Z

illumos-joyent commit b67d8733333b09a4f54e1f305953f4186981cdb4 (branch master, by Kody A Kantor)

OS-7982 disabled resilver_defer feature leads to looping resilvers
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>