OS-7920: ldi notifications can trigger vdev_disk_free() without taking the required locks

Details

Issue Type:Bug
Priority:3 - Elevated
Status:Resolved
Created at:2019-08-01T14:46:50.753Z
Updated at:2020-02-05T19:09:42.018Z

People

Created by:Mike Gerdts
Reported by:Mike Gerdts
Assigned to:Jason King

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2020-02-04T14:17:58.405Z)

Fix Versions

2020-02-13 Lloyd Braun (Release Date: 2020-02-13)

Related Issues

Description

There have been several occurrences of panics involving the following stack at about the same time as disk errors are logged. The first two arguments to ldi_ioctl vary. The first is always a garbage pointer and the second is a valid DKIO.

ffffd003d5e7b990 ldi_ioctl+0x2e(5ca96e63, 425, ffffd003d5e7b9e8, 80000000, ffffd0c4fcf1ae18, 0)
ffffd003d5e7ba60 vdev_disk_open+0x4f1(ffffd0c578140680, ffffd003d5e7ba98, ffffd003d5e7ba90, ffffd003d5e7ba88)
ffffd003d5e7bad0 vdev_open+0xdd(ffffd0c578140680)
ffffd003d5e7bb00 vdev_reopen+0x3b(ffffd0c578140680)
ffffd003d5e7bb40 spa_async_probe+0x8a(ffffd0c571f6a000, ffffd0c578140680)
ffffd003d5e7bb80 spa_async_probe+0x56(ffffd0c571f6a000, ffffd0c578142640)
ffffd003d5e7bbc0 spa_async_probe+0x56(ffffd0c571f6a000, ffffd0c5766a7d40)
ffffd003d5e7bc20 spa_async_thread+0x281(ffffd0c571f6a000)
ffffd003d5e7bc30 thread_start+8()

There are three calls to ldi_ioctl in vdev_disk_open:

335         vdev_disk_alloc(vd);
336         dvd = vd->vdev_tsd;
...
568         if ((error = ldi_ioctl(dvd->vd_lh, DKIOCGMEDIAINFOEXT,
...
573         } else if ((error = ldi_ioctl(dvd->vd_lh, DKIOCGMEDIAINFO,
...
581         } else {
...
586         }
...
590         if (vd->vdev_wholedisk == 1) {
...
607                 (void) ldi_ioctl(dvd->vd_lh, DKIOCSETWCE, (intptr_t)&wce,
608                     FKIOCTL, kcred, NULL);
609         }

In this case, we are calling ldi_ioctl(<bad_address>, DKIOSETWCE, ...), the call at line 607. In order to get here, we've made at least one other call to ldi_ioctl(dvd->vd_lh, ...) with a valid address in dvd->vd_lh. This appears to be a use-after-free, and is supported by:

> ffffd0c578140680::print vdev_t vdev_tsd
vdev_tsd = 0

The only place that vdev_tsd is set to NULL is in vdev_disk_free, which is called only by:

   1    172  usr/src/uts/common/fs/zfs/vdev_disk.c <<vdev_disk_off_finalize>>
             vdev_disk_free(vd);
   2    325  usr/src/uts/common/fs/zfs/vdev_disk.c <<vdev_disk_open>>
             vdev_disk_free(vd);
   3    653  usr/src/uts/common/fs/zfs/vdev_disk.c <<vdev_disk_close>>
             vdev_disk_free(vd);

vdev_disk_off_finalize is only called as an ldi callback. vdev_disk_open is not suspect as a race involving itself would be blocked by spa_config_lock taken higher in the stack. vdev_disk_close may be called via a vdev_disk_ops callback or via an ldi callback. The ldi callbacks are quite suspicious.

184 static ldi_ev_callback_t vdev_disk_off_callb = {
185         .cb_vers = LDI_EV_CB_VERS,
186         .cb_notify = vdev_disk_off_notify,
187         .cb_finalize = vdev_disk_off_finalize
188 };

Neither of those callbacks call spa_config_enter, allowing their manipulation of the vdev_disk to race with vdev_disk_open.

Comments

Comment by Mike Gerdts
Created at 2019-08-01T15:03:59.021Z

Naïvely, the fix could be to simply add calls to:

spa_config_enter(vd->vd_spa, SCL_STATE_ALL, RW_WRITER);

in those functions. It is not clear to me, however, that this is the right thing to do when it is racing with vdev_reopen.

In any case, vdev_disk_close would do well to call:

VERIFY3S(spa_config_held(vd->vd_spa, SCL_STATE_ALL, RW_WRITER), ==, SCL_STATE_ALL);

Comment by Mike Gerdts
Created at 2019-12-11T14:09:03.201Z
Updated at 2019-12-11T23:33:31.392Z

This note was written as tests were running in anticipation that all would go well. The zfs tests initially led to memory exhaustion on a too-small VM. The dtrace watcher script shows that zfs tests never exercised this code path. The loop at the end of this note has been shown to bring instability to debug and non-debug kernels. The general pattern is that spa_sync() has a read lock when vdisk_offline_finalize() tries to take the write lock. I suspect that there is something in the vdisk_offline_finalize() stack that holds a lock that some zio needs.

The fix

I added calls to spa_config_enter() and spa_config_exit() to vdev_disk_off_notify() and vdev_disk_off_finalize() as I suggested in my initial analysis. There are legitimate callers to vdev_disk_close() that do not hold the SCL_STATE_ALL set of locks, so the VERIFY() I suggested is in appropriate.

Testing

During all testing I have had the following dtrace script running to watch for times when I hit this code path.

#! /usr/sbin/dtrace -s

vdev_disk_off_notify:entry,
vdev_disk_off_finalize:entry
{
	self->interest = 1;
}

ldi_ev_get_type:return
/self->interest/
{
	printf("%Y %s", walltimestamp, stringof(arg1));
	stack();
	self->interest = 0;
}

zfstest

The ZFS test suite on debug is underway. This space will be updated.

Manual testing

I found the easiest way to trigger this code path was with iscsi. Offlining an LU at the target can trigger this code path.

In the setup described below, the target and initiator are on two different systems. While it may be possible to have them on the same system, this could lead to same-kernel zfs-on-zfs deadlocks.

The iscsi target and initiator settings do not survive reboot on SmartOS. After a reboot, the svcadm, itadm, stmfadm, and iscsiadm setup commands will need to be repeated. After they are executed, any pool that had previously existed on the LUs (but not destroyed) may be imported.

target setup

[root@cn2 (coal-1) ~]# svcadm enable -r iscsi/target
[root@cn2 (coal-1) ~]# itadm create-target
Target iqn.2010-08.org.illumos:02:55bbfe58-53aa-4921-bae5-eda0ea7e2f33 successfully created
[root@cn2 (coal-1) ~]# zfs create zones/zfstest
[root@cn2 (coal-1) ~]# zfs set compression=on zones/zfstest
[root@cn2 (coal-1) ~]# zfs create -V 1g zones/zfstest/lun0
[root@cn2 (coal-1) ~]# zfs create -V 1g zones/zfstest/lun1
[root@cn2 (coal-1) ~]# zfs create -V 1g zones/zfstest/lun2

[root@cn2 (coal-1) ~]# stmfadm create-lu /dev/zvol/rdsk/zones/zfstest/lun0
Logical unit created: 600144F00C9D700000005DF0EC0B0001
[root@cn2 (coal-1) ~]# stmfadm create-lu /dev/zvol/rdsk/zones/zfstest/lun1
Logical unit created: 600144F00C9D700000005DF0EC0D0002
[root@cn2 (coal-1) ~]# stmfadm create-lu /dev/zvol/rdsk/zones/zfstest/lun2
Logical unit created: 600144F00C9D700000005DF0EC0E0003

[root@cn2 (coal-1) ~]# stmfadm list-lu
LU Name: 600144F00C9D700000005DF0EC0B0001
LU Name: 600144F00C9D700000005DF0EC0D0002
LU Name: 600144F00C9D700000005DF0EC0E0003

[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0B0001
[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0D0002
[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0E0003

[root@cn2 (coal-1) ~]# stmfadm list-lu | awk '{print "stmfadm add-view", $NF}' | sh -ex
+ stmfadm add-view 600144F00C9D700000005DF0EC0B0001
+ stmfadm add-view 600144F00C9D700000005DF0EC0D0002
+ stmfadm add-view 600144F00C9D700000005DF0EC0E0003

[root@cn2 (coal-1) ~]# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
e1000g0: flags=1001004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4,FIXEDMTU> mtu 1500 index 2
        inet 10.99.99.40 netmask ffffff00 broadcast 10.99.99.255
        ether 52:54:0:48:bb:8c
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
        inet6 ::1/128

initiator setup

[root@cn3 (coal-1) ~]# svcadm enable -rs iscsi/initiator
[root@cn3 (coal-1) ~]# iscsiadm modify discovery -t enable
[root@cn3 (coal-1) ~]# iscsiadm add discovery-address 10.99.99.40

[root@cn3 (coal-1) ~]# devfsadm
Could not open /etc/saf/zsmon/_pmtab
[root@cn3 (coal-1) ~]# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0t600144F00C9D700000005DF0EC0B0001d0 <SUN-COMSTAR-1.0 cyl 1022 alt 2 hd 64 sec 32>
          /scsi_vhci/disk@g600144f00c9d700000005df0ec0b0001
       1. c0t600144F00C9D700000005DF0EC0D0002d0 <SUN-COMSTAR-1.0 cyl 1022 alt 2 hd 64 sec 32>
          /scsi_vhci/disk@g600144f00c9d700000005df0ec0d0002
       2. c0t600144F00C9D700000005DF0EC0E0003d0 <SUN-COMSTAR-1.0 cyl 1022 alt 2 hd 64 sec 32>
          /scsi_vhci/disk@g600144f00c9d700000005df0ec0e0003
       3. c1d0 <QEMU HARDDISK=QM00001-QM00001-0001-40.00GB>
          /pci@0,0/pci-ide@1,1/ide@0/cmdk@0,0
       4. c2t0d0 <Virtio-Block Device-0000-1.00GB>
          /pci@0,0/pci1af4,2@9/blkdev@0,0
       5. c3t0d0 <Virtio-Block Device-0000-1.00GB>
          /pci@0,0/pci1af4,2@a/blkdev@0,0
Specify disk (enter its number): ^D

[root@cn3 (coal-1) ~]# zpool create iscsi \
   mirror c0t600144F00C9D700000005DF0EC0B0001d0 c0t600144F00C9D700000005DF0EC0D0002d0 \
   spare c0t600144F00C9D700000005DF0EC0E0003d0

Exercise on debug build

The initiator system is running the 20191211T012722Z debug build. I use four windows:

1. On the initiator CN, run ./watch-OS-7920 (the dtrace script mentioned above)
2. On the initiator CN, run while true; do dd if=/dev/zero of=/iscsi/file bs=1024k count=100
3. On the initiator CN, run other commands
4. On the target CN, disable and enable an LU

Start the watcher in window 1 and the dd loop in window2. Wait for dd to report stats at least once.

In window 4 (target system), offline one of the LUs that makes up the mirror.

[root@cn2 (coal-1) ~]# stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002

In window 1, notice that the changed code path was hit:

[root@cn3 (coal-1) /var/tmp/OS-7920]# ./watch-OS-7920
dtrace: script './watch-OS-7920' matched 3 probes


CPU     ID                    FUNCTION:NAME
  0  20147           ldi_ev_get_type:return 2019 Dec 11 13:28:41 LDI:EVENT:OFFLINE
              zfs`vdev_disk_off_notify+0x37
              genunix`ldi_invoke_notify+0x11a
              genunix`e_ddi_offline_notify+0x8b
              genunix`devi_detach_node+0x89
              genunix`ndi_devi_offline+0x13d
              genunix`i_mdi_pi_state_change+0x5d9
              genunix`mdi_pi_offline+0x3d
              iscsi`iscsi_lun_offline+0xa6
              iscsi`iscsi_lun_destroy+0x43
              iscsi`iscsi_sess_reportluns+0x3cc
              iscsi`iscsi_sess_enumeration+0xb6
              genunix`taskq_thread+0x315
              unix`thread_start+0xb

  3  20147           ldi_ev_get_type:return 2019 Dec 11 13:28:43 LDI:EVENT:OFFLINE
              zfs`vdev_disk_off_finalize+0x3a
              genunix`ldi_invoke_finalize+0x132
              genunix`e_ddi_offline_finalize+0x79
              genunix`devi_detach_node+0x1d7
              genunix`ndi_devi_offline+0x13d
              genunix`i_mdi_pi_state_change+0x5d9
              genunix`mdi_pi_offline+0x3d
              iscsi`iscsi_lun_offline+0xa6
              iscsi`iscsi_lun_destroy+0x43
              iscsi`iscsi_sess_reportluns+0x3cc
              iscsi`iscsi_sess_enumeration+0xb6
              genunix`taskq_thread+0x315
              unix`thread_start+0xb


In window 3, observe that the spare was added to the mirror.

[root@cn3 (coal-1) ~]# zpool status
  pool: iscsi
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 3.15M in 0 days 00:00:04 with 0 errors on Wed Dec 11 13:28:52 2019
config:

        NAME                                         STATE     READ WRITE CKSUM
        iscsi                                        DEGRADED     0     0     0
          mirror-0                                   DEGRADED     0     0     0
            c0t600144F00C9D700000005DF0EC0B0001d0    ONLINE       0     0     0
            spare-1                                  DEGRADED     0     0     0
              c0t600144F00C9D700000005DF0EC0D0002d0  REMOVED      0     0     0
              c0t600144F00C9D700000005DF0EC0E0003d0  ONLINE       0     0     0
        spares
          c0t600144F00C9D700000005DF0EC0E0003d0      INUSE     currently in use

errors: No known data errors
...

In window 4, online the LU

[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0D0002

In window 3, observe that the pool returns to its initial configuration

[root@cn3 (coal-1) ~]# zpool status
  pool: iscsi
 state: ONLINE
  scan: resilvered 3.15M in 0 days 00:00:04 with 0 errors on Wed Dec 11 13:28:52 2019
config:

        NAME                                         STATE     READ WRITE CKSUM
        iscsi                                        ONLINE       0     0     0
          mirror-0                                   ONLINE       0     0     0
            c0t600144F00C9D700000005DF0EC0B0001d0    ONLINE       0     0     0
            spare-1                                  ONLINE       0     0     0
              c0t600144F00C9D700000005DF0EC0D0002d0  ONLINE       0     0     0
              c0t600144F00C9D700000005DF0EC0E0003d0  ONLINE       0     0     0
        spares
          c0t600144F00C9D700000005DF0EC0E0003d0      INUSE     currently in use

errors: No known data errors
...

[root@cn3 (coal-1) ~]# zpool status
  pool: iscsi
 state: ONLINE
  scan: resilvered 333K in 0 days 00:00:01 with 0 errors on Wed Dec 11 13:29:17 2019
config:

        NAME                                       STATE     READ WRITE CKSUM
        iscsi                                      ONLINE       0     0     0
          mirror-0                                 ONLINE       0     0     0
            c0t600144F00C9D700000005DF0EC0B0001d0  ONLINE       0     0     0
            c0t600144F00C9D700000005DF0EC0D0002d0  ONLINE       0     0     0
        spares
          c0t600144F00C9D700000005DF0EC0E0003d0    AVAIL

errors: No known data errors
...

After doing this once manually, I put the offline/online into a loop:

[root@cn2 (coal-1) ~]# (set -ex; while true; do stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002; sleep 15; stmfadm online-lu 600144F00C9D700000005DF0EC0D0002; sleep 45; done)
+ true
+ stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 15
+ stmfadm online-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 45
+ true
+ stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 15
+ stmfadm online-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 45
...

Also, I watched the pool cycle through its various stages of self-healing with:

[root@cn3 (coal-1) ~]# while true; do date; zpool status iscsi; sleep 10; done

Comment by Mike Gerdts
Created at 2019-12-11T23:49:25.883Z
Updated at 2019-12-11T23:51:18.795Z

Multiple systems in a "we just repaired hardware state" have hit NULL pointer dereferences in zvol_dumpio_vdev. In one instance it led to a panic loop as described in OPS-4518 and another one panic was followed by a panic loop that looks like that described in this ticket.

OPS-4518 had a stack that looked like:

JA701916 ttyb login: 2018-08-30T12:39:2.778618+00:00 JA701916 scsi: [ID 107833 kern.warning] WARNING: /pci@cd,0/pci8086,2030@0/pci15d9,808@0/iport@8/disk@w5000cca02dc51559,0 (sd8):#012#011Unable to clean up memory because of pending I/O#012

panic[cpu0]/thread=fffffb04105eec40: BAD TRAP: type=e (#pf Page fault) rp=fffffb04105ee440 addr=10 occurred in module "zfs" due to a NULL pointer dereference

fsflush: #pf Page fault
Bad kernel fault at addr=0x10
pid=3, pc=0xfffffffff7e2728f, sp=0xfffffb04105ee530, eflags=0x10206
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe>  cr4: 3606f8<smap,smep,osxsav,pcide,xmme,fxsr,pge,mce,pae,pse,de>
cr2: 10  cr3: 1e000000  cr8: 0

        rdi: fffffdc3e5d2b800 rsi: fffffafa95e61000 rdx:             1000
        rcx:       221dcd1000  r8:               c0  r9:                0
        rax:                0 rbx: fffffdc3e5d2b800 rbp: fffffb04105ee5c0
        r10: fffffffffb870be8 r11:                0 r12:                0
        r13:                0 r14:                0 r15:       221dcd1000
        fsb:                0 gsb: fffffffffbc4c000  ds:               4b
         es:               4b  fs:                0  gs:              1c3
        trp:                e err:                0 rip: fffffffff7e2728f
         cs:               30 rfl:            10206 rsp: fffffb04105ee530
         ss:               38

fffffb04105ee330 unix:real_mode_stop_cpu_stage2_end+bc1c ()
fffffb04105ee430 unix:trap+15b1 ()
fffffb04105ee440 unix:_cmntrap+e6 ()
fffffb04105ee5c0 zfs:zvol_dumpio_vdev+10f ()
fffffb04105ee660 zfs:zvol_dumpio_vdev+198 ()
fffffb04105ee700 zfs:zvol_dumpio+133 ()
fffffb04105ee7a0 zfs:zvol_strategy+2e5 ()
fffffb04105ee7d0 genunix:bdev_strategy+61 ()
fffffb04105ee830 specfs:spec_startio+8e ()
fffffb04105ee8e0 specfs:spec_putapage+de ()
fffffb04105ee990 specfs:spec_putpage+1dd ()
fffffb04105eea10 genunix:fop_putpage+4c ()
fffffb04105eea90 genunix:fsflush_do_pages+4a2 ()
fffffb04105eeb30 genunix:fsflush+49a ()
fffffb04105eeb40 unix:thread_start+8 ()

dumping to /dev/zvol/dsk/zones/dump, offset 65536, content: kernel

The other panic on 20190228T233723Z has the same stack. For the frames in the zfs module, the offsets are the same.

For the 20190228 release, zvol_dumpio_vdev is:

> ::status
debugging live kernel (64-bit) on cn4
operating system: 5.11 joyent_20190228T233723Z (i86pc)
git branch: release-20190228
git rev: 7895a719812f7049054cb9b3fb04d28893f0af16
image uuid: (not set)
> zvol_dumpio_vdev::dis
...
zvol_dumpio_vdev+0xfa:          movq   -0x78(%rbp),%rdx
zvol_dumpio_vdev+0xfe:          sbbl   %r8d,%r8d
zvol_dumpio_vdev+0x101:         movq   -0x70(%rbp),%rsi
zvol_dumpio_vdev+0x105:         movq   %r15,%rcx
zvol_dumpio_vdev+0x108:         andl   $0xc0,%r8d
zvol_dumpio_vdev+0x10f:         movq   0x10(%rax),%rdi
zvol_dumpio_vdev+0x113:         addl   $0x40,%r8d
zvol_dumpio_vdev+0x117:         call   -0x285dc <vdev_disk_ldi_physio>
...

That corresponds to this:

1173         } else {
1174                 dvd = vd->vdev_tsd;
1175                 ASSERT3P(dvd, !=, NULL);
1176                 return (vdev_disk_ldi_physio(dvd->vd_lh, addr, size,
1177                     offset, doread ? B_READ : B_WRITE));
1178         }

From this we can gather that dvd is NULL. We should be safe using dvd because zvol_dumpio() took a read lock at line 1208. I have looked to verify that there could not be a race with ddi_in_panic() - if it returns true, all CPUs other than the one running the panic thread are disabled. The panics described in this comment are always a fsflush() thread that is not panicking.

1207         if (!ddi_in_panic())
1208                 spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
1209
1210         vd = vdev_lookup_top(spa, DVA_GET_VDEV(&ze->ze_dva));
1211         offset += DVA_GET_OFFSET(&ze->ze_dva);
1212         error = zvol_dumpio_vdev(vd, addr, offset, DVA_GET_OFFSET(&ze->ze_dva),
1213             size, doread, isdump);
1214
1215         if (!ddi_in_panic())
1216                 spa_config_exit(spa, SCL_STATE, FTAG);

This leads us back to "what could modify vd->vdev_tsd?" We get back to one of the stars of this bug, vdev_disk_free().

 94 static void
 95 vdev_disk_free(vdev_t *vd)
 96 {
 ...
113         kmem_free(dvd, sizeof (vdev_disk_t));
114         vd->vdev_tsd = NULL;
115 }

vdev_disk_free() is called only in an error path of vdev_disk_open() and by vdev_disk_close(). Notably, vdev_disk_off_notify() does not hold a lock when it calls vdev_disk_close().

Therefore, this bug is almost certainly the root cause of the panic loop seen in OPS-4518.


Comment by Mike Gerdts
Created at 2019-12-14T16:56:33.589Z

With a plan to commit the fix into illumos, I switched over to using OmniOS with current illumos-gate. There I've hit a couple different panics than what I've hit on SmartOS. I've hit one in abd_free_linear many times but have not been able to confidently point my finger at this as the cause. See illumos#12076.

Another one, however, I can confidently say is caused by this.

> ::status
debugging crash dump vmcore.7 (64-bit) from omni-2
operating system: 5.11 omni-physio-0-g59a9ce0e70 (i86pc)
build version: gfx-drm - heads/master-0-gbdc58b1-dirty

image uuid: b1d960bc-afbe-43c0-eea2-a975cb6312b2
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0008a99710 addr=ffffff094ae64778
dump content: kernel pages only
> $C
ffffff0008a99820 bdev_strategy+0x26(ffffff029ee775c0)
ffffff0008a99850 ldi_strategy+0x47(ffffff03086d9600, ffffff029ee775c0)
ffffff0008a998c0 vdev_disk_io_start+0x254(ffffff02733b45b8)
ffffff0008a99920 zio_vdev_io_start+0xdf(ffffff02733b45b8)
ffffff0008a99960 zio_execute+0xf5(ffffff02733b45b8)
ffffff0008a99990 zio_nowait+0x87(ffffff02733b45b8)
ffffff0008a999d0 vdev_queue_io_done+0x92(ffffff030f504648)
ffffff0008a99a10 zio_vdev_io_done+0xc8(ffffff030f504648)
ffffff0008a99a50 zio_execute+0xf5(ffffff030f504648)
ffffff0008a99b00 taskq_thread+0x315(ffffff03086e0ea0)
ffffff0008a99b10 thread_start+0xb()
>

This was purportedly fixed in OS-7797, but my workspace has that fix. It seems to have just tightened the race. Here we can see that the OS-7797 fix is present (failure of ldi_strategy() now calls zio_interrupt()) we still ran into the disk being offline and vd_lh being NULL.

> vdev_disk_io_start+0x254::dis
...
vdev_disk_io_start+0x24f:       call   +0x442223c       <ldi_strategy>
vdev_disk_io_start+0x254:       testl  %eax,%eax
vdev_disk_io_start+0x256:       je     -0xea    <vdev_disk_io_start+0x172>
vdev_disk_io_start+0x25c:       jmp    -0x101   <vdev_disk_io_start+0x160>
...
> vdev_disk_io_start+0x160::dis
...
vdev_disk_io_start+0x160:       movl   $0x6,0x31c(%rbx)
vdev_disk_io_start+0x16a:       movq   %rbx,%rdi
vdev_disk_io_start+0x16d:       call   -0x14cb2 <zio_interrupt>
...
> ffffff02733b45b8::print zio_t io_vd | ::print vdev_t vdev_tsd | ::print vdev_disk_t
{
    vd_devid = 0
    vd_minor = 0
    vd_lh = 0
    vd_ldi_cbs = {
        list_size = 0x18
        list_offset = 0
        list_head = {
            list_next = 0xffffff026e2ad008
            list_prev = 0xffffff18277fc4b8
        }
    }
    vd_ldi_offline = 0x1 (B_TRUE)
}

That is:

 893 static void
 894 vdev_disk_io_start(zio_t *zio)
 895 {
 896         vdev_t *vd = zio->io_vd;
 897         vdev_disk_t *dvd = vd->vdev_tsd;
 898         unsigned long trim_flags = 0;
 899         vdev_buf_t *vb;
 900         struct dk_callback *dkc;
 901         buf_t *bp;
 902         int error;
 903
 904         /*
 905          * If the vdev is closed, it's likely in the REMOVED or FAULTED state.
 906          * Nothing to be done here but return failure.
 907          */
 908         if (dvd == NULL || (dvd->vd_ldi_offline && dvd->vd_lh == NULL)) {
 909                 zio->io_error = ENXIO;
 910                 zio_interrupt(zio);
 911                 return;
 912         }

// KAPOW!  vdev_disk_off_notify() struck from another thread

1024         /*
1025          * In general we would expect ldi_strategy() to return non-zero only
1026          * because of programming errors, but we've also seen this fail shortly
1027          * after a disk dies.
1028          */
1029         if (ldi_strategy(dvd->vd_lh, bp) != 0) {
1030                 zio->io_error = ENXIO;
1031                 zio_interrupt(zio);
1032         }

Comment by Mike Gerdts
Created at 2019-12-20T17:18:42.647Z

I won't have time to work on this in the near future, transferring over to @jason.king. In the spirt of upstream first, I've been working this as illumos#4454.


Comment by Jason King
Created at 2020-01-03T22:38:49.738Z

Given all the issues trying to create the issue w/ iSCSI volumes, since I already had a SmartOS VM setup with volumes for running the zfs test suite, I booted the VM w/ a current build of master to see if I could induce errors to force the issue:

First creating the pool:

[root@zfs-test ~]# diskinfo
TYPE    DISK                    VID      PID              SIZE          RMV SSD
-       c2t0d0                  Virtio   Block Device       20.00 GiB   no  no
-       c3t0d0                  Virtio   Block Device       10.00 GiB   no  no
-       c4t0d0                  Virtio   Block Device       10.00 GiB   no  no
-       c5t0d0                  Virtio   Block Device       10.00 GiB   no  no
[root@zfs-test ~]# zpool create testpool mirror c3t0d0 c4t0d0 spare c5t0d0

Unfortunately, toggling the readonly property of a zvol only takes effect on first open, however the zvol_strategy function always checks the zv_flags field on write I/Os to see if the ZVOL_RDONLY flag is set. So the next thing is to obtain the address of the zvol_state_t for one of the mirrors. This was done by performing a read in the VM:

[root@zfs-test ~]# dd if=/dev/rdsk/c4t0d0 of=/dev/null count=1
1+0 records in
1+0 records out
512 bytes transferred in 0.000482 secs (1.01MB/sec)

At the same time, running the following simple dtrace command in the hypervisor (i.e. the actual SmartOS instance running on the raw hardware):

[root@bob ~]# dtrace -n 'zfsdev_get_soft_state:return { zv = (zvol_state_t *)arg1; printf("\n0x%llx %s\n", arg1, zv->zv_name) }'
dtrace: description 'zfsdev_get_soft_state:return ' matched 1 probe
CPU     ID                    FUNCTION:NAME
 10  71634     zfsdev_get_soft_state:return
0xfffffeb1b416f000 zones/8e5e4967-3c9e-ed13-bb8a-bbbecc5502d9/disk2

Now to create some data so there'll be something to resilver:

[root@zfs-test ~]# dd if=/dev/urandom of=/testpool/foo count=10000
10000+0 records in
10000+0 records out
5120000 bytes transferred in 0.451938 secs (10.8MB/sec)

Unfortunately, I then noticed that bhyve uses the character device when using zvols, not the block device (so zvol_strategy is never called). However after making a small patch:

and booting the hypervisor with it, I was able to induce a replacement, however it didn't cause the device to go offline.


Comment by Jason King
Created at 2020-01-03T23:51:26.929Z

Trying a slightly different approach.. BHYVE can emulate an NVME device (thanks to andyf for pointing that out!).

I created an virtual NVME device for my test vm:

[root@bob ~]# mkfile 10g /zones/8e5e4967-3c9e-ed13-bb8a-bbbecc5502d9/root/nvme.img
[root@bob ~]# echo '{ "bhyve_extra_opts": "-s 0:5,nvme,/nvme.img,ser=12345" }' | vmadm update 8e5e4967-3c9e-ed13-bb8a-bbbecc5502d9

Verified it was visible:

[root@zfs-test ~]# diskinfo
TYPE    DISK                    VID      PID              SIZE          RMV SSD
NVME    c1t1d0                  NVMe     bhyve-NVMe         10.00 GiB   no  yes
-       c3t0d0                  Virtio   Block Device       20.00 GiB   no  no
-       c4t0d0                  Virtio   Block Device       10.00 GiB   no  no
-       c5t0d0                  Virtio   Block Device       10.00 GiB   no  no
-       c6t0d0                  Virtio   Block Device       10.00 GiB   no  no

I then made a small patch to the nvme driver to allow for detach while in use (note: once detached, it requires a reboot of the VM to reuse the device -- this was strictly for testing):

diff --git a/usr/src/uts/common/io/nvme/nvme.c b/usr/src/uts/common/io/nvme/nvme.c
index 66ec51c209..ec079d79ff 100644
--- a/usr/src/uts/common/io/nvme/nvme.c
+++ b/usr/src/uts/common/io/nvme/nvme.c
@@ -4518,6 +4518,12 @@ nvme_ioctl_detach(nvme_t *nvme, int nsid, nvme_ioctl_t *nioc, int mode,
        if (nvme->n_ns[nsid - 1].ns_ignore)
                return (0);

+       rv = ndi_devi_offline(nvme->n_dip, NDI_DEVI_REMOVE);
+       if (rv != DDI_SUCCESS)
+               return (EIO);
+
+       nvme_remove_callback(nvme->n_dip, 0, nvme, NULL);
+
        rv = bd_detach_handle(nvme->n_ns[nsid - 1].ns_bd_hdl);
        if (rv != DDI_SUCCESS)
                rv = EBUSY;

It fails with EIO (which is fine for this test), but still invokes the callbacks we're wanting:

[root@zfs-test ~]# dtrace -n 'ldi_invoke_notify:entry,vdev_disk_off_notify:entry {}'
dtrace: description 'ldi_invoke_notify:entry,vdev_disk_off_notify:entry ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  2  20981          ldi_invoke_notify:entry
  2  52638       vdev_disk_off_notify:entry
  2  20981          ldi_invoke_notify:entry

Now we can make use of dtrace to recreate the panic, then test the fix.