Issue Type: | Bug
---|---
Priority: | 3 - Elevated
Status: | Resolved
Created at: | 2019-08-01T14:46:50.753Z
Updated at: | 2021-08-09T17:52:27.094Z
Created by: | Former user
Reported by: | Former user
Assigned to: | Former user
Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2020-02-04T14:17:58.405Z)
2020-02-13 Lloyd Braun (Release Date: 2020-02-13)
There have been several occurrences of panics involving the following stack at about the same time as disk errors are logged. The first two arguments to ldi_ioctl vary. The first is always a garbage pointer and the second is a valid DKIO ioctl command.
ffffd003d5e7b990 ldi_ioctl+0x2e(5ca96e63, 425, ffffd003d5e7b9e8, 80000000, ffffd0c4fcf1ae18, 0)
ffffd003d5e7ba60 vdev_disk_open+0x4f1(ffffd0c578140680, ffffd003d5e7ba98, ffffd003d5e7ba90, ffffd003d5e7ba88)
ffffd003d5e7bad0 vdev_open+0xdd(ffffd0c578140680)
ffffd003d5e7bb00 vdev_reopen+0x3b(ffffd0c578140680)
ffffd003d5e7bb40 spa_async_probe+0x8a(ffffd0c571f6a000, ffffd0c578140680)
ffffd003d5e7bb80 spa_async_probe+0x56(ffffd0c571f6a000, ffffd0c578142640)
ffffd003d5e7bbc0 spa_async_probe+0x56(ffffd0c571f6a000, ffffd0c5766a7d40)
ffffd003d5e7bc20 spa_async_thread+0x281(ffffd0c571f6a000)
ffffd003d5e7bc30 thread_start+8()
There are three calls to ldi_ioctl in vdev_disk_open:
335 	vdev_disk_alloc(vd);
336 	dvd = vd->vdev_tsd;
...
568 	if ((error = ldi_ioctl(dvd->vd_lh, DKIOCGMEDIAINFOEXT,
...
573 	} else if ((error = ldi_ioctl(dvd->vd_lh, DKIOCGMEDIAINFO,
...
581 	} else {
...
586 	}
...
590 	if (vd->vdev_wholedisk == 1) {
...
607 		(void) ldi_ioctl(dvd->vd_lh, DKIOCSETWCE, (intptr_t)&wce,
608 		    FKIOCTL, kcred, NULL);
609 	}
In this case, we are calling ldi_ioctl(<bad_address>, DKIOCSETWCE, ...), the call at line 607. In order to get here, we've made at least one other call to ldi_ioctl(dvd->vd_lh, ...) with a valid address in dvd->vd_lh. This appears to be a use-after-free, which is supported by:
> ffffd0c578140680::print vdev_t vdev_tsd
vdev_tsd = 0
The only place that vdev_tsd is set to NULL is in vdev_disk_free, which is called only by:

1 172 usr/src/uts/common/fs/zfs/vdev_disk.c <<vdev_disk_off_finalize>> vdev_disk_free(vd);
2 325 usr/src/uts/common/fs/zfs/vdev_disk.c <<vdev_disk_open>> vdev_disk_free(vd);
3 653 usr/src/uts/common/fs/zfs/vdev_disk.c <<vdev_disk_close>> vdev_disk_free(vd);
vdev_disk_off_finalize is only called as an ldi callback. vdev_disk_open is not suspect, as a race involving itself would be blocked by spa_config_lock, taken higher in the stack. vdev_disk_close may be called via a vdev_disk_ops callback or via an ldi callback. The ldi callbacks are quite suspicious.
184 static ldi_ev_callback_t vdev_disk_off_callb = {
185 	.cb_vers = LDI_EV_CB_VERS,
186 	.cb_notify = vdev_disk_off_notify,
187 	.cb_finalize = vdev_disk_off_finalize
188 };
Neither of those callbacks calls spa_config_enter, allowing their manipulation of the vdev_disk to race with vdev_disk_open.
Naïvely, the fix could be to simply add calls to:

spa_config_enter(vd->vdev_spa, SCL_STATE_ALL, FTAG, RW_WRITER);

in those functions. It is not clear to me, however, that this is the right thing to do when it is racing with vdev_reopen.
In any case, vdev_disk_close would do well to call:

VERIFY3S(spa_config_held(vd->vdev_spa, SCL_STATE_ALL, RW_WRITER), ==, SCL_STATE_ALL);
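Sketched as a patch, the naive approach would look roughly like the following. This is only an illustration of where the calls would go; the surrounding context lines and the exact callback bodies are assumptions, not the actual illumos source.

```diff
--- a/usr/src/uts/common/fs/zfs/vdev_disk.c
+++ b/usr/src/uts/common/fs/zfs/vdev_disk.c
@@ vdev_disk_off_notify
 	vdev_t *vd = (vdev_t *)arg;
 	...
+	/* Serialize against vdev_disk_open() called via vdev_reopen(). */
+	spa_config_enter(vd->vdev_spa, SCL_STATE_ALL, FTAG, RW_WRITER);
 	vdev_disk_close(vd);
+	spa_config_exit(vd->vdev_spa, SCL_STATE_ALL, FTAG);
```

The same pattern would apply in vdev_disk_off_finalize. As noted below, this interacts badly with threads that already hold the config lock as reader.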
This note was written as tests were running, in anticipation that all would go well. The zfs tests initially led to memory exhaustion on a too-small VM. The dtrace watcher script shows that the zfs tests never exercised this code path. The loop at the end of this note has been shown to bring instability to both debug and non-debug kernels. The general pattern is that spa_sync() holds a read lock when vdev_disk_off_finalize() tries to take the write lock. I suspect that something in the vdev_disk_off_finalize() stack holds a lock that some zio needs.
I added calls to spa_config_enter() and spa_config_exit() to vdev_disk_off_notify() and vdev_disk_off_finalize(), as I suggested in my initial analysis. There are legitimate callers of vdev_disk_close() that do not hold the SCL_STATE_ALL set of locks, so the VERIFY() I suggested is inappropriate.
During all testing I have had the following dtrace script running to watch for times when this code path is hit.
#! /usr/sbin/dtrace -s

vdev_disk_off_notify:entry,
vdev_disk_off_finalize:entry
{
	self->interest = 1;
}

ldi_ev_get_type:return
/self->interest/
{
	printf("%Y %s", walltimestamp, stringof(arg1));
	stack();
	self->interest = 0;
}
The ZFS test suite on debug is underway. This space will be updated.
I found the easiest way to trigger this code path was with iscsi. Offlining an LU at the target can trigger this code path.
In the setup described below, the target and initiator are on two different systems. While it may be possible to have them on the same system, this could lead to same-kernel zfs-on-zfs deadlocks.
The iscsi target and initiator settings do not survive reboot on SmartOS. After a reboot, the svcadm, itadm, stmfadm, and iscsiadm setup commands will need to be repeated. After they are executed, any pool that previously existed on the LUs (and was not destroyed) may be imported.
[root@cn2 (coal-1) ~]# svcadm enable -r iscsi/target
[root@cn2 (coal-1) ~]# itadm create-target
Target iqn.2010-08.org.illumos:02:55bbfe58-53aa-4921-bae5-eda0ea7e2f33 successfully created
[root@cn2 (coal-1) ~]# zfs create zones/zfstest
[root@cn2 (coal-1) ~]# zfs set compression=on zones/zfstest
[root@cn2 (coal-1) ~]# zfs create -V 1g zones/zfstest/lun0
[root@cn2 (coal-1) ~]# zfs create -V 1g zones/zfstest/lun1
[root@cn2 (coal-1) ~]# zfs create -V 1g zones/zfstest/lun2
[root@cn2 (coal-1) ~]# stmfadm create-lu /dev/zvol/rdsk/zones/zfstest/lun0
Logical unit created: 600144F00C9D700000005DF0EC0B0001
[root@cn2 (coal-1) ~]# stmfadm create-lu /dev/zvol/rdsk/zones/zfstest/lun1
Logical unit created: 600144F00C9D700000005DF0EC0D0002
[root@cn2 (coal-1) ~]# stmfadm create-lu /dev/zvol/rdsk/zones/zfstest/lun2
Logical unit created: 600144F00C9D700000005DF0EC0E0003
[root@cn2 (coal-1) ~]# stmfadm list-lu
LU Name: 600144F00C9D700000005DF0EC0B0001
LU Name: 600144F00C9D700000005DF0EC0D0002
LU Name: 600144F00C9D700000005DF0EC0E0003
[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0B0001
[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0D0002
[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0E0003
[root@cn2 (coal-1) ~]# stmfadm list-lu | awk '{print "stmfadm add-view", $NF}' | sh -ex
+ stmfadm add-view 600144F00C9D700000005DF0EC0B0001
+ stmfadm add-view 600144F00C9D700000005DF0EC0D0002
+ stmfadm add-view 600144F00C9D700000005DF0EC0E0003
[root@cn2 (coal-1) ~]# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
	inet 127.0.0.1 netmask ff000000
e1000g0: flags=1001004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4,FIXEDMTU> mtu 1500 index 2
	inet 10.99.99.40 netmask ffffff00 broadcast 10.99.99.255
	ether 52:54:0:48:bb:8c
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
	inet6 ::1/128
[root@cn3 (coal-1) ~]# svcadm enable -rs iscsi/initiator
[root@cn3 (coal-1) ~]# iscsiadm modify discovery -t enable
[root@cn3 (coal-1) ~]# iscsiadm add discovery-address 10.99.99.40
[root@cn3 (coal-1) ~]# devfsadm
Could not open /etc/saf/zsmon/_pmtab
[root@cn3 (coal-1) ~]# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0t600144F00C9D700000005DF0EC0B0001d0 <SUN-COMSTAR-1.0 cyl 1022 alt 2 hd 64 sec 32>
          /scsi_vhci/disk@g600144f00c9d700000005df0ec0b0001
       1. c0t600144F00C9D700000005DF0EC0D0002d0 <SUN-COMSTAR-1.0 cyl 1022 alt 2 hd 64 sec 32>
          /scsi_vhci/disk@g600144f00c9d700000005df0ec0d0002
       2. c0t600144F00C9D700000005DF0EC0E0003d0 <SUN-COMSTAR-1.0 cyl 1022 alt 2 hd 64 sec 32>
          /scsi_vhci/disk@g600144f00c9d700000005df0ec0e0003
       3. c1d0 <QEMU HARDDISK=QM00001-QM00001-0001-40.00GB>
          /pci@0,0/pci-ide@1,1/ide@0/cmdk@0,0
       4. c2t0d0 <Virtio-Block Device-0000-1.00GB>
          /pci@0,0/pci1af4,2@9/blkdev@0,0
       5. c3t0d0 <Virtio-Block Device-0000-1.00GB>
          /pci@0,0/pci1af4,2@a/blkdev@0,0
Specify disk (enter its number): ^D
[root@cn3 (coal-1) ~]# zpool create iscsi \
    mirror c0t600144F00C9D700000005DF0EC0B0001d0 c0t600144F00C9D700000005DF0EC0D0002d0 \
    spare c0t600144F00C9D700000005DF0EC0E0003d0
The initiator system is running the 20191211T012722Z debug build. I use four windows:
1. On the initiator CN, run ./watch-OS-7920 (the dtrace script mentioned above)
2. On the initiator CN, run while true; do dd if=/dev/zero of=/iscsi/file bs=1024k count=100; done
3. On the initiator CN, run other commands
4. On the target CN, disable and enable an LU
Start the watcher in window 1 and the dd loop in window 2. Wait for dd to report stats at least once.
In window 4 (target system), offline one of the LUs that makes up the mirror.
[root@cn2 (coal-1) ~]# stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002
In window 1, notice that the changed code path was hit:
[root@cn3 (coal-1) /var/tmp/OS-7920]# ./watch-OS-7920
dtrace: script './watch-OS-7920' matched 3 probes
CPU     ID                    FUNCTION:NAME
  0  20147          ldi_ev_get_type:return 2019 Dec 11 13:28:41 LDI:EVENT:OFFLINE
              zfs`vdev_disk_off_notify+0x37
              genunix`ldi_invoke_notify+0x11a
              genunix`e_ddi_offline_notify+0x8b
              genunix`devi_detach_node+0x89
              genunix`ndi_devi_offline+0x13d
              genunix`i_mdi_pi_state_change+0x5d9
              genunix`mdi_pi_offline+0x3d
              iscsi`iscsi_lun_offline+0xa6
              iscsi`iscsi_lun_destroy+0x43
              iscsi`iscsi_sess_reportluns+0x3cc
              iscsi`iscsi_sess_enumeration+0xb6
              genunix`taskq_thread+0x315
              unix`thread_start+0xb

  3  20147          ldi_ev_get_type:return 2019 Dec 11 13:28:43 LDI:EVENT:OFFLINE
              zfs`vdev_disk_off_finalize+0x3a
              genunix`ldi_invoke_finalize+0x132
              genunix`e_ddi_offline_finalize+0x79
              genunix`devi_detach_node+0x1d7
              genunix`ndi_devi_offline+0x13d
              genunix`i_mdi_pi_state_change+0x5d9
              genunix`mdi_pi_offline+0x3d
              iscsi`iscsi_lun_offline+0xa6
              iscsi`iscsi_lun_destroy+0x43
              iscsi`iscsi_sess_reportluns+0x3cc
              iscsi`iscsi_sess_enumeration+0xb6
              genunix`taskq_thread+0x315
              unix`thread_start+0xb
In window 3, observe that the spare was added to the mirror.
[root@cn3 (coal-1) ~]# zpool status
  pool: iscsi
 state: DEGRADED
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
  scan: resilvered 3.15M in 0 days 00:00:04 with 0 errors on Wed Dec 11 13:28:52 2019
config:

	NAME                                         STATE     READ WRITE CKSUM
	iscsi                                        DEGRADED     0     0     0
	  mirror-0                                   DEGRADED     0     0     0
	    c0t600144F00C9D700000005DF0EC0B0001d0    ONLINE       0     0     0
	    spare-1                                  DEGRADED     0     0     0
	      c0t600144F00C9D700000005DF0EC0D0002d0  REMOVED      0     0     0
	      c0t600144F00C9D700000005DF0EC0E0003d0  ONLINE       0     0     0
	spares
	  c0t600144F00C9D700000005DF0EC0E0003d0      INUSE     currently in use

errors: No known data errors
...
In window 4, online the LU:
[root@cn2 (coal-1) ~]# stmfadm online-lu 600144F00C9D700000005DF0EC0D0002
In window 3, observe that the pool returns to its initial configuration:
[root@cn3 (coal-1) ~]# zpool status
  pool: iscsi
 state: ONLINE
  scan: resilvered 3.15M in 0 days 00:00:04 with 0 errors on Wed Dec 11 13:28:52 2019
config:

	NAME                                         STATE     READ WRITE CKSUM
	iscsi                                        ONLINE       0     0     0
	  mirror-0                                   ONLINE       0     0     0
	    c0t600144F00C9D700000005DF0EC0B0001d0    ONLINE       0     0     0
	    spare-1                                  ONLINE       0     0     0
	      c0t600144F00C9D700000005DF0EC0D0002d0  ONLINE       0     0     0
	      c0t600144F00C9D700000005DF0EC0E0003d0  ONLINE       0     0     0
	spares
	  c0t600144F00C9D700000005DF0EC0E0003d0      INUSE     currently in use

errors: No known data errors
...
[root@cn3 (coal-1) ~]# zpool status
  pool: iscsi
 state: ONLINE
  scan: resilvered 333K in 0 days 00:00:01 with 0 errors on Wed Dec 11 13:29:17 2019
config:

	NAME                                       STATE     READ WRITE CKSUM
	iscsi                                      ONLINE       0     0     0
	  mirror-0                                 ONLINE       0     0     0
	    c0t600144F00C9D700000005DF0EC0B0001d0  ONLINE       0     0     0
	    c0t600144F00C9D700000005DF0EC0D0002d0  ONLINE       0     0     0
	spares
	  c0t600144F00C9D700000005DF0EC0E0003d0    AVAIL

errors: No known data errors
...
After doing this once manually, I put the offline/online into a loop:
[root@cn2 (coal-1) ~]# (set -ex; while true; do stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002; sleep 15; stmfadm online-lu 600144F00C9D700000005DF0EC0D0002; sleep 45; done)
+ true
+ stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 15
+ stmfadm online-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 45
+ true
+ stmfadm offline-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 15
+ stmfadm online-lu 600144F00C9D700000005DF0EC0D0002
+ sleep 45
...
Also, I watched the pool cycle through its various stages of self-healing with:
[root@cn3 (coal-1) ~]# while true; do date; zpool status iscsi; sleep 10; done
Multiple systems in a "we just repaired hardware" state have hit NULL pointer dereferences in zvol_dumpio_vdev. In one instance this led to a panic loop as described in OPS-4518; in another, one panic was followed by a panic loop that looks like the one described in this ticket.
OPS-4518 had a stack that looked like:
JA701916 ttyb login: 2018-08-30T12:39:2.778618+00:00 JA701916 scsi: [ID 107833 kern.warning] WARNING: /pci@cd,0/pci8086,2030@0/pci15d9,808@0/iport@8/disk@w5000cca02dc51559,0 (sd8):#012#011Unable to clean up memory because of pending I/O#012

panic[cpu0]/thread=fffffb04105eec40:
BAD TRAP: type=e (#pf Page fault) rp=fffffb04105ee440 addr=10 occurred in module "zfs" due to a NULL pointer dereference

fsflush:
#pf Page fault
Bad kernel fault at addr=0x10
pid=3, pc=0xfffffffff7e2728f, sp=0xfffffb04105ee530, eflags=0x10206
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 3606f8<smap,smep,osxsav,pcide,xmme,fxsr,pge,mce,pae,pse,de>
cr2: 10 cr3: 1e000000 cr8: 0
	rdi: fffffdc3e5d2b800 rsi: fffffafa95e61000 rdx:             1000
	rcx:       221dcd1000  r8:               c0  r9:                0
	rax:                0 rbx: fffffdc3e5d2b800 rbp: fffffb04105ee5c0
	r10: fffffffffb870be8 r11:                0 r12:                0
	r13:                0 r14:                0 r15:       221dcd1000
	fsb:                0 gsb: fffffffffbc4c000  ds:               4b
	 es:               4b  fs:                0  gs:              1c3
	trp:                e err:                0 rip: fffffffff7e2728f
	 cs:               30 rfl:            10206 rsp: fffffb04105ee530
	 ss:               38

fffffb04105ee330 unix:real_mode_stop_cpu_stage2_end+bc1c ()
fffffb04105ee430 unix:trap+15b1 ()
fffffb04105ee440 unix:_cmntrap+e6 ()
fffffb04105ee5c0 zfs:zvol_dumpio_vdev+10f ()
fffffb04105ee660 zfs:zvol_dumpio_vdev+198 ()
fffffb04105ee700 zfs:zvol_dumpio+133 ()
fffffb04105ee7a0 zfs:zvol_strategy+2e5 ()
fffffb04105ee7d0 genunix:bdev_strategy+61 ()
fffffb04105ee830 specfs:spec_startio+8e ()
fffffb04105ee8e0 specfs:spec_putapage+de ()
fffffb04105ee990 specfs:spec_putpage+1dd ()
fffffb04105eea10 genunix:fop_putpage+4c ()
fffffb04105eea90 genunix:fsflush_do_pages+4a2 ()
fffffb04105eeb30 genunix:fsflush+49a ()
fffffb04105eeb40 unix:thread_start+8 ()

dumping to /dev/zvol/dsk/zones/dump, offset 65536, content: kernel
The other panic on 20190228T233723Z has the same stack. For the frames in the zfs module, the offsets are the same.
For the 20190228 release, zvol_dumpio_vdev is:
> ::status
debugging live kernel (64-bit) on cn4
operating system: 5.11 joyent_20190228T233723Z (i86pc)
git branch: release-20190228
git rev: 7895a719812f7049054cb9b3fb04d28893f0af16
image uuid: (not set)
> zvol_dumpio_vdev::dis
...
zvol_dumpio_vdev+0xfa:          movq   -0x78(%rbp),%rdx
zvol_dumpio_vdev+0xfe:          sbbl   %r8d,%r8d
zvol_dumpio_vdev+0x101:         movq   -0x70(%rbp),%rsi
zvol_dumpio_vdev+0x105:         movq   %r15,%rcx
zvol_dumpio_vdev+0x108:         andl   $0xc0,%r8d
zvol_dumpio_vdev+0x10f:         movq   0x10(%rax),%rdi
zvol_dumpio_vdev+0x113:         addl   $0x40,%r8d
zvol_dumpio_vdev+0x117:         call   -0x285dc <vdev_disk_ldi_physio>
...
That corresponds to this:
1173 	} else {
1174 		dvd = vd->vdev_tsd;
1175 		ASSERT3P(dvd, !=, NULL);
1176 		return (vdev_disk_ldi_physio(dvd->vd_lh, addr, size,
1177 		    offset, doread ? B_READ : B_WRITE));
1178 	}
From this we can gather that dvd is NULL. We should be safe using dvd because zvol_dumpio() took a read lock at line 1208. I have looked to verify that there could not be a race with ddi_in_panic(): if it returns true, all CPUs other than the one running the panic thread are disabled. The panics described in this comment are always in an fsflush() thread that is not panicking.
1207 	if (!ddi_in_panic())
1208 		spa_config_enter(spa, SCL_STATE, FTAG, RW_READER);
1209
1210 	vd = vdev_lookup_top(spa, DVA_GET_VDEV(&ze->ze_dva));
1211 	offset += DVA_GET_OFFSET(&ze->ze_dva);
1212 	error = zvol_dumpio_vdev(vd, addr, offset, DVA_GET_OFFSET(&ze->ze_dva),
1213 	    size, doread, isdump);
1214
1215 	if (!ddi_in_panic())
1216 		spa_config_exit(spa, SCL_STATE, FTAG);
This leads us back to "what could modify vd->vdev_tsd?" We get back to one of the stars of this bug, vdev_disk_free().
  94 static void
  95 vdev_disk_free(vdev_t *vd)
  96 {
 ...
 113 	kmem_free(dvd, sizeof (vdev_disk_t));
 114 	vd->vdev_tsd = NULL;
 115 }
vdev_disk_free() is called only in an error path of vdev_disk_open() and by vdev_disk_close(). Notably, vdev_disk_off_notify() does not hold a lock when it calls vdev_disk_close().
Therefore, this bug is almost certainly the root cause of the panic loop seen in OPS-4518.
With a plan to commit the fix into illumos, I switched over to using OmniOS with current illumos-gate. There I've hit a couple of panics different from what I've hit on SmartOS. I've hit one in abd_free_linear many times but have not been able to confidently point my finger at this bug as the cause. See illumos#12076.
Another panic, however, I can confidently say is caused by this bug.
> ::status
debugging crash dump vmcore.7 (64-bit) from omni-2
operating system: 5.11 omni-physio-0-g59a9ce0e70 (i86pc)
build version: gfx-drm - heads/master-0-gbdc58b1-dirty
image uuid: b1d960bc-afbe-43c0-eea2-a975cb6312b2
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0008a99710 addr=ffffff094ae64778
dump content: kernel pages only
> $C
ffffff0008a99820 bdev_strategy+0x26(ffffff029ee775c0)
ffffff0008a99850 ldi_strategy+0x47(ffffff03086d9600, ffffff029ee775c0)
ffffff0008a998c0 vdev_disk_io_start+0x254(ffffff02733b45b8)
ffffff0008a99920 zio_vdev_io_start+0xdf(ffffff02733b45b8)
ffffff0008a99960 zio_execute+0xf5(ffffff02733b45b8)
ffffff0008a99990 zio_nowait+0x87(ffffff02733b45b8)
ffffff0008a999d0 vdev_queue_io_done+0x92(ffffff030f504648)
ffffff0008a99a10 zio_vdev_io_done+0xc8(ffffff030f504648)
ffffff0008a99a50 zio_execute+0xf5(ffffff030f504648)
ffffff0008a99b00 taskq_thread+0x315(ffffff03086e0ea0)
ffffff0008a99b10 thread_start+0xb()
>
This was purportedly fixed in OS-7797, but my workspace has that fix; it seems to have just tightened the race. Here we can see that although the OS-7797 fix is present (failure of ldi_strategy() now calls zio_interrupt()), we still ran into the disk being offline and vd_lh being NULL.
> vdev_disk_io_start+0x254::dis
...
vdev_disk_io_start+0x24f:       call   +0x442223c <ldi_strategy>
vdev_disk_io_start+0x254:       testl  %eax,%eax
vdev_disk_io_start+0x256:       je     -0xea    <vdev_disk_io_start+0x172>
vdev_disk_io_start+0x25c:       jmp    -0x101   <vdev_disk_io_start+0x160>
...
> vdev_disk_io_start+0x160::dis
...
vdev_disk_io_start+0x160:       movl   $0x6,0x31c(%rbx)
vdev_disk_io_start+0x16a:       movq   %rbx,%rdi
vdev_disk_io_start+0x16d:       call   -0x14cb2 <zio_interrupt>
...
> ffffff02733b45b8::print zio_t io_vd | ::print vdev_t vdev_tsd | ::print vdev_disk_t
{
    vd_devid = 0
    vd_minor = 0
    vd_lh = 0
    vd_ldi_cbs = {
        list_size = 0x18
        list_offset = 0
        list_head = {
            list_next = 0xffffff026e2ad008
            list_prev = 0xffffff18277fc4b8
        }
    }
    vd_ldi_offline = 0x1 (B_TRUE)
}
That is:
 893 static void
 894 vdev_disk_io_start(zio_t *zio)
 895 {
 896 	vdev_t *vd = zio->io_vd;
 897 	vdev_disk_t *dvd = vd->vdev_tsd;
 898 	unsigned long trim_flags = 0;
 899 	vdev_buf_t *vb;
 900 	struct dk_callback *dkc;
 901 	buf_t *bp;
 902 	int error;
 903
 904 	/*
 905 	 * If the vdev is closed, it's likely in the REMOVED or FAULTED state.
 906 	 * Nothing to be done here but return failure.
 907 	 */
 908 	if (dvd == NULL || (dvd->vd_ldi_offline && dvd->vd_lh == NULL)) {
 909 		zio->io_error = ENXIO;
 910 		zio_interrupt(zio);
 911 		return;
 912 	}

	// KAPOW! vdev_disk_off_notify() struck from another thread

1024 	/*
1025 	 * In general we would expect ldi_strategy() to return non-zero only
1026 	 * because of programming errors, but we've also seen this fail shortly
1027 	 * after a disk dies.
1028 	 */
1029 	if (ldi_strategy(dvd->vd_lh, bp) != 0) {
1030 		zio->io_error = ENXIO;
1031 		zio_interrupt(zio);
1032 	}
I won't have time to work on this in the near future, so I'm transferring it over to @accountid:62562f1f9e7380006fb2e85b. In the spirit of upstream first, I've been working this as illumos#4454.
I've also seen WARNING: Page83 data not standards compliant. I've not had time to fully debug this yet. If manta will be using iscsi, we should get to the bottom of this; I tend to hit it in the first 200 or so iterations of an iscsi lun going offline. If we have iscsi target servers with 34 luns, does that mean that every 6th reboot of one of those boxes, one of the initiator boxes will suffer the same fate? But maybe my fix is inducing this. A dump is being uploaded to /mgerdts/public/OS-7920/vmdump.20

Given all the issues trying to create the issue w/ iSCSI volumes, since I already had a SmartOS VM set up with volumes for running the zfs test suite, I booted the VM w/ a current build of master to see if I could induce errors to force the issue.
First creating the pool:
[root@zfs-test ~]# diskinfo
TYPE    DISK                    VID      PID              SIZE          RMV SSD
-       c2t0d0                  Virtio   Block Device     20.00 GiB     no  no
-       c3t0d0                  Virtio   Block Device     10.00 GiB     no  no
-       c4t0d0                  Virtio   Block Device     10.00 GiB     no  no
-       c5t0d0                  Virtio   Block Device     10.00 GiB     no  no
[root@zfs-test ~]# zpool create testpool mirror c3t0d0 c4t0d0 spare c5t0d0
Unfortunately, toggling the readonly property of a zvol only takes effect on first open; however, the zvol_strategy function always checks the zv_flags field on write I/Os to see if the ZVOL_RDONLY flag is set. So the next thing is to obtain the address of the zvol_state_t for one of the mirrors. This was done by performing a read in the VM:
[root@zfs-test ~]# dd if=/dev/rdsk/c4t0d0 of=/dev/null count=1
1+0 records in
1+0 records out
512 bytes transferred in 0.000482 secs (1.01MB/sec)
At the same time, running the following simple dtrace command in the hypervisor (i.e. the actual SmartOS instance running on the raw hardware):
[root@bob ~]# dtrace -n 'zfsdev_get_soft_state:return { zv = (zvol_state_t *)arg1; printf("\n0x%llx %s\n", arg1, zv->zv_name) }'
dtrace: description 'zfsdev_get_soft_state:return ' matched 1 probe
CPU     ID                    FUNCTION:NAME
 10  71634      zfsdev_get_soft_state:return
0xfffffeb1b416f000 zones/8e5e4967-3c9e-ed13-bb8a-bbbecc5502d9/disk2
Now to create some data so there'll be something to resilver:
[root@zfs-test ~]# dd if=/dev/urandom of=/testpool/foo count=10000
10000+0 records in
10000+0 records out
5120000 bytes transferred in 0.451938 secs (10.8MB/sec)
Unfortunately, I then noticed that bhyve uses the character device when using zvols, not the block device (so zvol_strategy is never called). However, after making a small patch and booting the hypervisor with it, I was able to induce a replacement, though it didn't cause the device to go offline.
Trying a slightly different approach: bhyve can emulate an NVMe device (thanks to andyf for pointing that out!).
I created a virtual NVMe device for my test vm:
[root@bob ~]# mkfile 10g /zones/8e5e4967-3c9e-ed13-bb8a-bbbecc5502d9/root/nvme.img
[root@bob ~]# echo '{ "bhyve_extra_opts": "-s 0:5,nvme,/nvme.img,ser=12345" }' | vmadm update 8e5e4967-3c9e-ed13-bb8a-bbbecc5502d9
Verified it was visible:
[root@zfs-test ~]# diskinfo
TYPE    DISK                    VID      PID              SIZE          RMV SSD
NVME    c1t1d0                  NVMe     bhyve-NVMe       10.00 GiB     no  yes
-       c3t0d0                  Virtio   Block Device     20.00 GiB     no  no
-       c4t0d0                  Virtio   Block Device     10.00 GiB     no  no
-       c5t0d0                  Virtio   Block Device     10.00 GiB     no  no
-       c6t0d0                  Virtio   Block Device     10.00 GiB     no  no
I then made a small patch to the nvme driver to allow for detach while in use (note: once detached, it requires a reboot of the VM to reuse the device -- this was strictly for testing):
diff --git a/usr/src/uts/common/io/nvme/nvme.c b/usr/src/uts/common/io/nvme/nvme.c
index 66ec51c209..ec079d79ff 100644
--- a/usr/src/uts/common/io/nvme/nvme.c
+++ b/usr/src/uts/common/io/nvme/nvme.c
@@ -4518,6 +4518,12 @@ nvme_ioctl_detach(nvme_t *nvme, int nsid, nvme_ioctl_t *nioc, int mode,
 	if (nvme->n_ns[nsid - 1].ns_ignore)
 		return (0);
 
+	rv = ndi_devi_offline(nvme->n_dip, NDI_DEVI_REMOVE);
+	if (rv != DDI_SUCCESS)
+		return (EIO);
+
+	nvme_remove_callback(nvme->n_dip, 0, nvme, NULL);
+
 	rv = bd_detach_handle(nvme->n_ns[nsid - 1].ns_bd_hdl);
 	if (rv != DDI_SUCCESS)
 		rv = EBUSY;
It fails with EIO (which is fine for this test), but still invokes the callbacks we want:
[root@zfs-test ~]# dtrace -n 'ldi_invoke_notify:entry,vdev_disk_off_notify:entry {}'
dtrace: description 'ldi_invoke_notify:entry,vdev_disk_off_notify:entry ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  2  20981              ldi_invoke_notify:entry
  2  52638          vdev_disk_off_notify:entry
  2  20981              ldi_invoke_notify:entry
Now we can make use of dtrace to recreate the panic, then test the fix.