OS-6602: ZFS not detecting faulty spares in a timely manner

Details

Issue Type: Bug
Priority: 4 - Normal
Status: Resolved
Created at: 2018-02-10T00:07:03.396Z
Updated at: 2019-10-14T17:41:53.238Z

People

Created by: Michael Hicks
Reported by: Michael Hicks
Assigned to: Kody Kantor [X]

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-02-07T15:11:11.346Z)

Fix Versions

2019-02-14 Liz Lemon (Release Date: 2019-02-14)

Related Issues

Related Links

Description

An audit detected bad spares that ZFS is unaware of.

While investigating a disparity in disk count between CNAPI and `zpool status`, I found that on this system the disk does not show up in `diskinfo -cH` at all, it is not faulted in `zpool status` output, and `/var/adm/messages` is full of "drive offline" messages for it spanning several days:

First and last "drive offline" entries in /var/adm/messages:

2018-02-08T16:01:00.086755+00:00
2018-02-10T00:00:00.151602+00:00
[mhicks@headnode (ap-southeast-1b) ~]$ sdc-oneachnode -n MSA08637 'grep drive offline /var/adm/messages | head -n2'
=== Output from 00000000-0000-0000-0000-0cc47ade88e6 (MSA08637):
/var/adm/messages:2018-02-08T16:01:00.086755+00:00 MSA08637 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,6f06@2,2/pci15d9,808@0/iport@ff/disk@w5000cca2606aabd5,0 (sd5):#012#011drive offline#012

[mhicks@headnode (ap-southeast-1b) ~]$ sdc-oneachnode -n MSA08637 'tail  /var/adm/messages | grep "drive offline"'
=== Output from 00000000-0000-0000-0000-0cc47ade88e6 (MSA08637):
...
2018-02-10T00:00:00.095134+00:00 MSA08637 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,6f06@2,2/pci15d9,808@0/iport@ff/disk@w5000cca2606aabd5,0 (sd5):#012#011drive offline#012
2018-02-10T00:00:00.151602+00:00 MSA08637 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,6f06@2,2/pci15d9,808@0/iport@ff/disk@w5000cca2606aabd5,0 (sd5):#012#011drive offline#012

iostat shows:

  s/w h/w trn tot device
    0   0   0   0 lofi1
    0   0   0   0 ramdisk1
    0   0   0   0 c2t0d0
    0   0   0   0 c1t5000CCA0496F7689d0
    0   0   0   0 c1t5000CCA2606ABF25d0
    0   0 111197 111197 c1t5000CCA2606AABD5d0
    0   0   3   3 c1t5000CCA2606A5F51d0
    0   0   3   3 c1t5000CCA2606A1DA9d0
    0   0   3   3 c1t5000CCA2606A7325d0
    0   0   3   3 c1t5000CCA2606A0FDDd0
    0   0   3   3 c1t5000CCA26069F3EDd0

Shouldn't ZFS have offlined it and swapped in a spare by now?
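
A quick way to confirm that FMA never diagnosed the drive would be to look at the fault list and the recent error telemetry on the node; a minimal sketch using stock fmadm/fmdump, not part of the session above:

# is there any open fault/suspect list for this disk? (apparently none)
fmadm faulty

# recent error telemetry recorded by fmd on the node
fmdump -e | tail -20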

Similarly on MSB09877 in EU-CENTRAL-1C

SERVLIST=MSB09877
[mhicks@headnode (eu-central-1c) ~]$ sdc-oneachnode -n "$SERVLIST" 'ZPS="$(zpool status)";  DI="$(diskinfo -cH)"; diff -wu <(echo "$ZPS"| grep c1 | awk "{print \$1}"| sort) <(echo "$DI"| awk "{print \$2}"| sort); echo "Zpool has $(echo "$ZPS"| grep -c c1) disks, diskinfo has $(echo "$DI"| wc -l) disks"; echo "$ZPS"; echo "$DI"; iostat -en'
=== Output from 00000000-0000-0000-0000-0cc47adebaac (MSB19562):
--- /dev/fd/63  Sat Feb 10 00:25:58 2018
+++ /dev/fd/62  Sat Feb 10 00:25:58 2018
@@ -33,4 +33,4 @@
 c1t5000CCA2609A1AF5d0
 c1t5000CCA2609A1DF5d0
 c1t5000CCA2609A2A31d0
-c1t5000CCA2609A4BEDd0
+c2t0d0
Zpool has 36 disks, diskinfo has       36 disks
  pool: zones
 state: ONLINE
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        zones                      ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c1t5000CCA2609055DDd0  ONLINE       0     0     0
            c1t5000CCA2609892BDd0  ONLINE       0     0     0
            c1t5000CCA2609892D5d0  ONLINE       0     0     0
            c1t5000CCA260989359d0  ONLINE       0     0     0
            c1t5000CCA26098965Dd0  ONLINE       0     0     0
            c1t5000CCA2609896C5d0  ONLINE       0     0     0
            c1t5000CCA260989BF9d0  ONLINE       0     0     0
            c1t5000CCA26098D7FDd0  ONLINE       0     0     0
            c1t5000CCA26098E5F9d0  ONLINE       0     0     0
            c1t5000CCA26098FE65d0  ONLINE       0     0     0
            c1t5000CCA26098FEB9d0  ONLINE       0     0     0
          raidz2-1                 ONLINE       0     0     0
            c1t5000CCA2609909B9d0  ONLINE       0     0     0
            c1t5000CCA2609909F9d0  ONLINE       0     0     0
            c1t5000CCA2609909FDd0  ONLINE       0     0     0
            c1t5000CCA260990AC5d0  ONLINE       0     0     0
            c1t5000CCA26099123Dd0  ONLINE       0     0     0
            c1t5000CCA2609930E5d0  ONLINE       0     0     0
            c1t5000CCA26099330Dd0  ONLINE       0     0     0
            c1t5000CCA260993641d0  ONLINE       0     0     0
            c1t5000CCA260993C5Dd0  ONLINE       0     0     0
            c1t5000CCA260993DA5d0  ONLINE       0     0     0
            c1t5000CCA26099452Dd0  ONLINE       0     0     0
          raidz2-2                 ONLINE       0     0     0
            c1t5000CCA260998EC1d0  ONLINE       0     0     0
            c1t5000CCA2609992D1d0  ONLINE       0     0     0
            c1t5000CCA260999739d0  ONLINE       0     0     0
            c1t5000CCA2609997F1d0  ONLINE       0     0     0
            c1t5000CCA260999C11d0  ONLINE       0     0     0
            c1t5000CCA260999C6Dd0  ONLINE       0     0     0
            c1t5000CCA260999CFDd0  ONLINE       0     0     0
            c1t5000CCA26099B4F5d0  ONLINE       0     0     0
            c1t5000CCA26099B8A1d0  ONLINE       0     0     0
            c1t5000CCA2609A1AF5d0  ONLINE       0     0     0
            c1t5000CCA2609A1DF5d0  ONLINE       0     0     0
        logs
          c1t5000CCA0496F846Dd0    ONLINE       0     0     0
        spares
          c1t5000CCA2609A2A31d0    AVAIL   
          c1t5000CCA2609A4BEDd0    AVAIL   

errors: No known data errors
SCSI    c1t5000CCA0496F846Dd0   HGST    HUSMH8010BSS204 0HWZA7PA          93.16 GiB     ---S    [0] Slot00
SCSI    c1t5000CCA260993DA5d0   HGST    HUH728080AL4204 VLJR8L7V        7452.04 GiB     ----    [0] Slot01
SCSI    c1t5000CCA26098965Dd0   HGST    HUH728080AL4204 VLJPXEWV        7452.04 GiB     ----    [0] Slot02
SCSI    c1t5000CCA2609892D5d0   HGST    HUH728080AL4204 VLJPX6LV        7452.04 GiB     ----    [0] Slot03
SCSI    c1t5000CCA26098D7FDd0   HGST    HUH728080AL4204 VLJR1UBV        7452.04 GiB     ----    [0] Slot04
SCSI    c1t5000CCA260989BF9d0   HGST    HUH728080AL4204 VLJPXUGV        7452.04 GiB     ----    [0] Slot05
SCSI    c1t5000CCA260999C11d0   HGST    HUH728080AL4204 VLJRGW5V        7452.04 GiB     ----    [0] Slot06
SCSI    c1t5000CCA2609909B9d0   HGST    HUH728080AL4204 VLJR541V        7452.04 GiB     ----    [0] Slot08
SCSI    c1t5000CCA26098E5F9d0   HGST    HUH728080AL4204 VLJR2S7V        7452.04 GiB     ----    [0] Slot09
SCSI    c1t5000CCA2609997F1d0   HGST    HUH728080AL4204 VLJRGLNV        7452.04 GiB     ----    [0] Slot10
SCSI    c1t5000CCA2609992D1d0   HGST    HUH728080AL4204 VLJRG82V        7452.04 GiB     ----    [0] Slot11
SCSI    c1t5000CCA260998EC1d0   HGST    HUH728080AL4204 VLJREZPV        7452.04 GiB     ----    [0] Slot12
SCSI    c1t5000CCA2609909F9d0   HGST    HUH728080AL4204 VLJR54KV        7452.04 GiB     ----    [0] Slot13
SCSI    c1t5000CCA2609055DDd0   HGST    HUH728080AL4204 VLJKBSLV        7452.04 GiB     ----    [0] Slot14
SCSI    c1t5000CCA260999C6Dd0   HGST    HUH728080AL4204 VLJRGWXV        7452.04 GiB     ----    [0] Slot15
SCSI    c1t5000CCA2609A1AF5d0   HGST    HUH728080AL4204 VLJRSA4V        7452.04 GiB     ----    [0] Slot16
SCSI    c1t5000CCA2609909FDd0   HGST    HUH728080AL4204 VLJR54LV        7452.04 GiB     ----    [0] Slot17
SCSI    c1t5000CCA26099330Dd0   HGST    HUH728080AL4204 VLJR7WBV        7452.04 GiB     ----    [0] Slot18
SCSI    c1t5000CCA2609892BDd0   HGST    HUH728080AL4204 VLJPX6DV        7452.04 GiB     ----    [0] Slot19
SCSI    c1t5000CCA260989359d0   HGST    HUH728080AL4204 VLJPX7NV        7452.04 GiB     ----    [0] Slot20
SCSI    c1t5000CCA26099B8A1d0   HGST    HUH728080AL4204 VLJRJT4V        7452.04 GiB     ----    [0] Slot21
SCSI    c1t5000CCA260999739d0   HGST    HUH728080AL4204 VLJRGK5V        7452.04 GiB     ----    [0] Slot22
SCSI    c1t5000CCA2609A1DF5d0   HGST    HUH728080AL4204 VLJRSJAV        7452.04 GiB     ----    [0] Slot23
SCSI    c1t5000CCA2609A2A31d0   HGST    HUH728080AL4204 VLJRTALV        7452.04 GiB     ----    [1] Slot00
SCSI    c1t5000CCA26099123Dd0   HGST    HUH728080AL4204 VLJR5PMV        7452.04 GiB     ----    [1] Slot01
SCSI    c1t5000CCA26099452Dd0   HGST    HUH728080AL4204 VLJR92TV        7452.04 GiB     ----    [1] Slot02
SCSI    c1t5000CCA260993641d0   HGST    HUH728080AL4204 VLJR82ZV        7452.04 GiB     ----    [1] Slot03
SCSI    c1t5000CCA26098FEB9d0   HGST    HUH728080AL4204 VLJR4DAV        7452.04 GiB     ----    [1] Slot04
SCSI    c1t5000CCA260999CFDd0   HGST    HUH728080AL4204 VLJRGY2V        7452.04 GiB     ----    [1] Slot05
SCSI    c1t5000CCA26098FE65d0   HGST    HUH728080AL4204 VLJR4BNV        7452.04 GiB     ----    [1] Slot06
SCSI    c1t5000CCA2609896C5d0   HGST    HUH728080AL4204 VLJPXGRV        7452.04 GiB     ----    [1] Slot07
SCSI    c1t5000CCA260993C5Dd0   HGST    HUH728080AL4204 VLJR8HLV        7452.04 GiB     ----    [1] Slot08
SCSI    c1t5000CCA2609930E5d0   HGST    HUH728080AL4204 VLJR7RXV        7452.04 GiB     ----    [1] Slot09
SCSI    c1t5000CCA260990AC5d0   HGST    HUH728080AL4204 VLJR566V        7452.04 GiB     ----    [1] Slot10
SCSI    c1t5000CCA26099B4F5d0   HGST    HUH728080AL4204 VLJRJJKV        7452.04 GiB     ----    [1] Slot11
USB     c2t0d0  Kingston        DataTraveler 2.0        -         14.44 GiB     ??R-    -
  ---- errors --- 
  s/w h/w trn tot device
    0   0   0   0 lofi1
    0   0   0   0 ramdisk1
    0   0   0   0 c2t0d0
    0   0   0   0 c1t5000CCA0496F846Dd0
    0   0   1   1 c1t5000CCA260993DA5d0
    0   0   0   0 c1t5000CCA26098965Dd0
    0   0   0   0 c1t5000CCA2609892D5d0
    0   0   0   0 c1t5000CCA26098D7FDd0
    0   0   0   0 c1t5000CCA260989BF9d0
    0   0   0   0 c1t5000CCA260999C11d0
    0   0 114830 114830 c1t5000CCA2609A4BEDd0
    0   0   1   1 c1t5000CCA2609909B9d0
    0   0   0   0 c1t5000CCA26098E5F9d0
    0   0   0   0 c1t5000CCA2609997F1d0
    0   0   0   0 c1t5000CCA2609992D1d0
    0   0   0   0 c1t5000CCA260998EC1d0
    0   0   0   0 c1t5000CCA2609909F9d0
    0   0   0   0 c1t5000CCA2609055DDd0
    0   0   0   0 c1t5000CCA260999C6Dd0
    0   0   0   0 c1t5000CCA2609A1AF5d0
    0   0   0   0 c1t5000CCA2609909FDd0
    0   0   0   0 c1t5000CCA26099330Dd0
    0   0   0   0 c1t5000CCA2609892BDd0
    0   0   0   0 c1t5000CCA260989359d0
    0   0   0   0 c1t5000CCA26099B8A1d0
    0   0   0   0 c1t5000CCA260999739d0
    0   0   0   0 c1t5000CCA2609A1DF5d0
    0   0   0   0 c1t5000CCA2609A2A31d0
    0   0   0   0 c1t5000CCA26099123Dd0
    0   0   0   0 c1t5000CCA26099452Dd0
    0   0   0   0 c1t5000CCA260993641d0
    0   0   0   0 c1t5000CCA26098FEB9d0
    0   0   0   0 c1t5000CCA260999CFDd0
    0   0   0   0 c1t5000CCA26098FE65d0
    0   0   0   0 c1t5000CCA2609896C5d0
    0   0   0   0 c1t5000CCA260993C5Dd0
    0   0   0   0 c1t5000CCA2609930E5d0
    0   0   0   0 c1t5000CCA260990AC5d0
    0   0   0   0 c1t5000CCA26099B4F5d0
    0   0   0   0 zones

[mhicks@headnode (eu-central-1c) ~]$ sdc-oneachnode -n "$SERVLIST" 'ZPS="$(zpool status)";  DI="$(diskinfo -cH)"; diff -wu <(echo "$ZPS"| grep c1 | awk "{print \$1}"| sort) <(echo "$DI"| awk "{print \$2}"| sort); echo "Zpool has $(echo "$ZPS"| grep -c c1) disks, diskinfo has $(echo "$DI"| wc -l) disks"; echo "$ZPS"; echo "$DI"; iostat -en'^C
[mhicks@headnode (eu-central-1c) ~]$  sdc-oneachnode -n "$SERVLIST" 'grep -c "drive offline" /var/adm/messages'
HOSTNAME              STATUS
MSB19562              62
[mhicks@headnode (eu-central-1c) ~]$ man grep
[mhicks@headnode (eu-central-1c) ~]$  sdc-oneachnode -n "$SERVLIST" 'grep -lc "drive offline" /var/adm/messages*'
=== Output from 00000000-0000-0000-0000-0cc47adebaac (MSB19562):
/var/adm/messages:66
/var/adm/messages.0:20272
/var/adm/messages.1:20272
/var/adm/messages.2:9330
/var/adm/messages.3:112

Comments

Comment by Michael Hicks
Created at 2018-02-14T20:56:53.536Z

On MSA08637, `fmdump -e` is full of:

Feb 14 12:06:00.5135 ereport.io.scsi.cmd.disk.tran   
Feb 14 12:06:01.5136 ereport.io.scsi.cmd.disk.tran   
Feb 14 12:41:00.7600 ereport.io.scsi.cmd.disk.tran   
Feb 14 12:41:01.7601 ereport.io.scsi.cmd.disk.tran   
Feb 14 12:41:16.7192 ereport.io.scsi.cmd.disk.tran   
Feb 14 12:41:17.7193 ereport.io.scsi.cmd.disk.tran   
Feb 14 15:06:01.1137 ereport.io.scsi.cmd.disk.tran   
Feb 14 15:06:02.1138 ereport.io.scsi.cmd.disk.tran   
Feb 14 18:06:00.4353 ereport.io.scsi.cmd.disk.tran   
Feb 14 18:06:01.4354 ereport.io.scsi.cmd.disk.tran
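
To tie these ereports back to a specific device, the full payload can be dumped; a hedged sketch using standard fmdump options (not from the original session). Each record's detector nvlist should carry the device path, which ought to match the w5000cca2606aabd5 target seen in /var/adm/messages:

# verbose dump of the error log, including each ereport's detector nvlist
fmdump -eV | less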

Comment by Kody Kantor [X]
Created at 2018-12-19T22:17:39.491Z
Updated at 2018-12-20T15:33:44.066Z

I've been investigating this. It looks like there are a few improvements that could be made in this area.

I created a simple pool with one mirror vdev and a single spare:

[root@coke /var/tmp/zinject]# zpool status test
  pool: test
 state: ONLINE
  scan: resilvered 13.5K in 0h0m with 0 errors on Wed Dec 19 16:48:15 2018
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    c2t1d0  ONLINE       0     0     0
	    c2t2d0  ONLINE       0     0     0
	spares
	  c2t3d0    AVAIL

ZFS I/O ereports are generated at the end of the ZIO pipeline and sent on the FM sysevent error channel. FMD picks these up and hands them to the ZFS FM decision engine (diagnosis) module, which then creates a fault associated with the ereport(s). Many ereports can be bundled into the same FM fault.
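
For reference, the modules in this path are visible on a running system; a minimal sketch using the standard fmadm command (module names as they appear on illumos, not taken from the ticket):

# list loaded fmd modules; the ZFS diagnosis engine and retire agent
# show up as zfs-diagnosis and zfs-retire
fmadm config | grep zfs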

Eventually the fault generated by the ZFS decision engine makes its way to the FM ZFS 'retire' agent. This agent is responsible for taking action on faults, usually faulting or degrading pools or vdevs and then attempting replacement with a hot spare (if available). The hot spares attached to a pool are iterated through for the replacement: if one spare fails to replace a failed child vdev, the next is tried, and so on. It looks like a spare that fails to enter a pool raises a ZFS error, but no FMA events are kicked off (I haven't looked too closely at this case yet).

The logic in the ZFS retire module does not account for 'spare' vdevs. Specifically, the logic that finds the child vdev in the pool's vdev list neglects to look in the top-level 'spare' vdev: it checks normal (raidz, mirror, non-redundant) top-level vdevs and l2arc devices, but not spares.

The ereports listed in this ticket (ereport.io.scsi.cmd.disk.tran) are generic I/O errors, so they don't come directly from the ZFS stack. If I understand the ZFS retire agent code correctly, those are received by the retire agent but not acted upon, because the retire agent can't find the vdev in the pool (it doesn't look hard enough).
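
One hedged way to see this gap on an affected node is fmd's per-module statistics: events keep arriving, but the ZFS modules never open a case for the spare.

# per-module counters: events received vs. cases opened/solved
fmstat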

Directly faulting a spare child vdev seems to do the right thing and lists the spare device as faulted:

[root@coke /var/tmp/zinject]# ./zinject -d c2t3d0 -A fault test

[root@coke /var/tmp/fminject]# zpool status test
  pool: test
 state: ONLINE
  scan: resilvered 13.5K in 0h0m with 0 errors on Wed Dec 19 21:41:57 2018
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    c2t1d0  ONLINE       0     0     0
	    c2t2d0  ONLINE       0     0     0
	spares
	  c2t3d0    FAULTED   too many errors

errors: No known data errors

These are what I think our next steps should be:


Comment by Kody Kantor [X]
Created at 2018-12-20T16:30:05.051Z
Updated at 2018-12-20T16:30:24.945Z

I ran a test to see what would happen if we tried to 'spare in' a faulted spare.

As the walkthrough below shows, a spare that is already marked FAULTED will still be spared in. This logic should also be changed: if a spare is known to be faulted, it shouldn't be used to replace a failed device. I think that if the spare had real hardware problems the replacement would fail anyway, but we should improve the logic here nonetheless.

I also discovered that ZFS won't clear errors for spare devices.

A walkthrough of the operations I used to conclude this:

[root@coke /var/tmp/fminject]# zpool status test
  pool: test
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    c2t1d0  ONLINE       0     0     0
	    c2t2d0  ONLINE       0     0     0
	spares
	  c2t3d0    AVAIL

errors: No known data errors

## fault the spare vdev.
[root@coke /var/tmp/zinject]# ./zinject -d c2t3d0 -A fault test

[root@coke /var/tmp/fminject]# zpool status test
  pool: test
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    c2t1d0  ONLINE       0     0     0
	    c2t2d0  ONLINE       0     0     0
	spares
	  c2t3d0    FAULTED   too many errors

errors: No known data errors

## inject IO errors directly into FMA for one side of the mirror vdev.
[root@coke /var/tmp/fminject]# ./fminject injection.txt

## the previously faulted spare is now in use.
[root@coke /var/tmp/fminject]# zpool status test
  pool: test
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
	repaired.
  scan: resilvered 127K in 0h0m with 0 errors on Thu Dec 20 16:19:19 2018
config:

	NAME          STATE     READ WRITE CKSUM
	test          DEGRADED     0     0     0
	  mirror-0    DEGRADED     0     0     0
	    c2t1d0    ONLINE       0     0     0
	    spare-1   DEGRADED     0     0     0
	      c2t2d0  FAULTED      0     0     0  too many errors
	      c2t3d0  ONLINE       0     0     0
	spares
	  c2t3d0      INUSE     currently in use

errors: No known data errors

## 'fix' the pool by clearing, scrubbing, and clearing again.
[root@coke /var/tmp/fminject]# zpool clear test && zpool scrub test && sleep 1 && zpool clear test

## the spare is back to 'faulted'
[root@coke /var/tmp/fminject]# zpool status test
  pool: test
 state: ONLINE
  scan: resilvered 1.50K in 0h0m with 0 errors on Thu Dec 20 16:22:10 2018
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    c2t1d0  ONLINE       0     0     0
	    c2t2d0  ONLINE       0     0     0
	spares
	  c2t3d0    FAULTED   too many errors

errors: No known data errors

## ZFS won't clear errors for hot spares, so it has to be removed and re-added.
[root@coke /var/tmp/fminject]# zpool clear test c2t3d0
cannot clear errors for c2t3d0: device is reserved as a hot spare

## Work around not being able to clear errors from spare by removing and re-adding the spare:
[root@coke /var/tmp/fminject]# zpool remove test c2t3d0
[root@coke /var/tmp/fminject]# zpool add test spare c2t3d0
[root@coke /var/tmp/fminject]# zpool status test
  pool: test
 state: ONLINE
  scan: resilvered 1.50K in 0h0m with 0 errors on Thu Dec 20 16:22:10 2018
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    c2t1d0  ONLINE       0     0     0
	    c2t2d0  ONLINE       0     0     0
	spares
	  c2t3d0    AVAIL

errors: No known data errors

Comment by Kody Kantor [X]
Created at 2019-01-24T16:41:28.711Z
Updated at 2019-01-24T16:42:17.677Z

Testing notes:


Comment by Kody Kantor [X]
Created at 2019-01-24T17:59:39.950Z
Updated at 2019-01-24T18:00:15.765Z

I ran another test as well:


Comment by Kody Kantor [X]
Created at 2019-01-24T18:29:07.747Z

I created OS-7533, which affects this change. Because of that bug we can't safely set spa_spare_poll_interval_seconds to a value less than zfs_remove_timeout (currently 15 seconds by default).

On a related note, since vdev probe failure ereports aren't sent into a SERD engine, a single ereport is enough to fault the vdev.
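
For anyone verifying the eventual fix on a live system, the two tunables mentioned above could be inspected with mdb; a hedged sketch that assumes both symbols exist under these names (use /E instead of /D if they turn out to be 64-bit):

# spare poll interval and removal timeout, in seconds
echo 'spa_spare_poll_interval_seconds/D' | mdb -k
echo 'zfs_remove_timeout/D' | mdb -k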


Comment by Kody Kantor [X]
Created at 2019-01-29T22:00:48.092Z

Testing notes continued.


Comment by Kody Kantor [X]
Created at 2019-02-06T18:22:10.131Z

I built a new OS image and ran the zfs test suite again. All the tests passed:

root@coke:/opt/zfs-tests/bin# ./zfstest
...
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_replace/setup (run as root) [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_replace/zpool_replace_001_neg (run as root) [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_replace/zpool_replace_002_neg (run as root) [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_replace/cleanup (run as root) [00:00] [PASS]
...
Results Summary
PASS	 423

Running Time:	01:20:45
Percent passed:	100.0%
Log directory:	/var/tmp/test_results/20190206T035726

Comment by Jira Bot
Created at 2019-02-06T19:35:12.966Z

illumos-joyent commit f92efa0bbe42adba606c2523638d6238b731b3f1 (branch master, by Kody A Kantor)

OS-6602 ZFS not detecting faulty spares in a timely manner
OS-7499 ZFS retire agent cannot fault inactive spares
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>