Priority: 4 - Normal
Created by: Hans Rosenfeld
Reported by: Hans Rosenfeld
Assigned to: Hans Rosenfeld
Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-09-11T14:20:14.297Z)
2019-09-12 Art Vandelay (Release Date: 2019-09-12)
On a customer system, we ran into the following warnings at boot, roughly once in ten reboots:
WARNING: /pci@14,0/pci8086,2030@0/pci1028,1f45@0 (mpt_sas0): config header request timeout
WARNING: /pci@14,0/pci8086,2030@0/pci1028,1f45@0 (mpt_sas0): Failed to get sas phy page 0 for each phy
WARNING: /pci@14,0/pci8086,2030@0/pci1028,1f45@0 (mpt_sas0): mptsas phy update failed
The system would then take hours to complete its boot and not find any disks/SSDs attached to the affected SAS HBA.
As it turns out, this is caused by a command timeout in mptsas_access_config_page(), as indicated by the first warning. The remaining warnings are a direct consequence of this failure. The timeouts for config header requests (and for the config page requests later in the function) are hardcoded to 60s.
There is an oversight in mptsas_access_config_page(): like other places in mpt_sas, it already contains code to reset the I/O controller (IOC) after a command timeout, but that code is never executed, which leaves the controller in a hung state.
This can be easily fixed, although it is still unclear what causes these timeouts in the first place. With the IOC reset done properly, the timeouts may still happen every now and then at reboot, but they won't cause any more harm than a single 60s delay at boot. All disks/SSDs are visible and usable.
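To illustrate why the missing reset matters, here is a minimal, self-contained simulation in C. It is not the mpt_sas code: sim_ioc_t, sim_send_request(), sim_reset() and sim_access_config_page() are hypothetical names invented for this sketch, and the model simply assumes that a controller which has timed out once answers nothing further until it is reset.

```c
#include <stdbool.h>

/* Hypothetical model of an I/O controller; not the mpt_sas structures. */
typedef struct {
    bool hung;      /* true after a timeout until the next reset */
    int  resets;    /* number of resets performed so far */
    int  fail_next; /* how many upcoming requests will time out */
} sim_ioc_t;

enum { SIM_OK = 0, SIM_TIMEOUT = -1 };

/* Issue one simulated request (config header or config page). */
static int sim_send_request(sim_ioc_t *ioc)
{
    if (ioc->hung)
        return SIM_TIMEOUT;     /* assumption: a hung IOC never answers */
    if (ioc->fail_next > 0) {
        ioc->fail_next--;
        ioc->hung = true;       /* the timeout leaves the IOC hung */
        return SIM_TIMEOUT;
    }
    return SIM_OK;
}

/* Simulated controller reset: clears the hung state. */
static void sim_reset(sim_ioc_t *ioc)
{
    ioc->hung = false;
    ioc->resets++;
}

/*
 * Header request followed by page request. When reset_on_timeout is
 * false, the reset path exists but is never taken, which models the
 * behaviour described in this issue.
 */
static int sim_access_config_page(sim_ioc_t *ioc, bool reset_on_timeout)
{
    bool timed_out = false;
    int rval = SIM_OK;

    if (sim_send_request(ioc) != SIM_OK) {      /* config header */
        timed_out = true;
        rval = SIM_TIMEOUT;
        goto done;
    }
    if (sim_send_request(ioc) != SIM_OK) {      /* config page */
        timed_out = true;
        rval = SIM_TIMEOUT;
    }
done:
    if (timed_out && reset_on_timeout)
        sim_reset(ioc);
    return rval;
}

/* Count how many of n consecutive page accesses succeed. */
static int sim_count_ok(sim_ioc_t *ioc, int n, bool reset_on_timeout)
{
    int ok = 0;
    for (int i = 0; i < n; i++) {
        if (sim_access_config_page(ioc, reset_on_timeout) == SIM_OK)
            ok++;
    }
    return ok;
}
```

In this model, a single timeout without the reset makes every later access fail, which corresponds to the follow-up warnings and the missing disks. With the reset, only the access that timed out is lost and all later accesses succeed.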
Testing: I manually rebooted the customer's system several times until the timeouts happened at boot. Previously, a timeout meant the system would take hours to boot and then see none of its SSDs. With these changes, the system still experienced a short delay (60s) due to the timeout, but once it finished booting, all SSDs were visible and usable and the system behaved as expected.
illumos-joyent commit 526bf199ed8900945f9dffc3041ec322a758285c (branch master, by Hans Rosenfeld)
OS-7981 mpt_sas hangs after config header request timeout
Reviewed by: Robert Mustacchi <email@example.com>
Reviewed by: Jerry Jelinek <firstname.lastname@example.org>
Approved by: Jerry Jelinek <email@example.com>