In JPC production (us-sw-1), we saw a bunch of machines hang pretty hard several times. We were able to NMI them and get several dumps. In all cases we looked at, cross-calls were on the scene on all CPUs. I believe the cross-calls were coming from DTrace.
Not surprisingly, the outages appear highly correlated with CA activity. Both initially and again after bringing CA back up following a failed workaround, the first outages started about 15 minutes after the DTrace enablings were created. When CA was disabled, the systems became stable again.
There's one dump in bh1-www0:/data/147_dumps/xc_hang2, and more coming.
Linda Derezinski commented on 2011-11-18T01:11:33.000-0500:
These are the dumps from https://hub.joyent.com/jira/browse/SWONE-26?focusedCommentId=36305&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-36305
lderezinski@BYLY9R1 (us-sw-1) /var/tmp]$ scp vmdump.* stuff@stuff.joyent.us:linda
vmdump.4 0% | | 5248 KB 30:58 ETA
Former user commented on 2011-11-18T02:59:31.000-0500:
This dump (linda/*.4) is a little different from the others.
This time, two CPUs are not at IPL 15 (usually it's just one):
> *apic_cpus,0t16::print -at apic_cpus_info_t aci_curipl
ffffff198d692b93 uchar_t aci_curipl = 0xf
ffffff198d692bd3 uchar_t aci_curipl = 0xf
ffffff198d692c13 uchar_t aci_curipl = 0xf
ffffff198d692c53 uchar_t aci_curipl = 0xf
ffffff198d692c93 uchar_t aci_curipl = 0xf
ffffff198d692cd3 uchar_t aci_curipl = 0xf
ffffff198d692d13 uchar_t aci_curipl = 0xf
ffffff198d692d53 uchar_t aci_curipl = 0xf
ffffff198d692d93 uchar_t aci_curipl = 0xf
ffffff198d692dd3 uchar_t aci_curipl = 0xf
ffffff198d692e13 uchar_t aci_curipl = 0
ffffff198d692e53 uchar_t aci_curipl = 0xf
ffffff198d692e93 uchar_t aci_curipl = 0
ffffff198d692ed3 uchar_t aci_curipl = 0xf
ffffff198d692f13 uchar_t aci_curipl = 0xf
ffffff198d692f53 uchar_t aci_curipl = 0xf
There's only one thread in xc_common() (it's usually two or three):
> ::stacks -c xc_common
THREAD STATE SOBJ COUNT
ffffff00b8a98c40 PANIC <NONE> 1
0xb
panic+0x94
apic_nmi_intr+0x7c
av_dispatch_nmivect+0x30
nmiint+0x152
xc_serv+0x12e
xc_common+0x229
xc_sync+0x46
dtrace_xcall+0x8c
dtrace_sync+0x18
dtrace_state_deadman+0x1c
cyclic_softint+0xdc
cbe_low_level+0x17
av_dispatch_softvect+0x5f
dispatch_softint+0x34
And only one dtrace enabling (it was usually more than ten):
> ::dtrace_state
ADDR MINOR PROC NAME FILE
ffffff19c9033080 5 ffffff19d4eb00b0 dtrace ffffff19d41d72c8
The last two details may make analysis of this dump easier than that of the others.
Former user commented on 2011-11-18T03:19:21.000-0500:
The dtrace xc_common() stack also received the NMI:
> $C
ffffff00b8a987c0 vpanic()
ffffff00b8a987e0 apic_nmi_intr+0x7c(0, ffffff00b8a98820)
ffffff00b8a98810 av_dispatch_nmivect+0x30(ffffff00b8a98820)
ffffff00b8a98820 nmiint+0x152()
ffffff00b8a98980 xc_serv+0x12e(0, 0)
ffffff00b8a98a10 xc_common+0x229(fffffffffb81a460, fffffffffb81a568, 0, 0, ffffff00b8a98a60, 3)
ffffff00b8a98a50 xc_sync+0x46(fffffffffb81a568, 0, 0, ffffff00b8a98a60, fffffffffb81a460)
ffffff00b8a98ab0 dtrace_xcall+0x8c(ffffffff, fffffffffb81a568, 0)
ffffff00b8a98ac0 dtrace_sync+0x18()
ffffff00b8a98af0 dtrace_state_deadman+0x1c(ffffff19c9033080)
ffffff00b8a98ba0 cyclic_softint+0xdc(ffffff1998b84b00, 0)
ffffff00b8a98bb0 cbe_low_level+0x17()
ffffff00b8a98bf0 av_dispatch_softvect+0x5f(2)
ffffff00b8a98c20 dispatch_softint+0x34(0, 0)
ffffff00b8a62a30 switch_sp_and_call+0x13()
ffffff00b8a62a60 dosoftint+0x59(ffffff00b8a62ac0)
ffffff00b8a62ab0 do_interrupt+0x114(ffffff00b8a62ac0, 0)
ffffff00b8a62ac0 _interrupt+0xba()
ffffff00b8a62bb0 i86_mwait+0xd()
ffffff00b8a62c00 cpu_idle_mwait+0xf1()
ffffff00b8a62c20 idle+0x114()
ffffff00b8a62c30 thread_start+8()
It was running xc_serv+0x12e:
> xc_serv+0x12e::dis
xc_serv+0x10a: movq %r15,%rdi
xc_serv+0x10d: movq -0x58(%rbp),%rsi
xc_serv+0x111: movq %r14,%rdx
xc_serv+0x114: xorl %eax,%eax
xc_serv+0x116: call *%r13
xc_serv+0x119: movq -0x50(%rbp),%r8
xc_serv+0x11d: movl 0x644(%r8),%eax
xc_serv+0x124: testl %eax,%eax
xc_serv+0x126: je +0x218 <xc_serv+0x344>
xc_serv+0x12c: pause
xc_serv+0x12e: movq %r12,%rdi
xc_serv+0x131: call -0x40e <xc_extract>
xc_serv+0x136: movq %rax,%rbx
xc_serv+0x139: testq %rbx,%rbx
The saved PC is just past the pause instruction, which looks like it is spinning, waiting for a message to arrive:
xc_serv(caddr_t arg1, caddr_t arg2)
{
struct machcpu *mcpup = &(CPU->cpu_m);
xc_msg_t *msg;
xc_data_t *data;
xc_msg_t *xc_waiters = NULL;
uint32_t num_waiting = 0;
xc_func_t func;
xc_arg_t a1;
xc_arg_t a2;
xc_arg_t a3;
uint_t rc = DDI_INTR_UNCLAIMED;
while (mcpup->xc_work_cnt != 0) {
rc = DDI_INTR_CLAIMED;
/*
* We may have to wait for a message to arrive.
*/
for (msg = NULL; msg == NULL;
msg = xc_extract(&mcpup->xc_msgbox)) {
/*
* Alway check for and handle a priority message.
*/
if (BT_TEST(xc_priority_set, CPU->cpu_id)) {
func = xc_priority_data.xc_func;
a1 = xc_priority_data.xc_a1;
a2 = xc_priority_data.xc_a2;
a3 = xc_priority_data.xc_a3;
XC_BT_CLEAR(xc_priority_set, CPU->cpu_id);
xc_decrement(mcpup);
func(a1, a2, a3);
if (mcpup->xc_work_cnt == 0)
return (rc);
}
/*
* wait for a message to arrive
*/
SMT_PAUSE();
}
[...]
Checking xc_work_cnt by CPU:
> ::walk cpu | ::print -at cpu_t cpu_m.xc_work_cnt
fffffffffbc30ae4 volatile uint32_t cpu_m.xc_work_cnt = 0x1
ffffff1998a07684 volatile uint32_t cpu_m.xc_work_cnt = 0x1
ffffff1998a04144 volatile uint32_t cpu_m.xc_work_cnt = 0x1
ffffff1998a02b44 volatile uint32_t cpu_m.xc_work_cnt = 0x1
ffffff1998a01644 volatile uint32_t cpu_m.xc_work_cnt = 0x1
ffffff19989f8bc4 volatile uint32_t cpu_m.xc_work_cnt = 0x1
ffffff19989f76c4 volatile uint32_t cpu_m.xc_work_cnt = 0x1
ffffff19989f60c4 volatile uint32_t cpu_m.xc_work_cnt = 0x1
ffffff19989f4b84 volatile uint32_t cpu_m.xc_work_cnt = 0x1
ffffff19989f3684 volatile uint32_t cpu_m.xc_work_cnt = 0x1
fffffffffbc3b294 volatile uint32_t cpu_m.xc_work_cnt = 0x11
ffffff1998b81b44 volatile uint32_t cpu_m.xc_work_cnt = 0x1
ffffff1998b80644 volatile uint32_t cpu_m.xc_work_cnt = 0x3
ffffff1998b7f104 volatile uint32_t cpu_m.xc_work_cnt = 0x1
ffffff1998b7bbc4 volatile uint32_t cpu_m.xc_work_cnt = 0x2
ffffff1998b7a6c4 volatile uint32_t cpu_m.xc_work_cnt = 0x1
and xc_wait_cnt:
> ::walk cpu | ::print -at cpu_t cpu_m.xc_wait_cnt
fffffffffbc30ae0 uint32_t cpu_m.xc_wait_cnt = 0
ffffff1998a07680 uint32_t cpu_m.xc_wait_cnt = 0
ffffff1998a04140 uint32_t cpu_m.xc_wait_cnt = 0
ffffff1998a02b40 uint32_t cpu_m.xc_wait_cnt = 0
ffffff1998a01640 uint32_t cpu_m.xc_wait_cnt = 0
ffffff19989f8bc0 uint32_t cpu_m.xc_wait_cnt = 0
ffffff19989f76c0 uint32_t cpu_m.xc_wait_cnt = 0
ffffff19989f60c0 uint32_t cpu_m.xc_wait_cnt = 0
ffffff19989f4b80 uint32_t cpu_m.xc_wait_cnt = 0
ffffff19989f3680 uint32_t cpu_m.xc_wait_cnt = 0
fffffffffbc3b290 uint32_t cpu_m.xc_wait_cnt = 0x10
ffffff1998b81b40 uint32_t cpu_m.xc_wait_cnt = 0
ffffff1998b80640 uint32_t cpu_m.xc_wait_cnt = 0
ffffff1998b7f100 uint32_t cpu_m.xc_wait_cnt = 0
ffffff1998b7bbc0 uint32_t cpu_m.xc_wait_cnt = 0x1
ffffff1998b7a6c0 uint32_t cpu_m.xc_wait_cnt = 0
CPU ID 10 (with the highest wait cnt) is also the CPU of the dtrace xcall and NMI panic:
> ffffff00b8a98c40::print -t kthread_t t_cpu->cpu_id
processorid_t t_cpu->cpu_id = 0xa
Linda Derezinski commented on 2011-11-18T07:51:44.000-0500:
ok added the last three:
Brendan already looked at vmdump.4
[stuff@bh1-autobuild ~/linda]$ ls -lrht
total 13236200
-rw-r--r-- 1 stuff other 1.3G Nov 17 22:21 vmdump.4
-rw-r--r-- 1 stuff other 2.0M Nov 17 23:35 unix.4
-rw-r--r-- 1 stuff other 5.6G Nov 17 23:35 vmcore.4
-rw-r--r-- 1 stuff other 1.4G Nov 18 04:07 vmdump.3
-rw-r--r-- 1 stuff other 2.0G Nov 18 04:23 vmdump.1
-rw-r--r-- 1 stuff other 1.2G Nov 18 04:41 vmdump.2
Former user commented on 2011-11-18T15:39:48.000-0500:
I've consolidated all the dumps into /home/admin/DUMPS/ on the SW-1 Headnode. The dumps are still in compressed form, so use "savecore -f vmdump.0 ." to uncompress them in the appropriate directory. Timestamps of the dumps have been preserved but the perms have been set 777 to allow everyone access.
Former user commented on 2011-11-18T20:43:01.000-0500 (edited 2011-11-18T20:45:18.000-0500):
The following (incomplete) analysis is based on vmdump.0 from CYLY9R1, which is currently on labwww:/data/147_dumps/xc_hang2/vmdump.0. In this case, most of the CPUs are at IPL 15:
> ::walk cpu | ::print cpu_t cpu_m.mcpu_pri
cpu_m.mcpu_pri = 0xf
cpu_m.mcpu_pri = 0xf
cpu_m.mcpu_pri = 0xf
cpu_m.mcpu_pri = 0xf
cpu_m.mcpu_pri = 0xf
cpu_m.mcpu_pri = 0xf
cpu_m.mcpu_pri = 0xc
cpu_m.mcpu_pri = 0xf
cpu_m.mcpu_pri = 0xf
cpu_m.mcpu_pri = 0xf
cpu_m.mcpu_pri = 0xf
cpu_m.mcpu_pri = 0xa
cpu_m.mcpu_pri = 0xf
cpu_m.mcpu_pri = 0xf
cpu_m.mcpu_pri = 0xf
cpu_m.mcpu_pri = 0xf
Since the system was basically hung, let's figure out where each CPU is. The results are below. Note that on every CPU the running thread is on the interrupt stack (see the definition of CPU_ON_INTR() in cpuvar.h), so we have to look at the interrupt stacks. We don't have "::findstack" for this, so I manually found the stacks:
cpu_id = 0
cpu_intr_actv = 0x28000
cpu_intr_stack = 0xffffff00b80d2000
> 0xffffff00b80d1ee0$C
ffffff00b80d1ef0 panic_idle+0xe()
ffffff00b80d1f70 xc_serv+0x119(0, 0)
ffffff00b80d1fc0 av_dispatch_autovect+0x7c(f0)
ffffff00b80d1ff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00b8005a60 switch_sp_and_call+0x13()
ffffff00b8005ab0 do_interrupt+0xf2(ffffff00b8005ac0, 0)
ffffff00b8005ac0 _interrupt+0xba()
ffffff00b8005bb0 i86_mwait+0xd()
ffffff00b8005c00 cpu_idle_mwait+0xf1()
ffffff00b8005c20 idle+0x114()
ffffff00b8005c30 thread_start+8()
cpu_id = 0x1
cpu_intr_actv = 0x28000
cpu_intr_stack = 0xffffff00b849c000
> 0xffffff00b849bee0$C
ffffff00b849bef0 panic_idle+0xe()
ffffff00b849bf70 xc_serv+0x119(0, 0)
ffffff00b849bfc0 av_dispatch_autovect+0x7c(f0)
ffffff00b849bff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00b845fa60 switch_sp_and_call+0x13()
ffffff00b845fab0 do_interrupt+0xf2(ffffff00b845fac0, 0)
ffffff00b845fac0 _interrupt+0xba()
ffffff00b845fbb0 i86_mwait+0xd()
ffffff00b845fc00 cpu_idle_mwait+0xf1()
ffffff00b845fc20 idle+0x114()
ffffff00b845fc30 thread_start+8()
cpu_id = 0x2
cpu_intr_actv = 0x18004
cpu_intr_stack = 0xffffff00b865d000
> ffffff1998a03b00::print cpu_t cpu_thread | ::findstack
stack pointer for thread ffffff00b8656c40: ffffff00b8656720
ffffff00b8656770 0xf0()
ffffff00b86567c0 do_interrupt+0xf2()
ffffff00b86567d0 _interrupt+0xba()
ffffff00b86568f0 do_splx+0x8d()
ffffff00b8656900 panic_idle+0xe()
ffffff00b8656980 xc_serv+0x119()
ffffff00b8656a10 xc_common+0x229()
ffffff00b8656a50 xc_sync+0x46()
ffffff00b8656ab0 dtrace_xcall+0x8c()
ffffff00b8656ac0 dtrace_sync+0x18()
ffffff00b8656af0 dtrace_state_deadman+0x1c()
ffffff00b8656ba0 cyclic_softint+0xdc()
ffffff00b8656bb0 cbe_low_level+0x17()
ffffff00b8656bf0 av_dispatch_softvect+0x5f()
ffffff00b8656c20 dispatch_softint+0x34()
ffffff00b828ba30 switch_sp_and_call+0x13()
ffffff00b828ba60 dosoftint+0x59()
ffffff00b828bab0 do_interrupt+0x114()
ffffff00b828bac0 _interrupt+0xba()
ffffff00b828bbb0 i86_mwait+0xd()
ffffff00b828bc00 cpu_idle_mwait+0xf1()
ffffff00b828bc20 idle+0x114()
ffffff00b828bc30 thread_start+8()
Note that since panic_idle() goes into an infinite loop, it's possible that we're not actually in do_splx() from there any more, but rather that we were at one point and that's still on the stack.
cpu_id = 0x3
cpu_intr_actv = 0x28000
cpu_intr_stack = 0xffffff00b8742000
> 0xffffff00b8741ee0$C
ffffff00b8741ef0 panic_idle+0xe()
ffffff00b8741f70 xc_serv+0x119(0, 0)
ffffff00b8741fc0 av_dispatch_autovect+0x7c(f0)
ffffff00b8741ff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00b8705a60 switch_sp_and_call+0x13()
ffffff00b8705ab0 do_interrupt+0xf2(ffffff00b8705ac0, 0)
ffffff00b8705ac0 _interrupt+0xba()
ffffff00b8705bb0 i86_mwait+0xd()
ffffff00b8705c00 cpu_idle_mwait+0xf1()
ffffff00b8705c20 idle+0x114()
ffffff00b8705c30 thread_start+8()
cpu_id = 0x4
cpu_intr_actv = 0x28000
cpu_intr_stack = 0xffffff00b87e9000
> 0xffffff00b87e8ee0$C
ffffff00b87e8ef0 panic_idle+0xe()
ffffff00b87e8f70 xc_serv+0x119(0, 0)
ffffff00b87e8fc0 av_dispatch_autovect+0x7c(f0)
ffffff00b87e8ff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00bcc8feb0 switch_sp_and_call+0x13()
ffffff00bcc8ff00 do_interrupt+0xf2(ffffff00bcc8ff10, 0)
ffffff00bcc8ff10 _interrupt+0xba()
cpu_id = 0x5
cpu_intr_actv = 0x28000
cpu_intr_stack = 0xffffff00b88a6000
> 0xffffff00b88a5ee0$C
ffffff00b88a5ef0 panic_idle+0xe()
ffffff00b88a5f70 xc_serv+0x119(0, 0)
ffffff00b88a5fc0 av_dispatch_autovect+0x7c(f0)
ffffff00b88a5ff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00b8869a60 switch_sp_and_call+0x13()
ffffff00b8869ab0 do_interrupt+0xf2(ffffff00b8869ac0, 0)
ffffff00b8869ac0 _interrupt+0xba()
ffffff00b8869bb0 i86_mwait+0xd()
ffffff00b8869c00 cpu_idle_mwait+0xf1()
ffffff00b8869c20 idle+0x114()
ffffff00b8869c30 thread_start+8()
cpu_id = 0x6
cpu_intr_actv = 0x18000
cpu_intr_stack = 0xffffff00b890b000
> 0xffffff00b890acc0$C
ffffff00b890adb0 panic+0x94()
ffffff00b890add0 apic_nmi_intr+0x7c(0, ffffff00b890ae10)
ffffff00b890ae00 av_dispatch_nmivect+0x30(ffffff00b890ae10)
ffffff00b890ae10 nmiint+0x152()
---> I don't think these frames are actually connected, because nmiint gets invoked directly by the processor. This is also suggested by the fact that the disassembly at xc_serv+0x12e doesn't show a function call. <--
ffffff00b890af70 xc_serv+0x12e(0, 0)
ffffff00b890afc0 av_dispatch_autovect+0x7c(f0)
ffffff00b890aff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00bd16b1b0 switch_sp_and_call+0x13()
ffffff00bd16b200 do_interrupt+0xf2(ffffff00bd16b210, 0)
ffffff00bd16b210 _interrupt+0xba()
ffffff00bd16b310 pic_update_irq+0x53()
ffffff00bd16b350 kvm_pic_set_irq+0x69()
ffffff00bd16b380 kvm_set_pic_irq+0x54()
ffffff00bd16b460 kvm_set_irq+0x116()
ffffff00bd16bc70 kvm_ioctl+0xd00()
ffffff00bd16bcb0 cdev_ioctl+0x45(3400000014, c008ae67, fffffd7fffdf9370, 202003,
ffffff4a853a1640, ffffff00bd16bd94)
ffffff00bd16bcf0 spec_ioctl+0x5a(ffffff19eaa49800, c008ae67, fffffd7fffdf9370, 202003
, ffffff4a853a1640, ffffff00bd16bd94, 0)
ffffff00bd16bd70 fop_ioctl+0x7b(ffffff19eaa49800, c008ae67, fffffd7fffdf9370, 202003,
ffffff4a853a1640, ffffff00bd16bd94, 0)
ffffff00bd16be70 ioctl+0x18e(6, c008ae67, fffffd7fffdf9370)
ffffff00bd16bf00 dtrace_systrace_syscall+0x11a(6, c008ae67, fffffd7fffdf9370,
fffffd7fffdf93cc, 4ec5c74c, 2ac0660d)
ffffff00bd16bf10 sys_syscall+0x17a()
cpu_id = 0x7
cpu_intr_actv = 0x28000
cpu_intr_stack = 0xffffff00b8970000
> 0xffffff00b896fee0$C
ffffff00b896fef0 panic_idle+0xe()
ffffff00b896ff70 xc_serv+0x119(0, 0)
ffffff00b896ffc0 av_dispatch_autovect+0x7c(f0)
ffffff00b896fff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00b8933a60 switch_sp_and_call+0x13()
ffffff00b8933ab0 do_interrupt+0xf2(ffffff00b8933ac0, 0)
ffffff00b8933ac0 _interrupt+0xba()
ffffff00b8933bb0 i86_mwait+0xd()
ffffff00b8933c00 cpu_idle_mwait+0xf1()
ffffff00b8933c20 idle+0x114()
ffffff00b8933c30 thread_start+8()
cpu_id = 0x8
cpu_intr_actv = 0x28000
cpu_intr_stack = 0xffffff00b89d5000
> 0xffffff00b89d4ee0$C
ffffff00b89d4ef0 panic_idle+0xe()
ffffff00b89d4f70 xc_serv+0x119(0, 0)
ffffff00b89d4fc0 av_dispatch_autovect+0x7c(f0)
ffffff00b89d4ff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00b8998a60 switch_sp_and_call+0x13()
ffffff00b8998ab0 do_interrupt+0xf2(ffffff00b8998ac0, 0)
ffffff00b8998ac0 _interrupt+0xba()
ffffff00b8998bb0 i86_mwait+0xd()
ffffff00b8998c00 cpu_idle_mwait+0xf1()
ffffff00b8998c20 idle+0x114()
ffffff00b8998c30 thread_start+8()
cpu_id = 0x9
cpu_intr_actv = 0x28000
cpu_intr_stack = 0xffffff00b8a3a000
> 0xffffff00b8a39ee0$C
ffffff00b8a39ef0 panic_idle+0xe()
ffffff00b8a39f70 xc_serv+0x119(0, 0)
ffffff00b8a39fc0 av_dispatch_autovect+0x7c(f0)
ffffff00b8a39ff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00b89fda60 switch_sp_and_call+0x13()
ffffff00b89fdab0 do_interrupt+0xf2(ffffff00b89fdac0, 0)
ffffff00b89fdac0 _interrupt+0xba()
ffffff00b89fdbb0 i86_mwait+0xd()
ffffff00b89fdc00 cpu_idle_mwait+0xf1()
ffffff00b89fdc20 idle+0x114()
ffffff00b89fdc30 thread_start+8()
cpu_id = 0xa
cpu_intr_actv = 0x28000
cpu_intr_stack = 0xffffff00b8a9f000
> 0xffffff00b8a9eee0$C
ffffff00b8a9eef0 panic_idle+0xe()
ffffff00b8a9ef70 xc_serv+0x119(0, 0)
ffffff00b8a9efc0 av_dispatch_autovect+0x7c(f0)
ffffff00b8a9eff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00bae1db50 switch_sp_and_call+0x13()
ffffff00bae1dba0 do_interrupt+0xf2(ffffff00bae1dbb0, 23)
ffffff00bae1dbb0 _interrupt+0xba()
ffffff00bae1de60 dtrace_interrupt_enable+2()
ffffff00bae1df00 dtrace_systrace_syscall+0x89(0, 0, fffffd7fffdf93e0, 0, 4ec5c74c,
2ac0b7b0)
ffffff00bae1df10 sys_syscall+0x17a()
cpu_id = 0xb
cpu_intr_actv = 0x18000
cpu_intr_stack = 0xffffff00b8b04000
> 0xffffff00b8b03e00$C
ffffff00b8b03ef0 panic_idle+0x27()
ffffff00b8b03f70 xc_serv+0x119(0, 0)
ffffff00b8b03fc0 av_dispatch_autovect+0x7c(f0)
ffffff00b8b03ff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00bb0392a0 switch_sp_and_call+0x13()
ffffff00bb0392f0 do_interrupt+0xf2(ffffff00bb039300, ffffff6c063d2028)
ffffff00bb039300 _interrupt+0xba()
ffffff00bb039400 vcpu_enter_guest+0x31c()
ffffff00bb039420 __vcpu_run+0x77()
ffffff00bb039460 kvm_arch_vcpu_ioctl_run+0x110()
ffffff00bb039c70 kvm_ioctl+0x11c()
ffffff00bb039cb0 cdev_ioctl+0x45(3400000015, 2000ae80, 0, 202003, ffffff4a853a1640,
ffffff00bb039d94)
ffffff00bb039cf0 spec_ioctl+0x5a(ffffff19e3102540, 2000ae80, 0, 202003,
ffffff4a853a1640, ffffff00bb039d94, 0)
ffffff00bb039d70 fop_ioctl+0x7b(ffffff19e3102540, 2000ae80, 0, 202003,
ffffff4a853a1640, ffffff00bb039d94, 0)
ffffff00bb039e70 ioctl+0x18e(e, 2000ae80, 0)
ffffff00bb039f00 dtrace_systrace_syscall+0x11a(e, 2000ae80, 0, 0, fffffd7fff2d7484,
2991ef9d)
ffffff00bb039f10 sys_syscall+0x17a()
cpu_id = 0xc
cpu_intr_actv = 0x28000
cpu_intr_stack = 0xffffff00b8b69000
> 0xffffff00b8b68ee0$C
ffffff00b8b68ef0 panic_idle+0xe()
ffffff00b8b68f70 xc_serv+0x119(0, 0)
ffffff00b8b68fc0 av_dispatch_autovect+0x7c(f0)
ffffff00b8b68ff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00b8b2ca60 switch_sp_and_call+0x13()
ffffff00b8b2cab0 do_interrupt+0xf2(ffffff00b8b2cac0, 0)
ffffff00b8b2cac0 _interrupt+0xba()
ffffff00b8b2cbb0 i86_mwait+0xd()
ffffff00b8b2cc00 cpu_idle_mwait+0xf1()
ffffff00b8b2cc20 idle+0x114()
ffffff00b8b2cc30 thread_start+8()
cpu_id = 0xd
cpu_intr_actv = 0x28000
cpu_intr_stack = 0xffffff00b8bce000
> 0xffffff00b8bcdee0$C
ffffff00b8bcdef0 panic_idle+0xe()
ffffff00b8bcdf70 xc_serv+0x119(0, 0)
ffffff00b8bcdfc0 av_dispatch_autovect+0x7c(f0)
ffffff00b8bcdff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00b8b91a60 switch_sp_and_call+0x13()
ffffff00b8b91ab0 do_interrupt+0xf2(ffffff00b8b91ac0, 0)
ffffff00b8b91ac0 _interrupt+0xba()
ffffff00b8b91bb0 i86_mwait+0xd()
ffffff00b8b91c00 cpu_idle_mwait+0xf1()
ffffff00b8b91c20 idle+0x114()
ffffff00b8b91c30 thread_start+8()
cpu_id = 0xe
cpu_intr_actv = 0x28000
cpu_intr_stack = 0xffffff00b8c33000
> 0xffffff00b8c32ee0$C
ffffff00b8c32ef0 panic_idle+0xe()
ffffff00b8c32f70 xc_serv+0x119(0, 0)
ffffff00b8c32fc0 av_dispatch_autovect+0x7c(f0)
ffffff00b8c32ff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00b8bf6a60 switch_sp_and_call+0x13()
ffffff00b8bf6ab0 do_interrupt+0xf2(ffffff00b8bf6ac0, 0)
ffffff00b8bf6ac0 _interrupt+0xba()
ffffff00b8bf6bb0 i86_mwait+0xd()
ffffff00b8bf6c00 cpu_idle_mwait+0xf1()
ffffff00b8bf6c20 idle+0x114()
ffffff00b8bf6c30 thread_start+8()
cpu_id = 0xf
cpu_intr_actv = 0x28000
cpu_intr_stack = 0xffffff00b8c98000
> 0xffffff00b8c97ee0$C
ffffff00b8c97ef0 panic_idle+0xe()
ffffff00b8c97f70 xc_serv+0x119(0, 0)
ffffff00b8c97fc0 av_dispatch_autovect+0x7c(f0)
ffffff00b8c97ff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00b8c5ba60 switch_sp_and_call+0x13()
ffffff00b8c5bab0 do_interrupt+0xf2(ffffff00b8c5bac0, 0)
ffffff00b8c5bac0 _interrupt+0xba()
ffffff00b8c5bbb0 i86_mwait+0xd()
ffffff00b8c5bc00 cpu_idle_mwait+0xf1()
ffffff00b8c5bc20 idle+0x114()
ffffff00b8c5bc30 thread_start+8()
Contrary to conclusions we drew earlier in the day, this means that we did enter panic_idle() on all of the non-panicking CPUs. Since panic_idle() sets the IPL to 10 but they're now all at 15, we can conclude that on each of these CPUs we took a level-15 interrupt after entering panic_idle() and never cleared it. This could happen if some other CPU (which presumably hasn't processed its priority panic_idle cross-call yet) sends a CPU a soft interrupt for a cross-call after that CPU has processed its priority panic_idle cross-call. That will kick the CPU back into the level-15 handler, where it will process the new cross call. (It's unclear why we didn't set the IPL back down again after that; a next step might be to look further up these stacks and see if we were still handling those interrupts, though I think I tried this and didn't find anything useful.)
This means that we can't conclude that we were at IPL 15 before the panic, at least not from the final IPL. However, panic_ipl is 15:
> panic_ipl/p
panic_ipl:
panic_ipl: 0xf
and the fact that there were enough cross-calls to hit the above race on so many CPUs suggests that we probably were spending a lot of time at IPL 15. On the other hand, most of the panic stacks indicate we were idle immediately before receiving the panic_idle() cross-call (but maybe that's because most of the activity is happening in interrupt context at the same IPL, so it won't show up on the stack).
None of this gets us much closer to the question of what happened before the panic.
Former user commented on 2011-11-19T22:53:01.000-0500:
Updates from some more analysis today of the same crash dump:
First, there was some question about whether it was unusual that all CPUs were in the QUIESCED state. However, I saw this in a dump taken via NMI on an otherwise fine system, so this is not significant.
Next, I looked at which cross-calls we had made most recently on each CPU when we panicked. Since CA appeared to trigger the problem, I'd assumed we were doing mostly DTrace-related cross-calls, but actually most of them were coming from KVM:
::walk cpu | ::print cpu_t cpu_m.xc_data.xc_func ! sort | uniq -c
4 cpu_m.xc_data.xc_func = dtrace_xcall_func
1 cpu_m.xc_data.xc_func = hati_demap_func
11 cpu_m.xc_data.xc_func = kvm_xcall_func
The kvm cross-calls were mostly for __vcpu_clear. Robert indicated that this was expected as we migrate VCPUs between physical CPUs, particularly on a busy system. Here's the VM configuration by number of VMs and VCPUs:
> ::walk kvm | ::print struct kvm online_vcpus
online_vcpus = 0x3
online_vcpus = 0x1
online_vcpus = 0x1
online_vcpus = 0x1
online_vcpus = 0x1
online_vcpus = 0x2
online_vcpus = 0x1
online_vcpus = 0x1
online_vcpus = 0x1
In an attempt to reproduce the original problem, I replicated this VM configuration on bh1-kvm0 with a mix of Windows, CentOS, and Ubuntu VMs. I also replicated the CA configuration by creating the same per-customer instrumentations we had in production around that time, plus a few more global instrumentations. Finally, I set up a smartosplus64 zone and ran sysbench against MySQL to generate load for the MySQL probes (which we believe was going on in production). The system is now pretty busy according to mpstat:
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 4 0 2838 3741 441 3389 529 888 114 0 8132 15 78 0 7
1 0 0 4040 3417 391 3549 594 932 131 0 8591 15 77 0 8
2 3 0 3234 3097 228 1986 381 466 45 0 6157 29 66 0 5
3 273 0 1270 3466 384 2306 369 634 106 0 6688 11 82 0 7
4 3 0 31 3855 319 3240 513 885 100 0 8687 13 79 0 8
5 0 0 5120 3188 439 3442 559 903 96 0 8546 15 76 0 9
6 57 0 543 3685 392 2602 441 673 83 0 6692 13 80 0 7
7 3 0 32 3893 322 3292 532 887 105 0 8520 15 75 0 10
8 23 0 45 4008 387 3117 530 805 94 0 7982 15 76 0 9
9 0 0 5798 3686 609 3167 511 823 74 0 7603 14 77 0 9
10 59 0 2407 3115 333 2110 336 557 88 0 5427 11 82 0 7
11 0 0 1670 3619 406 3203 550 748 107 0 7671 17 74 0 9
12 2 0 430 3849 428 3342 570 861 131 0 8574 13 76 0 11
13 43 0 1054 3817 410 2880 501 809 118 0 7092 15 76 0 9
14 126 0 16 3578 232 2240 367 633 94 0 5732 31 62 0 7
15 1 0 7629 2205 304 1960 293 514 74 0 6323 13 80 0 7
but it's still quite responsive. I've been pinging it continually from labwww, too. The vast majority of the time, it responds within 2ms (mostly within 700us). With the KVM load applied, the ping times are essentially unchanged except for occasional spikes up to as much as 600ms, but those spikes are hard to reproduce.
I dug a bit into the ping latency bubbles in hopes that they might shed some light on some periodic behavior that, if exacerbated, might help explain the system going totally unresponsive. I wrote /var/tmp/intrtime, which uses the per-CPU intrstat kstats to print the time spent per-CPU per-IPL per-second in microseconds. But I wasn't able to observe a correlation between the ping latency spikes and time spent at various interrupt levels. Here's what it looked like (with IPLs across the top and CPUs across the left):
CPU 5 6 7 8 9 10 11 12 13 14 15
0 0 387 0 0 0 2734 105 0 0 13533 105844
1 0 0 0 0 0 71 120 0 0 12707 97858
2 0 0 0 0 0 68 112 0 0 11288 100502
3 57 0 0 0 0 49 100 0 0 11929 114687
4 0 0 0 0 0 69 118 0 0 13274 113374
5 0 0 0 0 0 72 107 0 0 11682 81851
6 0 90 0 0 7 45 124 0 0 12436 104393
7 0 26 0 0 56 82 100 0 0 11404 107439
8 0 0 0 0 27 48 108 0 0 12022 107002
9 0 0 0 0 27 60 98 0 0 12339 81893
10 0 0 0 0 0 45 107 0 0 11777 90222
11 0 0 0 0 0 167 80 0 0 11302 94078
12 0 0 0 0 0 112 133 0 0 11895 97525
13 0 0 0 0 0 65 93 0 0 11457 98173
14 0 0 0 0 0 51 96 0 0 10931 97281
15 0 0 0 0 0 76 49 0 0 10291 52161
Visually, it appeared pretty consistent over time. We're obviously spending a lot of time (up to 10%) at IPL 15 and a bunch at IPL 14 as well. But it doesn't seem to account for the ping latency and it's hard to see how this could get so bad as to make the system totally wedge up.
Former user commented on 2011-11-21T21:20:51.000-0500 (edited 2011-11-21T21:22:31.000-0500):
We have a couple of new findings today. First, when I dumped the per-CPU interrupt stacks, I didn't go back far enough. Consider CPU0 (intr stack 0xffffff00b80d2000):
> 0xffffff00b80d1c60$C
ffffff00b80d1ce0 xc_serv+0x136(0, 0)
ffffff00b80d1d30 av_dispatch_autovect+0x7c(f0)
ffffff00b80d1d60 dispatch_hilevel+0x1f(f0, 0)
ffffff00b80d1db0 do_interrupt+0xd8(ffffff00b80d1dc0, 0)
ffffff00b80d1dc0 _interrupt+0xba()
ffffff00b80d1ee0 do_splx+0x8d(a)
ffffff00b80d1ef0 panic_idle+0xe()
ffffff00b80d1f70 xc_serv+0x119(0, 0)
ffffff00b80d1fc0 av_dispatch_autovect+0x7c(f0)
ffffff00b80d1ff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00b8005a60 switch_sp_and_call+0x13()
ffffff00b8005ab0 do_interrupt+0xf2(ffffff00b8005ac0, 0)
ffffff00b8005ac0 _interrupt+0xba()
ffffff00b8005bb0 i86_mwait+0xd()
ffffff00b8005c00 cpu_idle_mwait+0xf1()
ffffff00b8005c20 idle+0x114()
ffffff00b8005c30 thread_start+8()
In panic_idle(), as we call splx() to lower the IPL to 10, we take an interrupt for xc_serv (a cross-call interrupt). We're stuck at xc_serv+0x136, which is:
xc_serv+0x12e: movq %r12,%rdi
xc_serv+0x131: call -0x40e <xc_extract>
xc_serv+0x136: movq %rax,%rbx
xc_serv+0x139: testq %rbx,%rbx
This is the call to "xc_extract" in the "for" loop at line 329 in x_call.c. At this point in the code, we know that xc_work_cnt > 0 but xc_msgbox == NULL, which matches the in-memory state for CPU0. So it's just waiting for a message to show up in the msgbox. Since this thread is not in xc_common() itself, we can assume it's waiting for a message from the guy who's doing the cross call to indicate that it's complete. In fact, there's one guy doing a cross-call on the system, and it's CPU 2:
> ::stacks -c xc_common
THREAD STATE SOBJ COUNT
ffffff00b8656c40 ONPROC <NONE> 1
0xf0
do_interrupt+0xf2
_interrupt+0xba
do_splx+0x8d
panic_idle+0xe
xc_serv+0x119
xc_common+0x229
xc_sync+0x46
dtrace_xcall+0x8c
dtrace_sync+0x18
dtrace_state_deadman+0x1c
cyclic_softint+0xdc
cbe_low_level+0x17
av_dispatch_softvect+0x5f
dispatch_softint+0x34
> ffffff00b8656c40::print kthread_t t_cpu->cpu_id
t_cpu->cpu_id = 0x2
He's doing a synchronous cross-call, which will hold all CPUs at IPL 15 until every CPU has completed it. Looking at the per-CPU cross-call state:
> ::walk cpu | ::print cpu_t cpu_id cpu_m.mcpu_pri cpu_m.xc_work_cnt cpu_m.xc_wait_cnt cpu_m
.xc_msgbox
cpu_id = 0
cpu_m.mcpu_pri = 0xf
cpu_m.xc_work_cnt = 0x1
cpu_m.xc_wait_cnt = 0
cpu_m.xc_msgbox = 0
cpu_id = 0x1
cpu_m.mcpu_pri = 0xf
cpu_m.xc_work_cnt = 0x1
cpu_m.xc_wait_cnt = 0
cpu_m.xc_msgbox = 0
cpu_id = 0x2
cpu_m.mcpu_pri = 0xf
cpu_m.xc_work_cnt = 0x11
cpu_m.xc_wait_cnt = 0x10
cpu_m.xc_msgbox = 0
cpu_id = 0x3
cpu_m.mcpu_pri = 0xf
cpu_m.xc_work_cnt = 0x1
cpu_m.xc_wait_cnt = 0
cpu_m.xc_msgbox = 0
cpu_id = 0x4
cpu_m.mcpu_pri = 0xf
cpu_m.xc_work_cnt = 0x1
cpu_m.xc_wait_cnt = 0
cpu_m.xc_msgbox = 0
cpu_id = 0x5
cpu_m.mcpu_pri = 0xf
cpu_m.xc_work_cnt = 0x1
cpu_m.xc_wait_cnt = 0
cpu_m.xc_msgbox = 0
cpu_id = 0x6
cpu_m.mcpu_pri = 0xc
cpu_m.xc_work_cnt = 0x1
cpu_m.xc_wait_cnt = 0
cpu_m.xc_msgbox = 0
cpu_id = 0x7
cpu_m.mcpu_pri = 0xf
cpu_m.xc_work_cnt = 0x1
cpu_m.xc_wait_cnt = 0
cpu_m.xc_msgbox = 0
cpu_id = 0x8
cpu_m.mcpu_pri = 0xf
cpu_m.xc_work_cnt = 0x1
cpu_m.xc_wait_cnt = 0
cpu_m.xc_msgbox = 0
cpu_id = 0x9
cpu_m.mcpu_pri = 0xf
cpu_m.xc_work_cnt = 0x1
cpu_m.xc_wait_cnt = 0
cpu_m.xc_msgbox = 0
cpu_id = 0xa
cpu_m.mcpu_pri = 0xf
cpu_m.xc_work_cnt = 0x1
cpu_m.xc_wait_cnt = 0
cpu_m.xc_msgbox = 0
cpu_id = 0xb
cpu_m.mcpu_pri = 0xa
cpu_m.xc_work_cnt = 0x1
cpu_m.xc_wait_cnt = 0
cpu_m.xc_msgbox = 0xffffff199899f2a8
cpu_id = 0xc
cpu_m.mcpu_pri = 0xf
cpu_m.xc_work_cnt = 0x1
cpu_m.xc_wait_cnt = 0
cpu_m.xc_msgbox = 0
cpu_id = 0xd
cpu_m.mcpu_pri = 0xf
cpu_m.xc_work_cnt = 0x1
cpu_m.xc_wait_cnt = 0
cpu_m.xc_msgbox = 0
cpu_id = 0xe
cpu_m.mcpu_pri = 0xf
cpu_m.xc_work_cnt = 0x1
cpu_m.xc_wait_cnt = 0
cpu_m.xc_msgbox = 0
cpu_id = 0xf
cpu_m.mcpu_pri = 0xf
cpu_m.xc_work_cnt = 0x1
cpu_m.xc_wait_cnt = 0
cpu_m.xc_msgbox = 0
Now all CPUs except for 2 (the guy making the cross-call) and 11 are at IPL 15 waiting for exactly one cross-call message. This is consistent with a case where CPU 11 just hasn't handled the cross-call yet, which is corroborated by the fact that he has a non-NULL msgbox. Oddly, though, he's at IPL 10. He's on the high-level interrupt stack based on cpu_intr_actv, which is surprising given that he's currently at IPL 10. But his stack seems to look like this (manually reconstructed):
> 0xffffff00b8b03e00$C
ffffff00b8b03ef0 panic_idle+0x27()
ffffff00b8b03f70 xc_serv+0x119(0, 0)
ffffff00b8b03fc0 av_dispatch_autovect+0x7c(f0)
ffffff00b8b03ff0 dispatch_hilevel+0x1f(f0, 0)
ffffff00bb0392a0 switch_sp_and_call+0x13()
ffffff00bb0392f0 do_interrupt+0xf2(ffffff00bb039300, ffffff6c063d2028)
ffffff00bb039300 _interrupt+0xba()
ffffff00bb039400 vcpu_enter_guest+0x31c()
ffffff00bb039420 __vcpu_run+0x77()
ffffff00bb039460 kvm_arch_vcpu_ioctl_run+0x110()
ffffff00bb039c70 kvm_ioctl+0x11c()
ffffff00bb039cb0 cdev_ioctl+0x45(3400000015, 2000ae80, 0, 202003, ffffff4a853a1640,
ffffff00bb039d94)
ffffff00bb039cf0 spec_ioctl+0x5a(ffffff19e3102540, 2000ae80, 0, 202003, ffffff4a853a1640,
ffffff00bb039d94, 0)
ffffff00bb039d70 fop_ioctl+0x7b(ffffff19e3102540, 2000ae80, 0, 202003, ffffff4a853a1640,
ffffff00bb039d94, 0)
ffffff00bb039e70 ioctl+0x18e(e, 2000ae80, 0)
ffffff00bb039f00 dtrace_systrace_syscall+0x11a(e, 2000ae80, 0, 0, fffffd7fff2d7484, 2991ef9d
)
ffffff00bb039f10 sys_syscall+0x17a()
The presence of panic_idle() explains why he could be on the high-level interrupt stack (because he took the panic_idle() cross-call) but wound up at IPL 10 (because panic_idle() calls splx(10)).
However, we know from CPU 0 that we can take a cross-call interrupt while spinning in panic_idle, so we would expect CPU 11 to see the pending IPI and start handling it, but it doesn't. More analysis needed.
Note that all of the relevant facts above have been seen in two out of two dumps examined. One next step would be to take a quick look at the other dumps and see if they're in the same state:
- 1 CPU with a stack like CPU 11's: at IPL 10, having come through panic_idle(), and with a non-NULL xc_msgbox
- 1 CPU performing a synchronous cross-call at IPL 15
- All other CPUs at IPL 15 spinning in xc_serv+0x136 with xc_work_cnt == 1 and xc_msgbox == NULL
One hypothesis is that the hardware may have dropped the IPI while executing inside the guest.
Former user commented on 2011-11-22T12:30:25.000-0500:
Analysis of the dumps:
==== 46VY9R1/vmdump.0 ====
panic_hrtime = 0x38e3599cfb0d ; panic_cpu = 0x2 ; panic_ipl = 0xf
enablings = 12 ; kvm = 9 ; pic_get_irq = 0 ; dtrace_sync = 1
CPU WORK PRI IPL ACTV CYCEXPIRE LAST-15 MBOX
0 0x1 0xf 0xf 0x28000 3690a9162ffe 38e2eef2cf4e 0
0x1 0x1 0xf 0xf 0x28000 3690a9181f15 38e2eef2d37b 0
0x2 0x1 0xc 0xf 0x18004 3690a9162ffe 3690a87fdd0f 0
0x3 0x1 0xf 0xf 0x28000 3690a9183e92 38e2eef2d514 0
0x4 0x1 0xf 0xf 0x28004 3690a9162ffe 38e2eef2d686 0
0x5 0x2 0 0 0 3690a9189abf 3690a7e92204 0xffffff1998b044c0
0x6 0x1 0xf 0xf 0x28000 3690a8e3795a 38e2eef2db42 0
0x7 0x1 0xf 0xf 0x28000 3690a9164700 38e2eef2deee 0
0x8 0x1 0xf 0xf 0x28000 3690a9162ffe 38e2eef2e0c2 0
0x9 0x1 0xf 0xf 0x28000 3690a918dfc9 38e2eef2e3fd 0
0xa 0x1 0xf 0xf 0x28000 3690a8fcef6f 38e2eef2e545 0
0xb 0x1 0xf 0xf 0x28000 3690a88e7c60 38e2eef2e997 0
0xc 0x11 0xf 0xf 0x18004 3690a8d6d4d3 38e2eef2eb96 0
0xd 0x1 0xf 0xf 0x28000 3690a88cf2c0 38e2eef2ef6f 0
0xe 0x1 0xf 0xf 0x28000 3690a8ebc06a 38e2eef2f222 0
0xf 0x1 0xf 0xf 0x28000 3690a9164700 38e2eef2f5d5 0
==== 46VY9R1/vmdump.1 ====
panic_hrtime = 0x86d2e406bb1 ; panic_cpu = 0x1 ; panic_ipl = 0xf
enablings = 7 ; kvm = 10 ; pic_get_irq = 1 ; dtrace_sync = 1
CPU WORK PRI IPL ACTV CYCEXPIRE LAST-15 MBOX
0 0x1 0xf 0xf 0x28000 7f88147181e 86d2e405252 0
0x1 0x1 0xc 0xf 0x18000 7f881472b0f 7f88127646f 0
0x2 0x1 0xf 0xf 0x28000 7f8814739ce 86d2e405423 0
0x3 0x1 0xf 0xf 0x28000 7f881474e74 86d2e40561a 0
0x4 0x1 0xf 0xf 0x28000 7f881475abc 86d2e405a1c 0
0x5 0x1 0xf 0xf 0x28000 7f881476d45 86d2e405bca 0
0x6 0x1 0xf 0xf 0x28000 7f8814781b0 86d2e405ef2 0
0x7 0x1 0xf 0xf 0x28000 7f881479607 86d2e40614a 0
0x8 0x1 0xf 0xf 0x28000 7f88147a0cd 86d2e406351 0
0x9 0x1 0xf 0xf 0x28000 7f88147b211 86d2e406546 0
0xa 0x1 0xf 0xf 0x28000 7f8812dcca0 86d2e406816 0
0xb 0x1 0xf 0xf 0x28000 7f881333c3c 86d2e406a78 0
0xc 0x1 0xf 0xf 0x28000 7f88147d96a 86d2e406d33 0
0xd 0x11 0xf 0xf 0x18004 7f88147ea52 86d2e407014 0
0xe 0x1 0xf 0xf 0x28000 7f88147f030 86d2e4072da 0
0xf 0x1 0xa 0xa 0x18000 86f09742a40 86ef9dd8f7c 0xffffff1998bf4af8
==== 586Y9R1/vmdump.0 ====
panic_hrtime = 0x3487528c7b60 ; panic_cpu = 0x9 ; panic_ipl = 0xf
enablings = 12 ; kvm = 3 ; pic_get_irq = 0 ; dtrace_sync = 2
CPU WORK PRI IPL ACTV CYCEXPIRE LAST-15 MBOX
0 0x1 0xf 0xf 0x28000 3401c5c94bf7 3487528cae56 0
0x1 0x2 0xf 0xf 0x18000 3401c490fd2c 3487528cb0ff 0
0x2 0x1 0xf 0xf 0x28000 3401c5c970fd 3487528cb3c4 0
0x3 0x1 0xf 0xf 0x28000 3401c5c981ca 3487528cb5ff 0
0x4 0x1 0xf 0xf 0x28000 3401c5c98e0e 3487528cb945 0
0x5 0x1 0xf 0xf 0x28000 3401c5c9a0e6 3487528cbaf2 0
0x6 0x1 0xf 0xf 0x28000 3401c5c9ad5e 3487528cbef3 0
0x7 0x1 0xf 0xf 0x28000 3401c5c9c0f0 3487528cc085 0
0x8 0x1 0xf 0xf 0x28000 3401c5c9d11b 3487528cc3c4 0
0x9 0x1 0xc 0xf 0x18000 3401c5c9e375 34015b15a3d9 0
0xa 0x1 0xf 0xf 0x28000 3401c5c9e919 3487528cc682 0
0xb 0x1 0xf 0xf 0x28000 3401c5c9fc8d 3487528cc844 0
0xc 0x1 0xf 0xf 0x28000 3401c5ca0c1c 3487528ccb96 0
0xd 0x11 0xf 0xf 0x18004 3401c5ca2088 3487528ccd92 0
0xe 0x1 0xf 0xf 0x28000 3401c5ca2aa2 3487528cd04d 0
0xf 0x3 0 0 0 3401c490fd2c 34015822721f 0xffffff1998bf7af8
==== 89FY9R1/vmdump.0 ====
panic_hrtime = 0x385e39db1ea0 ; panic_cpu = 0x8 ; panic_ipl = 0xf
enablings = 7 ; kvm = 11 ; pic_get_irq = 0 ; dtrace_sync = 1
CPU WORK PRI IPL ACTV CYCEXPIRE LAST-15 MBOX
0 0x11 0xf 0xf 0x18004 38274c52d782 385e39db8853 0
0x1 0x2 0xf 0xf 0x18000 38274c52e5e7 385e39db8c7a 0
0x2 0x1 0xf 0xf 0x28000 38274c52fe15 385e39db8e3a 0
0x3 0x1 0xf 0xf 0x28000 38274c530e24 385e39db8e8f 0
0x4 0x1 0xf 0xf 0x28000 38274c53221b 385e39db9307 0
0x5 0x1 0xf 0xf 0x28000 38274c4eee50 385e39db9415 0
0x6 0x1 0xf 0xf 0x28000 38274c53368f 385e39db9754 0
0x7 0x1 0xf 0xf 0x28000 38274c5380fd 385e39db972c 0
0x8 0x1 0xc 0xf 0x18000 38274c5391eb 38274c4e45fc 0
0x9 0x1 0xf 0xf 0x28000 38274c539ffd 385e39db9a32 0
0xa 0x1 0xf 0xf 0x28000 38274c53b4ac 385e39db9d35 0
0xb 0x1 0xf 0xf 0x28000 38274c53bead 385e39db9ef0 0
0xc 0x1 0xf 0xf 0x28000 38274c53d970 385e39dba152 0
0xd 0x3 0 0 0 38274c2ed4a6 38274bb79688 0xffffff199899fb98
0xe 0x1 0xf 0xf 0x28000 38274c53ee19 385e39dba446 0
0xf 0x1 0xf 0xf 0x28000 38274c53fee4 385e39dba720 0
==== B9FY9R1/vmdump.0 ====
panic_hrtime = 0x24e15ce965cc ; panic_cpu = 0x4 ; panic_ipl = 0xf
enablings = 2 ; kvm = 9 ; pic_get_irq = 0 ; dtrace_sync = 1
CPU WORK PRI IPL ACTV CYCEXPIRE LAST-15 MBOX
0 0x1 0xf 0xf 0x2c000 239975017c00 24e0f23f5d27 0
0x1 0x1 0xf 0xf 0x28000 239975230776 24e0f23f6082 0
0x2 0x1 0xf 0xf 0x28000 23997532f8b5 24e0f23f658a 0
0x3 0x1 0xf 0xf 0x28004 23997505066f 24e0f23f658e 0
0x4 0x1 0xc 0xf 0x18004 2399759a1280 23990a57aeef 0
0x5 0x1 0xf 0xf 0x28000 2399762b81f2 24e0f23f6830 0
0x6 0x1 0xf 0xf 0x28004 239976671cbc 24e0f23f6c71 0
0x7 0x1 0xf 0xf 0x28004 239976a2b786 24e0f23f6d81 0
0x8 0x1 0xf 0xf 0x28000 239976de5250 24e0f23f6f79 0
0x9 0x1 0xf 0xf 0x28000 23997719ed1a 24e0f23f72e3 0
0xa 0x11 0xf 0xf 0x18004 2399759296c2 24e0f23f74f5 0
0xb 0x1 0xf 0xf 0x28000 2399779122ae 24e0f23f776e 0
0xc 0x1 0xf 0xf 0x28000 239977ccbd78 24e0f23f78bb 0
0xd 0x3 0 0 0 239978085842 23995b5f1e80 0xffffff1998b04a10
0xe 0x1 0xf 0xf 0x28000 23997843f30c 24e0f23f7f74 0
0xf 0x2 0xf 0xf 0x18000 2399787f8dd6 24e0f23f8044 0
==== B9FY9R1/vmdump.1 ====
panic_hrtime = 0xab094f4d7bdf ; panic_cpu = 0xc ; panic_ipl = 0xf
enablings = 2 ; kvm = 11 ; pic_get_irq = 0 ; dtrace_sync = 1
CPU WORK PRI IPL ACTV CYCEXPIRE LAST-15 MBOX
0 0x1 0xf 0xf 0x2c000 661f9f24ac00 ab08e4a40873 0
0x1 0x1 0xf 0xf 0x28000 661f9f6046ca ab08e4a40b7a 0
0x2 0x1 0xf 0xf 0x28000 661f9f9be194 ab08e4a40ecd 0
0x3 0x1 0xf 0xf 0x28004 661f9fd77c5e ab08e4a40f2a 0
0x4 0x1 0xf 0xf 0x28000 661f9f253ca0 ab08e4a4103f 0
0x5 0x1 0xf 0xf 0x28000 661fa04eb1f2 ab08e4a4139f 0
0x6 0x11 0xf 0xf 0x18004 661f9faa9d92 ab08e4a4175c 0
0x7 0x1 0xf 0xf 0x28000 661f9f4e359b ab08e4a4193b 0
0x8 0x3 0 0 0 661fa1018250 661f2ded836b 0xffffff1998b24b90
0x9 0x1 0xf 0xf 0x28000 661fa13d1d1a ab08e4a41c2f 0
0xa 0x2 0xf 0xf 0x18000 661f9f24ac00 ab08e4a42001 0
0xb 0x1 0xf 0xf 0x28000 661f9f3eefc4 ab08e4a4218b 0
0xc 0x1 0xc 0xf 0x18004 661fa1efed78 661f347b7e71 0
0xd 0x1 0xf 0xf 0x28000 661fa22b8842 ab08e4a4230c 0
0xe 0x1 0xf 0xf 0x28000 661fa267230c ab08e4a42530 0
0xf 0x1 0xf 0xf 0x28000 661fa2a2bdd6 ab08e4a429c5 0
==== B9FY9R1/vmdump.2 ====
panic_hrtime = 0x425d352b36e4 ; panic_cpu = 0x3 ; panic_ipl = 0xf
enablings = 2 ; kvm = 13 ; pic_get_irq = 0 ; dtrace_sync = 1
CPU WORK PRI IPL ACTV CYCEXPIRE LAST-15 MBOX
0 0x1 0xf 0xf 0x2c000 7fcb7fd8c00 425cca810569 0
0x1 0x3 0 0 0 7fcb46cc87e 7fca61c6c5b 0xffffff1998bf7ae8
0x2 0x1 0xf 0xf 0x28000 7fcb874c194 425cca810a01 0
0x3 0x1 0xc 0xf 0x18000 7fcb8b05c5e 7fcb7fdaee8 0
0x4 0x1 0xf 0xf 0x2c000 7fcb8ebf728 425cca810c0d 0
0x5 0x1 0xf 0xf 0x28000 7fcb92791f2 425cca810d96 0
0x6 0x1 0xf 0xf 0x28004 7fcb9632cbc 425cca81116d 0
0x7 0x1 0xf 0xf 0x28000 7fcb99ec786 425cca8113ca 0
0x8 0x1 0xf 0xf 0x28000 7fcb9da6250 425cca8116ff 0
0x9 0x2 0xf 0xf 0x18000 7fca9363934 425cca811953 0
0xa 0x1 0xf 0xf 0x28000 7fcb9884745 425cca811bf1 0
0xb 0x1 0xf 0xf 0x28000 7fcba8d32ae 425cca811e60 0
0xc 0x1 0xf 0xf 0x28004 7fcb8962280 425cca8120fb 0
0xd 0x1 0xf 0xf 0x28004 7fcbaf87c80 425cca8122af 0
0xe 0x1 0xf 0xf 0x28000 7fcb9c74f80 425cca812672 0
0xf 0x11 0xf 0xf 0x18004 7fcb80cce40 425cca812856 0
==== BYLY9R1/vmdump.1 ====
panic_hrtime = 0x3782b39f3cba ; panic_cpu = 0x2 ; panic_ipl = 0xf
enablings = 12 ; kvm = 14 ; pic_get_irq = 0 ; dtrace_sync = 3
CPU WORK PRI IPL ACTV CYCEXPIRE LAST-15 MBOX
0 0x1 0xf 0xf 0x28000 3569e52cf339 3782b39f9f67 0
0x1 0x11 0xf 0xf 0x18004 3569e50b4d50 3782b39fa2b9 0
0x2 0x1 0xc 0xf 0x18000 3569e52d0a08 35697a5bde31 0
0x3 0x1 0xf 0xf 0x18004 3569e52d1d23 3782b39fa5dc 0
0x4 0x1 0xf 0xf 0x28000 3569e52d280e 3782b39fa7d5 0
0x5 0x1 0xf 0xf 0x28000 3569e51cbe4b 3782b39fab10 0
0x6 0x1 0xf 0xf 0x28000 3569e52d4a2d 3782b39fad53 0
0x7 0x1 0xf 0xf 0x28000 3569e52d5fc4 3782b39fb0d3 0
0x8 0x1 0xf 0xf 0x28000 3569e52d9ee5 3782b39fb2e9 0
0x9 0x1 0xf 0xf 0x28000 3569e52db36b 3782b39fb631 0
0xa 0x1 0xf 0xf 0x28000 3569e507f080 3782b39fb6d6 0
0xb 0x1 0xf 0xf 0x28000 3569e52dd351 3782b39fbaac 0
0xc 0x1 0xf 0xf 0x28000 3569e52de2c5 3782b39fbd1a 0
0xd 0x1 0xa 0xa 0x18000 378b023f6880 3785956309be 0xffffff1998bf7f58
0xe 0x1 0xf 0xf 0x28000 3569e52dfd42 3782b39fc195 0
0xf 0x1 0xf 0xf 0x28000 3569e52e0e82 3782b39fc574 0
==== BYLY9R1/vmdump.2 ====
panic_hrtime = 0x102e52e1ea4 ; panic_cpu = 0 ; panic_ipl = 0xf
enablings = 12 ; kvm = 15 ; pic_get_irq = 0 ; dtrace_sync = 1
CPU WORK PRI IPL ACTV CYCEXPIRE LAST-15 MBOX
0 0x1 0xc 0xf 0x18000 45d8e42f49 456e2342f4 0
0x1 0x1 0xf 0xf 0x28000 45d8d0c0d3 102e52dfc9c 0
0x2 0x1 0xf 0xf 0x28000 45d8d23c00 102e52dff2e 0
0x3 0x1 0xf 0xf 0x28000 45d8e46aef 102e52e02aa 0
0x4 0x1 0xf 0xf 0x28004 45d8e48473 102e52e058c 0
0x5 0x1 0xf 0xf 0x28000 45d8e4e818 102e52e05c7 0
0x6 0x1 0xf 0xf 0x28004 45d8e4fb99 102e52e0b8c 0
0x7 0x1 0xf 0xf 0x28000 45d8e51659 102e52e0b33 0
0x8 0x11 0xf 0xf 0x18004 45d8d2249e 102e52e0c93 0
0x9 0x1 0xf 0xf 0x28000 45d8e54f49 102e52e0f27 0
0xa 0x1 0xf 0xf 0x28000 45d8cf2f10 102e52e109c 0
0xb 0x1 0xf 0xf 0x28000 45d8ceb67f 102e52e136f 0
0xc 0x1 0xf 0xf 0x28000 45d8e5894e 102e52e161b 0
0xd 0x1 0xf 0xf 0x28000 45d8e59560 102e52e1964 0
0xe 0x2 0 0 0 45d8e5a623 456d8d05c7 0xffffff1998b04da0
0xf 0x1 0xf 0xf 0x28000 45d8d36e12 102e52e1d4c 0
==== BYLY9R1/vmdump.3 ====
panic_hrtime = 0xcc84656132 ; panic_cpu = 0x8 ; panic_ipl = 0xf
enablings = 7 ; kvm = 16 ; pic_get_irq = 0 ; dtrace_sync = 1
CPU WORK PRI IPL ACTV CYCEXPIRE LAST-15 MBOX
0 0x1 0xf 0xf 0x2c000 b59b66d1c1 cc19bb2670 0
0x1 0x1 0xf 0xf 0x28000 b59b142f80 cc19bb2a0b 0
0x2 0x1 0xf 0xf 0x28000 b59b6703f1 cc19bb2c8b 0
0x3 0x1 0xf 0xf 0x28000 b59b0e3320 cc19bb2d91 0
0x4 0x1 0xf 0xf 0x28000 b59b673a50 cc19bb2fd5 0
0x5 0x1 0xf 0xf 0x28000 b59b66e404 cc19bb3249 0
0x6 0x1 0xf 0xf 0x28000 b59b675c9f cc19bb3483 0
0x7 0x1 0xf 0xf 0x28000 b59b67767b cc19bb3757 0
0x8 0x11 0xc 0 0x4 b59b16be00 b52dfe373d 0
0x9 0x1 0xf 0xf 0x28000 b59b67a6ac cc19bb3a1b 0
0xa 0x1 0xf 0xf 0x28004 b59b67c459 cc19bb3d88 0
0xb 0x1 0xf 0xf 0x28000 b59b5d9992 cc19bb4053 0
0xc 0x1 0xf 0xf 0x28000 b59b67f4b7 cc19bb3ff4 0
0xd 0x1 0xf 0xf 0x28000 b59b680a6d cc19bb430b 0
0xe 0x1 0xf 0xf 0x28000 b59b3eb00b cc19bb4532 0
0xf 0x1 0xa 0xa 0x18000 cdcb10a868 cf08dba8d4 0xffffff1998b04d80
==== BYLY9R1/vmdump.4 ====
panic_hrtime = 0x2d7109c056c ; panic_cpu = 0xa ; panic_ipl = 0xf
enablings = 2 ; kvm = 15 ; pic_get_irq = 0 ; dtrace_sync = 1
CPU WORK PRI IPL ACTV CYCEXPIRE LAST-15 MBOX
0 0x1 0xf 0xf 0x28400 2627bf98280 2d6a5f1b51c 0
0x1 0x1 0xf 0xf 0x28000 2627b9c86ca 2d6a5f1b8c7 0
0x2 0x1 0xf 0xf 0x28000 2627b63f940 2d6a5f1bb03 0
0x3 0x1 0xf 0xf 0x28000 2627bbbcd57 2d6a5f1be37 0
0x4 0x1 0xf 0xf 0x28000 2627c4f5728 2d6a5f1c007 0
0x5 0x1 0xf 0xf 0x28000 2627b942b3b 2d6a5f1c3d7 0
0x6 0x1 0xf 0xf 0x28000 2627b61a4e0 2d6a5f1c4a3 0
0x7 0x1 0xf 0xf 0x28000 2627b97eb8f 2d6a5f1c8ba 0
0x8 0x1 0xf 0xf 0x28000 2627bce6f8a 2d6a5f1ca1a 0
0x9 0x1 0xf 0xf 0x28000 2627d795d1a 2d6a5f1cde6 0
0xa 0x11 0xc 0 0x4 2627bf79952 26271db383f 0
0xb 0x1 0xf 0xf 0x28000 2627baa4888 2d6a5f1d07f 0
0xc 0x3 0 0 0 26278fe9200 2626e465db8 0xffffff1998b04a00
0xd 0x1 0xf 0xf 0x28000 2627b625880 2d6a5f1d4e8 0
0xe 0x2 0xf 0xf 0x18000 26278ab2ca8 2d6a5f1d84d 0
0xf 0x1 0xf 0xf 0x28000 2627b6c4370 2d6a5f1d939 0
==== CYLY9R1/vmdump.0 ====
panic_hrtime = 0x3ff7fe51d621 ; panic_cpu = 0x6 ; panic_ipl = 0xf
enablings = 14 ; kvm = 14 ; pic_get_irq = 0 ; dtrace_sync = 1
CPU WORK PRI IPL ACTV CYCEXPIRE LAST-15 MBOX
0 0x1 0xf 0xf 0x28000 3fb49f85d9cc 3ff7fe521de7 0
0x1 0x1 0xf 0xf 0x28000 3fb49f85e8e7 3ff7fe522197 0
0x2 0x11 0xf 0xf 0x18004 3fb49f7c14ba 3ff7fe5222f3 0
0x3 0x1 0xf 0xf 0x28000 3fb49f8607d0 3ff7fe5226c7 0
0x4 0x1 0xf 0xf 0x28000 3fb49f8619e1 3ff7fe5228e3 0
0x5 0x1 0xf 0xf 0x28000 3fb49f86270a 3ff7fe522a76 0
0x6 0x1 0xc 0xf 0x18000 3fb49f863922 3fb49f5b2e5d 0
0x7 0x1 0xf 0xf 0x28000 3fb49f620c76 3ff7fe522cc4 0
0x8 0x1 0xf 0xf 0x28000 3fb49f6adddf 3ff7fe522eb4 0
0x9 0x1 0xf 0xf 0x28000 3fb49f682dac 3ff7fe52325a 0
0xa 0x1 0xf 0xf 0x28000 3fb49f66ccc0 3ff7fe523462 0
0xb 0x1 0xa 0xa 0x18000 3fff86c2d2ad 3fff86c34525 0xffffff199899f2a8
0xc 0x1 0xf 0xf 0x28000 3fb49f869b62 3ff7fe523afa 0
0xd 0x1 0xf 0xf 0x28000 3fb49f86a685 3ff7fe523b2e 0
0xe 0x1 0xf 0xf 0x28000 3fb49f86b6e5 3ff7fe523c76 0
0xf 0x1 0xf 0xf 0x28000 3fb49f86c0e8 3ff7fe524027 0
==== F5VY9R1/vmdump.0 ====
panic_hrtime = 0x39a4b9043c29 ; panic_cpu = 0xa ; panic_ipl = 0xf
enablings = 14 ; kvm = 9 ; pic_get_irq = 1 ; dtrace_sync = 1
CPU WORK PRI IPL ACTV CYCEXPIRE LAST-15 MBOX
0 0x1 0xf 0xf 0x28000 38edc8770885 39a44e5a6808 0
0x1 0x1 0xf 0xf 0x28000 38edc86537e0 39a44e5a6b55 0
0x2 0x1 0xf 0xf 0x28004 38edc8a0e836 39a44e5a6de5 0
0x3 0x1 0xf 0xf 0x28000 38edc8a0f63c 39a44e5a711c 0
0x4 0x1 0xf 0xf 0x28004 38edc8a10c9a 39a44e5a728a 0
0x5 0x1 0xf 0xf 0x28000 38edc8a11691 39a44e5a75db 0
0x6 0x1 0xf 0xf 0x28000 38edc8a12ffd 39a44e5a7875 0
0x7 0x1 0xf 0xf 0x28000 38edc8920d6d 39a44e5a7bfc 0
0x8 0x1 0xf 0xf 0x28000 38edc8a15422 39a44e5a7e81 0
0x9 0x1 0xf 0xf 0x28000 38edc869c398 39a44e5a8190 0
0xa 0x1 0xc 0xf 0x18004 38edc8a17861 38ed5db70a81 0
0xb 0x1 0xf 0xf 0x28000 38edc860d105 39a44e5a845a 0
0xc 0x11 0xf 0xf 0x18004 38edc8a195d6 39a44e5a87b3 0
0xd 0x1 0xf 0xf 0x28000 38edc8a1a0d4 39a44e5a898d 0
0xe 0x1 0xf 0xf 0x28000 38edc89a63fc 39a44e5a8b67 0
0xf 0x2 0 0xb 0 38edc8633b99 38ed5d217306 0xffffff1998b04400
==== H9FY9R1/vmdump.0 ====
panic_hrtime = 0x37813f5168b8 ; panic_cpu = 0x2 ; panic_ipl = 0xf
enablings = 18 ; kvm = 8 ; pic_get_irq = 0 ; dtrace_sync = 2
CPU WORK PRI IPL ACTV CYCEXPIRE LAST-15 MBOX
0 0x3 0 0 0 36374059a9d3 3636d5106270 0xffffff1998bf7e18
0x1 0x2 0xf 0xf 0x28000 36374059bd90 3780d4a7c6d2 0
0x2 0x2 0xc 0xf 0x18000 36374059c9e4 3636d5a75e49 0
0x3 0x12 0xf 0xf 0x18004 36374059ddff 3780d4a7ca4a 0
0x4 0x2 0xf 0xf 0x28000 36374059e92a 3780d4a7cc50 0
0x5 0x2 0xf 0xf 0x28000 36374059ffa5 3780d4a7cfd4 0
0x6 0x2 0xf 0xf 0x28000 3637405a0d33 3780d4a7d1c9 0
0x7 0x2 0xf 0xf 0x28000 3637405a2148 3780d4a7d51a 0
0x8 0x2 0xf 0xf 0x28000 3637405a279a 3780d4a7d78c 0
0x9 0x2 0xf 0xf 0x28000 3637405a3965 3780d4a7da83 0
0xa 0x2 0xf 0xf 0x28000 3637405a4574 3780d4a7dc6a 0
0xb 0x12 0xf 0xf 0x18004 3637405a5d7d 3780d4a7df2f 0
0xc 0x2 0xf 0xf 0x28000 3637405a67e7 3780d4a7e10a 0
0xd 0x2 0xf 0xf 0x28000 3637405a79b9 3780d4a7e43e 0
0xe 0x2 0xf 0xf 0x28000 3637405a8993 3780d4a7e688 0
0xf 0x2 0xf 0xf 0x28000 3637405a9bb3 3780d4a7ea41 0
==== H9FY9R1/vmdump.1 ====
panic_hrtime = 0x1230d58ad5a ; panic_cpu = 0x6 ; panic_ipl = 0xf
enablings = 17 ; kvm = 9 ; pic_get_irq = 1 ; dtrace_sync = 2
CPU WORK PRI IPL ACTV CYCEXPIRE LAST-15 MBOX
0 0x2 0xf 0xf 0x28000 8fdce184a9 122a2adfc49 0
0x1 0x2 0xf 0xf 0x28000 8fdcacf97f 122a2ae0030 0
0x2 0x2 0xf 0xf 0x28000 8fdce1b047 122a2ae02d5 0
0x3 0x2 0xf 0xf 0x28000 8fdce1c1cb 122a2ae03b1 0
0x4 0x2 0xf 0xf 0x28000 8fdce1d265 122a2ae083e 0
0x5 0x2 0xf 0xf 0x28004 8fdce23343 122a2ae0871 0
0x6 0x2 0xc 0xf 0x18000 8fdc9c1d00 8fdc9727bd 0
0x7 0x2 0xf 0xf 0x28000 8fdccb2fde 122a2ae0a90 0
0x8 0x2 0xf 0xf 0x28000 8fdcbfa00b 122a2ae0c59 0
0x9 0x2 0xf 0xf 0x28000 8fdce27b18 122a2ae0fc6 0
0xa 0x12 0xf 0xf 0x18004 8fdce28b93 122a2ae136b 0
0xb 0x2 0xf 0xf 0x28000 8fdce29e6d 122a2ae1397 0
0xc 0x2 0xa 0xa 0x18000 12533f38d16 1253f7c7969 0xffffff1998b04260
0xd 0x12 0xf 0xf 0x18004 8fdce3276a 122a2ae17a0 0
0xe 0x2 0xf 0xf 0x28000 8fdce36e97 122a2ae1972 0
0xf 0x2 0xf 0xf 0x28000 8fdc9f3451 122a2ae1c7a 0
Former user commented on 2011-11-22T13:05:36.000-0500:
Adding a target fix version of 6.5.2 so that, if we can fix this, the fix will be delivered in that release.
Former user commented on 2011-11-22T21:10:42.000-0500:
Bryan found a path in KVM that can block with interrupts disabled. We call cli() in vcpu_enter_guest(), but then call inject_pending_event(), which calls kvm_cpu_get_interrupt(), which calls kvm_pic_read_irq(), which enters an adaptive mutex. However, we weren't able to observe this path with DTrace on bh1-kvm0 or on two of the JPC machines that had hit this (BYLY9R1 and H9FY9R1).
I was able to reproduce this several times on bh1-kvm0. I used the following procedure, though the specifics that actually produced it differed slightly each time:
(1) CA load should be already applied.
(2) Begin MySQL load on a smartosplus64 zone:
# zlogin f384195d-c007-4375-8424-36c5d44f6791
# while :; do sysbench --db-driver=mysql --num-threads=16 --max-requests=100000 --test=oltp --oltp-table-size=100000 --mysql-socket=/tmp/mysql.sock --mysql-password=9u2bu93uoz --oltp-read-only run ; done &
# while :; do sysbench --db-driver=mysql --num-threads=16 --max-requests=100000 --test=oltp --oltp-table-size=100000 --mysql-socket=/tmp/mysql.sock --mysql-password=9u2bu93uoz --oltp-read-only run ; done &
(3) Run DTrace:
# dtrace -n 'pic_update_irq:entry{chill(100000);}' -w
Now the way I actually triggered it varied a bit. The first time, I rebooted the VM:
# ssh root@10.2.120.25
root@dcf07f28-fe89-4564-ab12-c625795e9664:~# reboot
I repeated this three times, and then rebooted again and entered the GRUB menu. As I was typing, the system locked up (as desired).
A few other times, I was able to simply reboot the VM, set "noapic nolapic" in the GRUB prompt, and boot.
We tried fixing this by replacing the adaptive lock with a spinlock, but we still saw the system lock up. More analysis needed.
Former user commented on 2011-11-24T03:55:42.000-0500:
As it turns out, there are two separate pathologies here: one is a CPU that locks up, can no longer serve cross-calls, and never enters panic_idle(); the other is a CPU that appears to no longer serve cross-calls, but does in fact enter panic_idle() after the system has panicked. These are two different manifestations of the same problem: trying to acquire a lock (the mutex associated with the PIC) with interrupts disabled (specifically, in vcpu_enter_guest()).
If we attempt to acquire the lock and the owner is not running (or has been pinned by a lower-level interrupt), we will block, and we will leave the CPU with interrupts disabled. This leads to the first pathology.
If, however, the owner appears to be running, we will spin. But if the owner is in fact pinned by a level-15 interrupt that is in turn blocked on this CPU's ability to field an interrupt, we have deadlocked the system. This case is more complicated because when the system is NMI'd, panicstr will be set, and at that moment the spinning CPU will blow through the lock. At some point (namely, shortly after entering the guest), interrupts will be reenabled, and at that point the CPU will take the cross-call from the panic and enter panic_idle(). This gives the impression that the CPU was running (it was, after all, able to serve the cross-call from the panic), but that is merely an illusion created by the act of panicking itself.
In terms of fixing this, we attempted to turn the lock in question into a spin lock, but this still led to a hang because kvm_vcpu_kick() is called with the lock held (and can therefore deadlock on a dispatcher lock). The proper solution is therefore to move the disabling of interrupts past the call to inject_pending_event(); with this change, we have been unable to reproduce the problem.
Former user commented on 2011-11-24T03:58:47.000-0500:
I have attached the script that runs mdb to perform the analysis of the dumps shown above. The tell-tale symptoms are: (1) every CPU except one or two at high IPL (0xf), (2) one CPU showing an outstanding mbox, and (3) "inject" or "irq" being non-zero.
Linda Derezinski commented on 2011-11-28T04:48:34.000-0500:
89FY9R1 hung hard again (no love via console, ssh, or UR). NMI didn't work, so it had to be power cycled.
Former user commented on 2011-11-28T16:45:20.000-0500:
kvm commit a131c2d (branch master, by Robert Mustacchi)
OS-745 system hung doing crosscalls
Former user commented on 2011-11-28T16:47:49.000-0500:
kvm commit 007d339 (branch release-20110901, by Robert Mustacchi)
OS-745 system hung doing crosscalls
Former user commented on 2011-11-30T16:53:51.000-0500:
For future reference I'm attaching prod_state_before.txt, a log of "cactl summary" on the datacenter during the outage.