Issue Type: | Bug |
---|---|
Priority: | 4 - Normal |
Status: | Open |
Created at: | 2015-02-13T06:43:59.000Z |
Updated at: | 2018-05-07T17:51:24.370Z |
Created by: | Former user |
Reported by: | Josh Wilsdon |
My COAL panicked while I had two sessions running. In one session I was running the vmadm tests; in the other I was deleting 23 VMs (serially) left over from failed runs of previous tests (while I had been making changes).
When it got to the 23rd VM the kernel panicked.
When it came back up I tried again to delete the VM and it panicked again. It also panicked when I tried to destroy the zfs filesystem from the zone manually with `zfs destroy`.
I was able to `zfs send` a new snapshot of that dataset and receive it with a different name and destroy that.
The dump from the first panic is at:
/Joyent_Dev/stor/dumps/zfs_bpobj_assertion/vmdump.0
The initial data I got was:
```
> ::status
debugging crash dump vmcore.0 (64-bit) from headnode
operating system: 5.11 joyent_20150212T122955Z (i86pc)
image uuid: (not set)
panic message: assertion failed: 0 == dmu_object_info(os, bpo.bpo_phys->bpo_subobjs, &doi) (0x0 == 0x2), file: ../../common/fs/zfs/bpobj.c, line: 109
dump content: kernel pages only
> ::stack
vpanic()
0xfffffffffba6b06d()
bpobj_free+0x282(ffffff0254c30ac0, ee8, ffffff025f493e80)
dsl_deadlist_free+0x83(ffffff0254c30ac0, edf, ffffff025f493e80)
dsl_destroy_head_sync_impl+0xe2(ffffff039fd18600, ffffff025f493e80)
dsl_destroy_head_sync+0x73(ffffff00113f1b50, ffffff025f493e80)
dsl_sync_task_sync+0x10a(ffffff00113f1a50, ffffff025f493e80)
dsl_pool_sync+0x28b(ffffff0254bf7980, 26cb)
spa_sync+0x27e(ffffff0254bf2000, 26cb)
txg_sync_thread+0x227(ffffff0254bf7980)
thread_start+8()
>
```
For some reason, we've gotten ENOENT back from dnode_hold_impl(), which is the 0x2 that failed the assertion. Looking at the stack and doing a bit of grovelling, I was able to find what looks suspiciously like our bpobj_t:
```
> ffffff000c78e6f0-100,100/nap ! less
> 0xffffff000c78e660::print bpobj_t
{
    bpo_lock = {
        _opaque = [ 0xffffff000c78ec40 ]
    }
    bpo_os = 0xffffff0254c30ac0
    bpo_object = 0xee8
    bpo_epb = 0x400
    bpo_havecomp = 0x1
    bpo_havesubobj = 0x1
    bpo_phys = 0xffffff028c78a000
    bpo_dbuf = 0xffffff0269a07a30
    bpo_cached_dbuf = 0
}
```
This gives us the subobject that we're trying to call dmu_object_info() on:
```
> 0xffffff000c78e660::print bpobj_t bpo_phys->bpo_subobjs
bpo_phys->bpo_subobjs = 0xeea
```
And if we try to find this dbuf with the `::dbufs` dcmd:
```
> ::dbufs -O ffffff0254c30ac0 -o 0xeea
>
```
We get nothing. So that explains the ENOENT, but it seems like a fundamental violation of what the code expects. So why did we lose this dbuf, which is a subobj of the current child? Really, quite odd.