OS-3838: panic: 'assertion failed: 0 == dmu_object_info(os, bpo.bpo_phys->bpo_subobjs, &doi) (0x0 == 0x2), file: ../../common/fs/zfs/bpobj.c, line: 109'

Details

Issue Type: Bug
Priority: 4 - Normal
Status: Open
Created at: 2015-02-13T06:43:59.000Z
Updated at: 2018-05-07T17:51:24.370Z

People

Created by: Former user
Reported by: Josh Wilsdon

Description

My COAL panicked while I had two sessions running. In one session I was running the vmadm tests; in the other I was deleting 23 VMs (serially) left over from failed runs of previous tests (during which I had been making changes).

When it got to the 23rd VM, the kernel panicked.

When it came back up, I tried to delete the VM again and it panicked again. It also panicked when I tried to destroy the zone's zfs filesystem manually with `zfs destroy`.

I was able to `zfs send` a new snapshot of that dataset, receive it under a different name, and destroy that.

The dump from the first panic is at:

/Joyent_Dev/stor/dumps/zfs_bpobj_assertion/vmdump.0

Comments

Comment by Josh Wilsdon
Created at 2015-02-13T06:45:07.000Z
Updated at 2018-05-07T17:51:24.349Z

The initial data I got was:

> ::status
debugging crash dump vmcore.0 (64-bit) from headnode
operating system: 5.11 joyent_20150212T122955Z (i86pc)
image uuid: (not set)
panic message: assertion failed: 0 == dmu_object_info(os, bpo.bpo_phys->bpo_subobjs, &doi) (0x0 == 0x2), file: ../../common/fs/zfs/bpobj.c, line: 109
dump content: kernel pages only
> ::stack
vpanic()
0xfffffffffba6b06d()
bpobj_free+0x282(ffffff0254c30ac0, ee8, ffffff025f493e80)
dsl_deadlist_free+0x83(ffffff0254c30ac0, edf, ffffff025f493e80)
dsl_destroy_head_sync_impl+0xe2(ffffff039fd18600, ffffff025f493e80)
dsl_destroy_head_sync+0x73(ffffff00113f1b50, ffffff025f493e80)
dsl_sync_task_sync+0x10a(ffffff00113f1a50, ffffff025f493e80)
dsl_pool_sync+0x28b(ffffff0254bf7980, 26cb)
spa_sync+0x27e(ffffff0254bf2000, 26cb)
txg_sync_thread+0x227(ffffff0254bf7980)
thread_start+8()
>

Comment by Former user
Created at 2015-02-13T07:07:46.000Z

For some reason, we've gotten ENOENT from dnode_hold_impl(), which is what caused dmu_object_info() to return 0x2 and thus fail the assertion. Looking at the stack and doing a bit of grovelling, I was able to find what looks suspiciously like our bpobj_t:

> ffffff000c78e6f0-100,100/nap ! less
> 0xffffff000c78e660::print bpobj_t
{
    bpo_lock = {
        _opaque = [ 0xffffff000c78ec40 ]
    }
    bpo_os = 0xffffff0254c30ac0
    bpo_object = 0xee8
    bpo_epb = 0x400
    bpo_havecomp = 0x1
    bpo_havesubobj = 0x1
    bpo_phys = 0xffffff028c78a000
    bpo_dbuf = 0xffffff0269a07a30
    bpo_cached_dbuf = 0
}
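
For reference, the failing VERIFY lives near the top of bpobj_free(). The sketch below is only an approximation of the illumos bpobj.c of this era (not a verbatim copy), showing how the fields in the dump above feed into the call that tripped the assertion at line 109:

/* Approximate sketch of bpobj_free(), not verbatim illumos source. */
void
bpobj_free(objset_t *os, uint64_t obj, dmu_tx_t *tx)
{
        bpobj_t bpo;
        dmu_object_info_t doi;

        /* obj is the bpobj being freed -- 0xee8 in this dump. */
        VERIFY3U(0, ==, bpobj_open(&bpo, os, obj));
        mutex_enter(&bpo.bpo_lock);

        if (!bpo.bpo_havesubobj || bpo.bpo_phys->bpo_subobjs == 0)
                goto out;

        /*
         * bpo_havesubobj is 1 and bpo_subobjs is 0xeea, so we look up the
         * subobjs object before walking it to free each sub-bpobj.
         * dmu_object_info() returned ENOENT (0x2) here, which is the
         * assertion at bpobj.c:109 that panicked.
         */
        VERIFY3U(0, ==, dmu_object_info(os, bpo.bpo_phys->bpo_subobjs, &doi));

        /* ... walk bpo_subobjs, bpobj_free() each entry, then free this object ... */
out:
        mutex_exit(&bpo.bpo_lock);
        bpobj_close(&bpo);
}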

This gives us the subobject that we're trying to call dmu_object_info() on:

> 0xffffff000c78e660::print bpobj_t bpo_phys->bpo_subobjs
bpo_phys->bpo_subobjs = 0xeea

And if we try to find this dbuf with the ::dbufs dcmd:

> ::dbufs -O ffffff0254c30ac0 -o 0xeea
>

We get nothing. So that explains the ENOENT, but it seems like a fundamental violation of what the code expects. So why did we lose this dbuf, which is a subobj of the current child? Really quite odd.
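
To make the ENOENT explicit: dmu_object_info() is essentially just a dnode lookup, so a freed or never-allocated object number comes straight back as errno 2. Roughly (paraphrasing the illumos DMU code, not verbatim source):

/* Rough paraphrase of the illumos DMU object-info path, not verbatim source. */
int
dmu_object_info(objset_t *os, uint64_t object, dmu_object_info_t *doi)
{
        dnode_t *dn;
        int err;

        /*
         * dnode_hold() goes through dnode_hold_impl() with
         * DNODE_MUST_BE_ALLOCATED, which returns ENOENT (0x2) when the
         * dnode is not allocated -- exactly what we got for object 0xeea.
         */
        err = dnode_hold(os, object, FTAG, &dn);
        if (err)
                return (err);

        if (doi != NULL)
                dmu_object_info_from_dnode(dn, doi);

        dnode_rele(dn, FTAG);
        return (0);
}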