OS-7319: Kernel BAD TRAP in mac module, mac_protect_check()

Details

Issue Type:Bug
Priority:3 - Elevated
Status:Resolved
Created at:2018-10-22T07:53:49.998Z
Updated at:2018-11-01T18:04:45.156Z

People

Created by:Marsell K
Reported by:Marsell K
Assigned to:Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2018-11-01T18:04:45.142Z)

Fix Versions

2018-11-08 EGOT (Release Date: 2018-11-08)

Related Issues

Description

While building a fresh COAL, I am reliably having the kernel flame out:

> ::status
debugging crash dump vmcore.6 (64-bit) from headnode
operating system: 5.11 joyent_20181019T192419Z (i86pc)
image uuid: (not set)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe0014b1e3a0 addr=67e8afda occurred in module "mac" due to an illegal access to a user address
dump content: kernel pages only
> ::stack
mac_protect_check+0x57(fffffe0c06f8a018, fffffe0c14086060)
mac_tx+0x34a(fffffe0c06f8a018, fffffe0c14086060, c17f6e66, 1, 0)
str_mdata_fastpath_put+0x53(fffffe0c09c12ad0, fffffe0c14086060, c17f6e66, 1)
ip_xmit+0x835(fffffe0c14086060, fffffe0c0dab96a0, 180036060, 5b, c17f6e66, 0)
ire_send_wire_v4+0x401(fffffe0c6082ce48, fffffe0c14086060, fffffe0bd0295234, fffffe0c9add38c0, fffffe0bfb84d6b0)
conn_ip_output+0x190(fffffe0c14086060, fffffe0c9add38c0)
tcp_send_data+0x59(fffffe0bfb733340, fffffe0c14086060)
tcp_output+0x58c(fffffe0bfb733040, fffffe0c14086060, fffffe0bd380ce80, 0)
squeue_enter+0x42c(fffffe0bd380ce80, fffffe0c14086060, fffffe0c14086060, 1, 0, 4)
tcp_sendmsg+0x14f(fffffe0bfb733040, fffffe0c14086060, fffffe0014b1ec90, fffffe0bf98f3bc0)
so_sendmsg+0x247(fffffe0c96a6b028, fffffe0014b1ec90, fffffe0014b1ede0, fffffe0bf98f3bc0)
socket_sendmsg+0x56(fffffe0c96a6b028, fffffe0014b1ec90, fffffe0014b1ede0, fffffe0bf98f3bc0)
socket_vop_write+0x6c(fffffe0c96a6a540, fffffe0014b1ede0, 0, fffffe0bf98f3bc0, 0)
fop_write+0xf3(fffffe0c96a6a540, fffffe0014b1ede0, 0, fffffe0bf98f3bc0, 0)
write+0x250(b, 8f34b90, 27)
write32+0x1e(b, 8f34b90, 27)
_sys_sysenter_post_swapgs+0x153()

Version:

SunOS headnode 5.11 joyent_20181019T192419Z i86pc i386 i86pc

Comments

Comment by Marsell K
Created at 2018-10-22T08:18:19.361Z
Updated at 2018-10-22T09:41:17.617Z

I've uploaded the contents of /var/crash/volatile here: /marsell/public/OS-7319

The COAL VMware image: /marsell/public/OS-7319/coal-master-20181022T022745Z-gaebe5f9b-4gb.vmwarevm.tar.gz It takes a couple minutes after boot for the trap to occur.


Comment by Former user
Created at 2018-10-22T13:30:28.196Z

vmdump.6 in thoth as 4c9985ecdfeeae77dc74e2e0171dc0e0


Comment by Former user
Created at 2018-10-22T14:27:41.851Z
$ thoth debug 4c9985ecdfeeae77dc74e2e0171dc0e0
...
> $C
fffffe0014b1e4d0 mac_protect_check+0x57(fffffe0c06f8a018, fffffe0c14086060)
fffffe0014b1e580 mac_tx+0x34a(fffffe0c06f8a018, fffffe0c14086060, c17f6e66, 1, 0)
fffffe0014b1e5d0 str_mdata_fastpath_put+0x53(fffffe0c09c12ad0, fffffe0c14086060, c17f6e66, 1)
fffffe0014b1e6e0 ip_xmit+0x835(fffffe0c14086060, fffffe0c0dab96a0, 180036060, 5b, c17f6e66, 0, 0, fffffe0c9add3a28)
...
> mac_protect_check+0x57::dis
mac_protect_check+0x37:         testq  %rsi,%rsi
mac_protect_check+0x3a:         je     +0x48    <mac_protect_check+0x84>
mac_protect_check+0x3c:         leaq   -0x38(%rbp),%r13
mac_protect_check+0x40:         jmp    +0x15    <mac_protect_check+0x57>
mac_protect_check+0x42:         nopw   0x0(%rax,%rax)
mac_protect_check+0x48:         testq  %r12,%r12
mac_protect_check+0x4b:         movq   %rbx,0x0(%r13)
mac_protect_check+0x4f:         movq   %rbx,%r13
mac_protect_check+0x52:         movq   %r12,%rbx
mac_protect_check+0x55:         je     +0x29    <mac_protect_check+0x80>
mac_protect_check+0x57:         movq   (%rbx),%r12               <<<<<<<<<<<<
mac_protect_check+0x5a:         movq   %rbx,%rsi
mac_protect_check+0x5d:         movq   $0x0,(%rbx)
mac_protect_check+0x64:         movq   %r14,%rdi
mac_protect_check+0x67:         call   -0x28c   <mac_protect_check_one>
mac_protect_check+0x6c:         testl  %eax,%eax
mac_protect_check+0x6e:         je     -0x28    <mac_protect_check+0x48>
mac_protect_check+0x70:         movq   %rbx,%rdi
mac_protect_check+0x73:         movq   %r12,%rbx
mac_protect_check+0x76:         call   +0x42ae2b5       <freemsg>
mac_protect_check+0x7b:         testq  %r12,%r12
> ::regs
%rax = 0x0000000000000000                 %r9  = 0x000000000a636315
%rbx = 0x0000000067e8afda                 %r10 = 0x00000000ffffffff
%rcx = 0xfffffe0bd029528f                 %r11 = 0x0000000000000020
%rdx = 0xfffffe0014b1e300                 %r12 = 0x0000000067e8afda
%rsi = 0xfffffe0bd029528f                 %r13 = 0xfffffe0c14086060
%rdi = 0x0000000000000000                 %r14 = 0xfffffe0c06f8a018
%r8  = 0xfffffe0014b1e378                 %r15 = 0xfffffe0c06f8a488

%rip = 0xfffffffff7834587 mac_protect_check+0x57
%rbp = 0xfffffe0014b1e4d0
%rsp = 0xfffffe0014b1e490
%rflags = 0x00010202
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=<of,df,IF,tf,sf,zf,af,pf,cf>

%cs = 0x0030    %ds = 0x004b    %es = 0x004b    %fs = 0x0000
%gs = 0x01c3    %gsbase = 0xfffffffffbc4d000    %kgsbase = 0x0
%trapno = 0xe   %err = 0x0      %cr2 = 0x67e8afda       %cr3 = 0x23fccb000

mac_protect_check+0x57 seems to be line 2378

2377         for (; mp != NULL; mp = next) {
2378                 next = mp->b_next;
2379                 mp->b_next = NULL;
2380
2381                 if (mac_protect_check_one(mcip, mp) == 0) {
2382                         *tailp = mp;
2383                         tailp = &mp->b_next;
2384                 } else {
2385                         freemsg(mp);
2386                 }
2387         }

That would mean that mp->b_next for the mblk_t used in the previous iteration of this loop was 0x0000000067e8afda, which is not a valid kernel address. Likewise b_prev in the mblk_t that was passed in seems to be bogus.

> fffffe0c14086060::print mblk_t
{
    b_next = 0
    b_prev = 0x2eb2c
    b_cont = 0
    b_rptr = 0xfffffe0bd0295226
    b_wptr = 0xfffffe0bd029528f
    b_datap = 0xfffffe0bd0295180
    b_band = 0
    b_tag = 0
    b_flag = 0
    b_queue = 0
}

Comment by Dan McDonald
Created at 2018-10-22T19:15:02.631Z

Those b_next and b_prev assignments are legitimate.  TCP uses them to store sequence number data while being queued up.

What's fishy is that if you look at the stack in the TCP part of it, you see this:

tcp_send_data+0x59(fffffe0bfb733340, fffffe0c14086060 <==== THIS IS ALSO 'mp' not 'dupb(mp)' )
tcp_output+0x58c(fffffe0bfb733040, fffffe0c14086060 (THIS IS 'mp'), fffffe0bd380ce80, 0)

As the inline shouting suggests, tcp_output() should not be calling tcp_send_data with the same mblk passed in.  It always calls dupb(mp), and that the reality here doesn't jive is disturbing.

And yes, tcp_output+0x58c is the instruction-after the call to tcp_send_data():

> tcp_output+0x58c::dis -n 2
tcp_output+0x584:               movq   %rbx,%rdi
tcp_output+0x587:               call   +0x9d4   <tcp_send_data>
tcp_output+0x58c:               jmp    -0x4d4   <tcp_output+0xbd>
tcp_output+0x591:               nopl   0x0(%rax)
tcp_output+0x598:               movq   %r14,%rsi
>

Examining the flow of tcp_output() in source, here's where the b_next and b_prev of the passed-in mblk get set:

1254  	if ((mp1 = dupb(mp)) == 0)
1255  		goto no_memory;
1256  	mp->b_prev = (mblk_t *)(uintptr_t)now;
1257  	mp->b_next = (mblk_t *)(uintptr_t)snxt;

And later, tcp_send_data() gets called with mp1, NOT mp.  But the stack above shows mp1 == mp, which is disturbing.


Comment by Former user
Created at 2018-10-29T15:27:00.861Z

The real problem here is a dangling ref in the mac_sw_cksum() code. This dangling ref causes a double free of the mblk which leads to allocb() returning the same mblk that it was passed (thanks to the nature of kmem object caching). It goes something like this:

1. A message with a dedicated header mblk and a data mblk with db_ref > 1 is passed to mac_sw_cksum() on Thread A.
2. mac_sw_cksum() does a copyb/free on the data mblk, but doesn't update the header mblk's b_cont -- this leaves a dangling reference to a freed object in the header mblk.
3. Thread B reuses the newly freed mblk to send some TCP data and enters the tcp_output() function.
4. Thread A frees the header mblk and thus double frees the data mblk (the one now referenced by Thread B).
5. Thread B calls dupb() as part of tcp_output(). It returns the newly (double) freed data mblk, the same mblk that was passed as argument to dupb.
6. Thread B writes TCP data into the b_prev and b_next pointers of the original mblk -- which is really the same as the dup'd mblk.

I wrote this dtrace script to catch the impossible allocb() in action.

[root@headnode (coal-1) ~]# dtrace -qn '::tcp_output:entry { self->t = 1; } ::dupb:entry /self->t/ { this->mp = args[0]; } ::dupb:return /self->t && this->mp == args[1]/ { printf("Returned same mblk: 0x%p", this->mp); stack(); } ::tcp_output:return { self->t = 0; }'

Then I reproduced the original issue simply by booting COAL and waiting. I then verified the expected panic message, stack, and dtrace output in the crash dump.

[root@headnode (coal-1) ~]# mdb /var/crash/volatile/vmcore.4
mdb: warning: dump is from SunOS 5.11 joyent_20181026T161933Z; dcmds and macros may not match kernel implementation
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba mm fctl stmf_sbd stmf zfs lofs mpt idm sd fcp crypto random cpc logindmux ptm kvm sppp nsmb smbsrv nfs ipc ]
> 
> ::status
debugging crash dump /var/crash/volatile/vmcore.4 (64-bit) from headnode
operating system: 5.11 joyent_20181026T161933Z (i86pc)
image uuid: (not set)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe00115e83a0 addr=d59e956a occurred in module "mac" due to an illegal access to a user address

> $C
fffffe00115e84d0 mac_protect_check+0x57(fffffe0c1064ab40, fffffe0bf18ce400)
fffffe00115e8580 mac_tx+0x34a(fffffe0c1064ab40, fffffe0bf18ce400, c1919bcc, 1, 0)
fffffe00115e85d0 str_mdata_fastpath_put+0x53(fffffe0be12e8858, fffffe0bf18ce400, c1919bcc, 1)
fffffe00115e86e0 ip_xmit+0x835(fffffe0bf18ce400, fffffe0c1a3994e0, 180036060, 6b, c1919bcc, 0, 4200000000, fffffe0c8cdcdc68)
fffffe00115e8940 ire_send_wire_v4+0x401(fffffe0c379cbca8, fffffe0bf18ce400, fffffe0bf5b8a274, fffffe0c8cdcdb00, fffffe0bf9cd5920)
fffffe00115e89c0 conn_ip_output+0x190(fffffe0bf18ce400, fffffe0c8cdcdb00)
fffffe00115e89f0 tcp_send_data+0x59(fffffe0c8cde63c0, fffffe0bf18ce400)
fffffe00115e8a80 tcp_output+0x58c(fffffe0c8cde60c0, fffffe0bf18ce400, fffffe0bd3822c40, 0)
fffffe00115e8b10 squeue_enter+0x42c(fffffe0bd3822c40, fffffe0bf18ce400, fffffe0bf18ce400, 1, 0, 4, fffffe0000000007)
fffffe00115e8b80 tcp_sendmsg+0x14f(fffffe0c8cde60c0, fffffe0bf18ce400, fffffe00115e8c90, fffffe0ca92557e8)
fffffe00115e8c10 so_sendmsg+0x247(fffffe0c12a5e7e8, fffffe00115e8c90, fffffe00115e8de0, fffffe0ca92557e8)
fffffe00115e8c70 socket_sendmsg+0x56(fffffe0c12a5e7e8, fffffe00115e8c90, fffffe00115e8de0, fffffe0ca92557e8)
fffffe00115e8d10 socket_vop_write+0x6c(fffffe0be8065680, fffffe00115e8de0, 0, fffffe0ca92557e8, 0)
fffffe00115e8db0 fop_write+0xf3(fffffe0be8065680, fffffe00115e8de0, 0, fffffe0ca92557e8, 0)
fffffe00115e8e80 write+0x250(b, 8f3b1d0, 37)
fffffe00115e8eb0 write32+0x1e(b, 8f3b1d0, 37)
fffffe00115e8f10 _sys_sysenter_post_swapgs+0x153()

> ::dtrace_state 
            ADDR MINOR             PROC NAME                         FILE
fffffe0bf5d51100     2 fffffe0bfd8d1078 dtrace           fffffe0bf7989ed8
> fffffe0bf5d51100::dtrace
Returned same mblk: 0xfffffe0bf18ce400
              ip`tcp_output+0x26b
              ip`squeue_enter+0x42c
              ip`tcp_sendmsg+0x14f
              sockfs`so_sendmsg+0x247
              sockfs`socket_sendmsg+0x56
              sockfs`socket_vop_write+0x6c
              genunix`fop_write+0xf3
              genunix`write+0x250
              genunix`write32+0x1e
              unix`_sys_sysenter_post_swapgs+0x153

The dangling ref is caused by the following code in mac_sw_cksum(). If the first mblk consists solely of the ethernet header then we move it to the side temporarily.

	if (MBLKL(mp) == offset) {
		offset -= MBLKL(mp);
		/* This is guaranteed by mac_hw_emul(). */
		ASSERT3P(mp->b_cont, !=, NULL);
		skipped_hdr = mp;
		mp = mp->b_cont;

If the data block has more than one reference, we create a copy and drop our reference to the original. But we neglect to update the skipped_hdr->b_cont pointer. So any message that meets both these conditions will have its data block double freed.

	if (DB_REF(mp) > 1) {
		mblk_t *tmp = copyb(mp);

		if (tmp == NULL) {
			mac_drop_pkt(mp, "copyb failed");
			return (NULL);
		}

		tmp->b_cont = mp->b_cont;
		freeb(mp);
		mp = tmp;
	}

Comment by Former user
Created at 2018-10-29T22:25:18.492Z

I tested the fix by booting a COAL on top of the new PI and making it sure it doesn't crash (without the fix COAL reliably crashes within the first few minutes every boot).


Comment by Jira Bot
Created at 2018-11-01T18:00:00.734Z

illumos-joyent commit e6151003c8efce19988229b39ba3dbb5551dc6f1 (branch master, by Ryan Zezeski)

OS-2340 vnics should support LSO
OS-6778 MAC loopback traffic should avoid cksum work
OS-6794 want LSO support in viona
OS-7319 dangling ref in mac_sw_cksum()
OS-7331 mac_sw_cksum() drops valid UDP traffic
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Approved by: Patrick Mooney <patrick.mooney@joyent.com>