Issue Type: | Bug |
---|---|
Priority: | 4 - Normal |
Status: | Resolved |
Created at: | 2015-04-17T11:45:18.000Z |
Updated at: | 2017-12-14T17:24:19.174Z |
Created by: | Former user |
---|---|
Reported by: | Former user |
Assigned to: | Former user |
Duplicate: The problem is a duplicate of an existing issue.
(Resolution Date: 2015-10-21T17:57:40.000Z)
Hi,
Customer has written in with the following:
I wonder if you have any idea why CCL won’t start up in an lx zone (64 BIT). I downloaded it from here: http://ccl.clozure.com/download.html and tried to run the 64 binary for Linux in an Ubuntu 14.04 lx zone. I have pasted the result at the bottom of this email. Ultimately I will probably look to using the Solaris version of CCL, which works fine in SmartOS, but at the moment our application is built on the Linux version, so I need to run that for the time being. If this worked it would be fantastic.
    root@5661cfef-95d4-4990-b067-67ba492f9044:~/ccl# ./lx86cl64
    Unhandled exception 11 at 0x414260, context->regs at #x7fffffefef48
    Exception occurred while executing foreign code at start_lisp + 55
    received signal 11; faulting address: 0x2c
    address not mapped to object
    ? for help
    [10279] Clozure CL kernel debugger: B
    Frame pointer [#x0] in unknown area.
    [10279] Clozure CL kernel debugger: T
    Current Thread Context Record (tcr) = 0x7fffff100770
    Control (C) stack area: low = 0x7fffffc9c000, high = 0x7fffffeff630
    Value (lisp) stack area: low = 0x7fffddc00000, high = 0x7fffdde4c000
    Exception stack pointer = 0x7fffffeff5d0
    [10279] Clozure CL kernel debugger: M
    Lisp memory areas:
    code low high
    dynamic (9)         0x3020004592c0  0x302002460000
    dynamic (9)         0x3020004592c0  0x3020004592c0
    dynamic (9)         0x3020004592c0  0x3020004592c0
    dynamic (9)         0x302000000000  0x3020004592c0
    static (8)          0x12000         0x14000
    managed static (7)  0x300040000000  0x300040523000
    readonly (4)        0x300000000000  0x300040000000
    tstack (3)          0x7fffddf9c000  0x7fffddffe000
    vstack (2)          0x7fffddc00000  0x7fffdde4c000
    cstack (1)          0x7fffffc9c000  0x7fffffeff630
Jay
@accountid:62561aa213c2c8006a35f07b
I see that this has been brought up over on smartos-discuss as well - wanted to tie that to this ticket.
Presumably by platform you mean the SmartOS version:
    # uname -a
    SunOS 00-26-b9-54-f3-87 5.11 joyent_20150320T034246Z i86pc i386 i86pc
The image uuid that the lx zone was created from is 818cc79e-ceb3-11e4-99ee-7bc8c674e754 (ubuntu-14.04-lx 20150320 LX-brand BETA image).
Jay
I downloaded the sw and reproduced this segv on the latest platform.
The code seems to be expecting the segment for the %gs register to be set up a certain way. In my zone I get this:
    Unhandled exception 11 at 0x414200, context->regs at #x7fffffeff288
    Exception occurred while executing foreign code at start_lisp + 55
    received signal 11; faulting address: 0x2c
If we look at the code where we took the exception we see:
    > 414200::dis
    start_lisp+0x17:  xorq     %r8,%r8
    start_lisp+0x1a:  xorq     %rbx,%rbx
    start_lisp+0x1d:  xorq     %r9,%r9
    start_lisp+0x20:  xorq     %r10,%r10
    start_lisp+0x23:  xorq     %r13,%r13
    start_lisp+0x26:  xorq     %r15,%r15
    start_lisp+0x29:  xorq     %r14,%r14
    start_lisp+0x2c:  xorq     %r12,%r12
    start_lisp+0x2f:  xorq     %r11,%r11
    start_lisp+0x32:  pxor     %xmm15,%xmm15
    start_lisp+0x37:  stmxcsr  %gs:0x2c   <-------
    start_lisp+0x40:  andb     $0xc0,%gs:0x2c
    start_lisp+0x49:  ldmxcsr  %gs:0x28
    start_lisp+0x52:  movq     $0x0,%gs:0xb0
    start_lisp+0x5f:  call     -0x14d <0x4140e0>
    start_lisp+0x64:  movq     $0x1,%gs:0xb0
    start_lisp+0x71:  emms
    start_lisp+0x73:  addq     $0x8,%rsp
    start_lisp+0x77:  popq     %r15
    start_lisp+0x79:  popq     %r14
    start_lisp+0x7b:  popq     %r13
Our %gs register value is 0. When I run this in a kvm zone then attach to the process with gdb, I see this for the register values:
    (gdb) info registers
    rax     0xfffffffffffffdfc  -516
    rbx     0x3020000b1253      52913997812307
    rcx     0xffffffffffffffff  -1
    rdx     0x0                 0
    rsi     0x7fff651f8b00      140734889954048
    rdi     0x7fff651f8b40      140734889954112
    rbp     0x7fff651f8ad0      0x7fff651f8ad0
    rsp     0x7fff651f8ac0      0x7fff651f8ac0
    r8      0x7fff651f8af0      140734889954032
    r9      0x7fd7944e5ef8      140563882860280
    r10     0x7fff651f8a90      140734889953936
    r11     0x293               659
    r12     0x30004000fe1e      52777631940126
    r13     0x3000004a675f      52776563009375
    r14     0x7fff651f8b2d      140734889954093
    r15     0x7fff651f8aed      140734889954029
    rip     0x7fd7b48c0b9d      0x7fd7b48c0b9d <nanosleep+45>
    eflags  0x293               [ CF AF SF IF ]
    cs      0x33                51
    ss      0x2b                43
    ds      0x0                 0
    es      0x0                 0
    fs      0x0                 0
    gs      0x0                 0
So it is not the %gs value per se; it must be how the segment is set up?
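A quick arithmetic check ties the crash report to the disassembly. This is only a sketch with illustrative variable names, not something taken from the ticket. It assumes what the register dumps suggest: in 64-bit mode a %gs-relative access is resolved against the GS base address, which is set separately from the selector value that gdb prints, and that base is still zero in the lx zone.

```python
# The CCL kernel debugger reports the fault at "start_lisp + 55" (decimal).
fault_offset = 55
# The disassembly marks the faulting instruction at start_lisp+0x37.
assert fault_offset == 0x37

# "stmxcsr %gs:0x2c" stores to (GS base + 0x2c).
GS_DISPLACEMENT = 0x2C
gs_base_unset = 0  # assumption: nothing has set the GS base in the lx zone

effective_address = gs_base_unset + GS_DISPLACEMENT
print(hex(effective_address))  # 0x2c, the reported faulting address
```

With a zero base, the store lands at 0x2c, which is unmapped, so the selector being 0 in both environments says nothing about whether the base was configured.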
It looks like we're running into the missing support for setting gs in the arch_prctl emulation. Here is the result of the arch_prctl calls from this code:
    lx86cl64 105594 1 <- arch_prctl(0x1002 0xfede0900)=0
    lx86cl64 105594 1 <- arch_prctl(0x1004 0xffeff928)=-22
    lx86cl64 105594 1 <- arch_prctl(0x1003 0xffeff920)=0
    lx86cl64 105594 1 <- arch_prctl(0x1001 0xfede0770)=-22
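To make the trace easier to read, here is a small sketch that decodes the subcommand codes and return values. The code-to-name mapping is the standard Linux x86-64 arch_prctl(2) numbering; the helper name `decode` and the regular expression are illustrative and not part of any tool used in this ticket. It assumes a negative return value is a negated errno.

```python
import errno
import re

# arch_prctl(2) subcommand codes on Linux x86-64.
ARCH_CODES = {
    0x1001: "ARCH_SET_GS",
    0x1002: "ARCH_SET_FS",
    0x1003: "ARCH_GET_FS",
    0x1004: "ARCH_GET_GS",
}

# Trace lines copied from the output above.
TRACE = """\
lx86cl64 105594 1 <- arch_prctl(0x1002 0xfede0900)=0
lx86cl64 105594 1 <- arch_prctl(0x1004 0xffeff928)=-22
lx86cl64 105594 1 <- arch_prctl(0x1003 0xffeff920)=0
lx86cl64 105594 1 <- arch_prctl(0x1001 0xfede0770)=-22
"""

CALL = re.compile(r"arch_prctl\((0x[0-9a-f]+) (0x[0-9a-f]+)\)=(-?\d+)")


def decode(trace):
    """Return (subcommand name, status) for each arch_prctl call in the trace."""
    rows = []
    for code, _addr, ret in CALL.findall(trace):
        ret = int(ret)
        # Assumption: a negative return is -errno, so -22 maps to EINVAL.
        status = "ok" if ret == 0 else errno.errorcode[-ret]
        rows.append((ARCH_CODES[int(code, 16)], status))
    return rows


for name, status in decode(TRACE):
    print(f"{name:12} {status}")
```

Read this way, both FS calls succeed and both GS calls (get and set) fail with EINVAL, which is consistent with GS support being the missing piece.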
I have implemented the GS support in arch_prctl and we now get further before we abort with a corrupt signal handling stack. I spent some time debugging this to see if it is related to the arch_prctl change, but it doesn't look like it. When I compare to an strace of this code running in kvm, we execute the same stuff for a while after the arch_prctl call and eventually take the sigsegv on kvm too. However, in that case the stack is obviously not corrupt, so we continue to execute. Thus, I am going to leave this bug open for tracking down the corrupt signal stack and fix the missing arch_prctl support under OS-4201.
With the fix for OS-4201 in place, we die in lx_rt_sigreturn with a bad signal stack. Here is the abort msg when I have LX_VERBOSE set:
sp @ 0x7fffffeff5d0 ucp 0x7fffff04ddf0, expected 0xdeadf00d, found 0xbb403b!
Looking at this region of the stack we can see that we somehow got off by 128 bytes from where we are looking for LX_SIGRT_MAGIC:
    > 7fffffeff500,128::dump
                   \/ 1 2 3 4 5 6 7 8 9 a b c d e f  v123456789abcdef
    7fffffeff500:  00000000 00000000 00000000 00000000  ................
    7fffffeff510:  00000000 00000000 00000000 00000000  ................
    7fffffeff520:  00000000 00000000 00000000 00000000  ................
    7fffffeff530:  00000000 00000000 00000000 00000000  ................
    7fffffeff540:  00000000 00000000 00000000 00000000  ................
    7fffffeff550:  0df0adde 00000000 a0e504ff ff7f0000  ................
    7fffffeff560:  20eb04ff ff7f0000 00000000 00000000   ...............
    7fffffeff570:  00000000 00000000 00000000 00000000  ................
    7fffffeff580:  00000000 00000000 00000000 00000000  ................
    7fffffeff590:  00000000 00000000 00000000 00000000  ................
    7fffffeff5a0:  00000000 00000000 00000000 00000000  ................
    7fffffeff5b0:  00000000 00000000 c804defe ff7f0000  ................
    7fffffeff5c0:  0b000000 00000000 01000000 00000000  ................
    7fffffeff5d0:  3b40bb00 00300000 00000000 00000000  ;@...0..........
    7fffffeff5e0:  00000000 00000000 00000000 00000000  ................
    7fffffeff5f0:  00000000 00000000 00000000 00000000  ................
    7fffffeff600:  00000000 00000000 00000000 00000000  ................
    7fffffeff610:  00000000 00000000 00000000 00000000  ................
    7fffffeff620:  00000000 00000000 00000000 00000000  ................
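The numbers in the abort message can be cross-checked against the dump with a few lines of arithmetic. This is a sketch with illustrative names; it assumes the dump shows raw bytes in memory order on a little-endian machine, so each 32-bit word has to be byte-swapped to read it as a value.

```python
import struct

# First 32-bit word of two rows from the dump, as raw bytes in memory order.
WORD_AT_550 = "0df0adde"  # row 7fffffeff550
WORD_AT_5D0 = "3b40bb00"  # row 7fffffeff5d0, the sp from the abort message


def le32(hex_bytes):
    """Interpret four dumped bytes as a little-endian 32-bit value."""
    return struct.unpack("<I", bytes.fromhex(hex_bytes))[0]


magic_addr = 0x7FFFFFEFF550
sp = 0x7FFFFFEFF5D0

print(hex(le32(WORD_AT_550)))  # 0xdeadf00d, the expected magic
print(hex(le32(WORD_AT_5D0)))  # 0xbb403b, the value actually found at sp
print(sp - magic_addr)         # 128
```

So the magic is present in memory, just 128 bytes below the address being checked.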
The stack dump above is a red herring. What is happening here is that we've set up to take signals on an alt stack using sigaltstack. When we take the signal we construct our frame with LX_SIGRT_MAGIC on the alt stack. While in the signal handler, the user-level code calls sigaltstack to retrieve the info about the original stack and manually switches itself back onto that stack (i.e. the user-level code must be explicitly setting its sp, since there is no syscall to switch). The signal handler makes a couple more rt_sigprocmask syscalls, which we can see are now being handled on the main stack. Then the signal handler calls rt_sigreturn, but since our context data is saved on the alt stack, we don't find what we're looking for and we abort.
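The sequence described above can be illustrated with a toy model. This is not the lx brand code: the class, its methods, and the addresses are all hypothetical, memory is just a dict, and the "look for the magic at the current sp" check is a simplification inferred from the abort message.

```python
LX_SIGRT_MAGIC = 0xDEADF00D  # value taken from the abort message


class ToyProcess:
    """Toy model only: memory is a dict of address -> word."""

    def __init__(self, main_sp, alt_sp):
        self.memory = {}
        self.main_sp = main_sp
        self.alt_sp = alt_sp
        self.sp = main_sp

    def deliver_signal(self):
        # The emulation builds its frame, tagged with the magic, on the alt stack.
        self.memory[self.alt_sp] = LX_SIGRT_MAGIC
        self.sp = self.alt_sp

    def switch_to_main_stack(self):
        # User-level code rewrites sp itself; no syscall is involved.
        self.sp = self.main_sp

    def rt_sigreturn(self):
        # The emulation expects to find its frame at the current sp.
        return self.memory.get(self.sp) == LX_SIGRT_MAGIC


# A handler that stays on the alt stack returns cleanly.
ordinary = ToyProcess(main_sp=0x7000, alt_sp=0x9000)
ordinary.deliver_signal()
print(ordinary.rt_sigreturn())  # True

# A handler that hops back to the main stack first does not.
ccl_like = ToyProcess(main_sp=0x7000, alt_sp=0x9000)
ccl_like.deliver_signal()
ccl_like.switch_to_main_stack()
print(ccl_like.rt_sigreturn())  # False
```

The frame is intact in both cases; the second one fails only because sp no longer points at the stack where the frame was built.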
For me the easiest way to see exactly what was happening here was to set LX_VERBOSE and LX_DEBUG, then just run ./lx86cl64. It is then pretty easy to see the behavior with the alt stack.
See also the app src here: switch_to_foreign_stack in lisp-kernel/x86-asmutils64.s and handle_signal_on_foreign_stack in lisp-kernel/x86-exceptions.c
The signal handling code in this Lisp implementation is absolutely abysmal; a true crime against the field. The code makes a copy of the ucontext_t and the siginfo_t from the sigaltstack(2) alternative stack (where the signal is delivered) onto some bizarro mirror universe Lisp stack. Later, rt_sigreturn(2) is called with %rsp set to the target of the copying; i.e. to a location on the Lisp stack.
Tragically, copying these two objects independently like this assumes something which is really not part of the signal handler interface: the ordering of the ucontext_t and siginfo_t on the signal delivery stack. The only thing that a signal handler should consume here is the locations of these structures, which is what the kernel passes to the handler as arguments. On 64-bit Linux, the layout is different to the 32-bit Linux layout; our code must be changed to reflect this difference or Clozure Common Lisp cannot be expected to function.
I have never come across a Lisp implementation that seemed like a good idea, and this one is doing the field of functional programming no favours in that respect.
This appears to work -- I believe it was addressed by OS-4215. I'm closing this out as a dup of OS-4215 for now (and it would be great to reach back out to the customer to re-test their app).
@accountid:62561aa04f1d57006a24d403 - just fyi, the customer writes back:
----
Just to confirm, this fix will be present in the following SmartOS build won’t it: SunOS 0c-c4-7a-6a-3c-86 5.11 joyent_20150709T171818Z i86pc i386 i86pc ?
Assuming this is the case, a while ago I downloaded an updated build and found that the Clozure Common Lisp version we are using did run in it, whereas before it immediately dropped into the lisp kernel debugger, so it seems that it is working. Likewise on the build above it’s working ok.
The only issue that I then found is that when I loaded a large lisp image I had saved and it started doing things which caused lots of memory to be allocated it caused a ‘lisp panic’ again. I narrowed this down to simply allocating memory in a tight loop. HOWEVER, even though this didn’t seem to crop up as much on a ‘normal’ Linux system, there is mention of there being an issue with the GC in that version of CCL (when I Google for the error message which came up), so I tried in a later version and there is no problem, which suggests that it’s just a problem in Clozure Common Lisp 1.6 which happens to show up more quickly in an LX zone than in normal linux. Our CCL does sometimes ‘panic’ in normal linux too, but it’s fairly rare.
I could send more details if required, but I think it probably makes sense to just use the newer release of CCL and assume that it’s 1.6 which is broken and not an issue with the LX stuff. Thank you very much for fixing this issue (to whoever fixed it) - from the bug report it looks like CCL is doing some really funky stuff which was a real challenge for the LX compatibility layer. Sorry I managed to find such awkward software :) Also, so far I’m extremely impressed by how well the LX brand works. I just need to start deploying it in production and see how it shakes down.
----