OS-4198: Unable to Run Clozure Common Lisp in LX branded zones

Details

Issue Type: Bug
Priority: 4 - Normal
Status: Resolved
Created at: 2015-04-17T11:45:18.000Z
Updated at: 2017-12-14T17:24:19.174Z

People

Created by: Former user
Reported by: Former user
Assigned to: Former user

Resolution

Duplicate: The problem is a duplicate of an existing issue.
(Resolution Date: 2015-10-21T17:57:40.000Z)

Related Issues

Labels

lxbrand

Description

Hi,

Customer has written in with the following:

I wonder if you have any idea why CCL won’t start up in an lx zone (64 BIT). I downloaded it from here: http://ccl.clozure.com/download.html and tried to run the 64 binary for Linux in an Ubuntu 14.04 lx zone. I have pasted the result at the bottom of this email. Ultimately I will probably look to using the Solaris version of CCL, which works fine in SmartOS, but at the moment our application is built on the Linux version, so I need to run that for the time being. If this worked it would be fantastic.

root@5661cfef-95d4-4990-b067-67ba492f9044:~/ccl# ./lx86cl64 
Unhandled exception 11 at 0x414260, context->regs at #x7fffffefef48 
Exception occurred while executing foreign code 
at start_lisp + 55 
received signal 11; faulting address: 0x2c 
address not mapped to object 
? for help 
[10279] Clozure CL kernel debugger: B

Frame pointer [#x0] in unknown area. 
[10279] Clozure CL kernel debugger: T 
Current Thread Context Record (tcr) = 0x7fffff100770 
Control (C) stack area: low = 0x7fffffc9c000, high = 0x7fffffeff630 
Value (lisp) stack area: low = 0x7fffddc00000, high = 0x7fffdde4c000 
Exception stack pointer = 0x7fffffeff5d0 
[10279] Clozure CL kernel debugger: M 
Lisp memory areas: 
code low high 
dynamic (9) 0x3020004592c0 0x302002460000 
dynamic (9) 0x3020004592c0 0x3020004592c0 
dynamic (9) 0x3020004592c0 0x3020004592c0 
dynamic (9) 0x302000000000 0x3020004592c0 
static (8) 0x12000 0x14000 
managed static (7) 0x300040000000 0x300040523000 
readonly (4) 0x300000000000 0x300040000000 
tstack (3) 0x7fffddf9c000 0x7fffddffe000 
vstack (2) 0x7fffddc00000 0x7fffdde4c000 
cstack (1) 0x7fffffc9c000 0x7fffffeff630

Jay

Comments

Comment by Former user
Created at 2015-04-17T12:06:41.000Z

@accountid:62561aa213c2c8006a35f07b

I see that this has been brought up over on smartos-discuss as well - wanted to tie that to this ticket.

Presumably by platform you mean the SmartOS version:-
# uname -a
SunOS 00-26-b9-54-f3-87 5.11 joyent_20150320T034246Z i86pc i386 i86pc

The image uuid that the lx zone was created from is:-
818cc79e-ceb3-11e4-99ee-7bc8c674e754
(ubuntu-14.04-lx 20150320 LX-brand BETA image.)

Jay


Comment by Former user
Created at 2015-04-17T12:15:03.000Z
Updated at 2017-12-14T17:24:16.408Z

I downloaded the sw and reproduced this segv on the latest platform.


Comment by Former user
Created at 2015-04-17T14:58:01.000Z
Updated at 2017-12-14T17:24:16.627Z

The code seems to be expecting the segment for the %gs register to be set up a certain way. In my zone I get this:

Unhandled exception 11 at 0x414200, context->regs at #x7fffffeff288
Exception occurred while executing foreign code
 at start_lisp + 55
received signal 11; faulting address: 0x2c

If we look at the code where we took the exception we see:

> 414200::dis
start_lisp+0x17:                xorq   %r8,%r8
start_lisp+0x1a:                xorq   %rbx,%rbx
start_lisp+0x1d:                xorq   %r9,%r9
start_lisp+0x20:                xorq   %r10,%r10
start_lisp+0x23:                xorq   %r13,%r13
start_lisp+0x26:                xorq   %r15,%r15
start_lisp+0x29:                xorq   %r14,%r14
start_lisp+0x2c:                xorq   %r12,%r12
start_lisp+0x2f:                xorq   %r11,%r11
start_lisp+0x32:                pxor   %xmm15,%xmm15
start_lisp+0x37:                stmxcsr %gs:0x2c                       <-------
start_lisp+0x40:                andb   $0xc0,%gs:0x2c
start_lisp+0x49:                ldmxcsr %gs:0x28
start_lisp+0x52:                movq   $0x0,%gs:0xb0
start_lisp+0x5f:                call   -0x14d   <0x4140e0>
start_lisp+0x64:                movq   $0x1,%gs:0xb0
start_lisp+0x71:                emms   
start_lisp+0x73:                addq   $0x8,%rsp
start_lisp+0x77:                popq   %r15
start_lisp+0x79:                popq   %r14
start_lisp+0x7b:                popq   %r13

Our %gs register value is 0. When I run this in a kvm zone then attach to the process with gdb, I see this for the register values:

(gdb) info registers
rax            0xfffffffffffffdfc	-516
rbx            0x3020000b1253	52913997812307
rcx            0xffffffffffffffff	-1
rdx            0x0	0
rsi            0x7fff651f8b00	140734889954048
rdi            0x7fff651f8b40	140734889954112
rbp            0x7fff651f8ad0	0x7fff651f8ad0
rsp            0x7fff651f8ac0	0x7fff651f8ac0
r8             0x7fff651f8af0	140734889954032
r9             0x7fd7944e5ef8	140563882860280
r10            0x7fff651f8a90	140734889953936
r11            0x293	659
r12            0x30004000fe1e	52777631940126
r13            0x3000004a675f	52776563009375
r14            0x7fff651f8b2d	140734889954093
r15            0x7fff651f8aed	140734889954029
rip            0x7fd7b48c0b9d	0x7fd7b48c0b9d <nanosleep+45>
eflags         0x293	[ CF AF SF IF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0

So it is not the %gs value per se; it must be how the segment is set up?


Comment by Former user
Created at 2015-04-17T16:03:05.000Z
Updated at 2017-12-14T17:24:16.671Z

It looks like we're running into the missing support for setting gs in the arch_prctl emulation. Here is the result of the arch_prctl calls from this code:

lx86cl64 105594 1 <- arch_prctl(0x1002 0xfede0900)=0
lx86cl64 105594 1 <- arch_prctl(0x1004 0xffeff928)=-22
lx86cl64 105594 1 <- arch_prctl(0x1003 0xffeff920)=0
lx86cl64 105594 1 <- arch_prctl(0x1001 0xfede0770)=-22
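
For reference, the first argument in each traced call is the arch_prctl sub-function code, and a return of -22 is -EINVAL. The sketch below decodes the four calls symbolically. The code values are the ones defined in the Linux x86-64 header asm/prctl.h; the helper name decode is made up for illustration and is not part of any LX tooling.

```python
import errno

# Sub-function codes from Linux <asm/prctl.h> (x86-64).
ARCH_CODES = {
    0x1001: "ARCH_SET_GS",
    0x1002: "ARCH_SET_FS",
    0x1003: "ARCH_GET_FS",
    0x1004: "ARCH_GET_GS",
}

# (code, return value) pairs copied from the trace in this comment.
TRACE = [(0x1002, 0), (0x1004, -22), (0x1003, 0), (0x1001, -22)]


def decode(code, ret):
    """Hypothetical helper: render one traced call in symbolic form."""
    name = ARCH_CODES.get(code, hex(code))
    result = "ok" if ret == 0 else "-" + errno.errorcode[-ret]
    return f"{name}: {result}"


for code, ret in TRACE:
    print(decode(code, ret))
```

Read this way, both FS calls succeed and both GS calls fail with EINVAL, which matches the missing GS support described above.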

Comment by Former user
Created at 2015-04-17T20:52:29.000Z
Updated at 2017-12-14T17:24:17.122Z

I have implemented the GS support in arch_prctl and we now get further before we abort with a corrupt signal handling stack. I spent some time debugging this to see if it is related to the arch_prctl change but it doesn't look like it. When I compare to an strace of this code running in kvm we execute the same stuff for a while after the arch_prctl call and eventually take the sigsegv on kvm too. However, in that case the stack is obviously not corrupt so we continue to execute. Thus, I am going to leave this bug open for tracking down the corrupt signal stack and fix the missing arch_prctl support under OS-4201.


Comment by Former user
Created at 2015-04-17T22:11:07.000Z
Updated at 2017-12-14T17:24:17.236Z

With the fix for OS-4201 in place, we die in lx_rt_sigreturn with a bad signal stack. Here is the abort msg when I have LX_VERBOSE set:

sp @ 0x7fffffeff5d0 ucp 0x7fffff04ddf0, expected 0xdeadf00d, found 0xbb403b!

Looking at this region of the stack we can see that we somehow got off by 128 bytes from where we are looking for LX_SIGRT_MAGIC:

> 7fffffeff500,128::dump
               \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
7fffffeff500:  00000000 00000000 00000000 00000000  ................
7fffffeff510:  00000000 00000000 00000000 00000000  ................
7fffffeff520:  00000000 00000000 00000000 00000000  ................
7fffffeff530:  00000000 00000000 00000000 00000000  ................
7fffffeff540:  00000000 00000000 00000000 00000000  ................
7fffffeff550:  0df0adde 00000000 a0e504ff ff7f0000  ................
7fffffeff560:  20eb04ff ff7f0000 00000000 00000000   ...............
7fffffeff570:  00000000 00000000 00000000 00000000  ................
7fffffeff580:  00000000 00000000 00000000 00000000  ................
7fffffeff590:  00000000 00000000 00000000 00000000  ................
7fffffeff5a0:  00000000 00000000 00000000 00000000  ................
7fffffeff5b0:  00000000 00000000 c804defe ff7f0000  ................
7fffffeff5c0:  0b000000 00000000 01000000 00000000  ................
7fffffeff5d0:  3b40bb00 00300000 00000000 00000000  ;@...0..........
7fffffeff5e0:  00000000 00000000 00000000 00000000  ................
7fffffeff5f0:  00000000 00000000 00000000 00000000  ................
7fffffeff600:  00000000 00000000 00000000 00000000  ................
7fffffeff610:  00000000 00000000 00000000 00000000  ................
7fffffeff620:  00000000 00000000 00000000 00000000  ................
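
The 128-byte figure can be checked directly from the dump. The sketch below is only an illustration of the arithmetic: it copies the two relevant rows from the output above, decodes their first four bytes as little-endian, and compares addresses. The helper name first_word is made up here, and LX_SIGRT_MAGIC is assumed to be the "expected" value from the abort message.

```python
import struct

# Rows copied from the ::dump output above (address -> first 8 bytes, hex).
ROWS = {
    0x7FFFFFEFF550: "0df0adde00000000",
    0x7FFFFFEFF5D0: "3b40bb0000300000",
}
LX_SIGRT_MAGIC = 0xDEADF00D  # assumed: the "expected" value in the abort message


def first_word(hex_bytes):
    """Decode the first 4 bytes of a dump row as a little-endian uint32."""
    return struct.unpack_from("<I", bytes.fromhex(hex_bytes))[0]


words = {addr: first_word(row) for addr, row in ROWS.items()}
magic_addr = next(a for a, w in words.items() if w == LX_SIGRT_MAGIC)
sp = 0x7FFFFFEFF5D0  # the sp reported in the abort message

print(hex(words[sp]))   # value found at sp
print(sp - magic_addr)  # distance from the magic to sp, in bytes
```

The word at sp decodes to the "found" value in the abort message, and the magic sits 0x80 (128) bytes below it at 0x7fffffeff550.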

Comment by Former user
Created at 2015-04-20T18:15:11.000Z
Updated at 2017-12-14T17:24:18.905Z

The stack dump above is a red herring. What is happening here is that we've set up to take signals on an alt stack using sigaltstack. When we take the signal we construct our frame with LX_SIGRT_MAGIC on the alt stack. While in the signal handler the user-level code calls sigaltstack to retrieve the info about the original stack and manually switches itself back onto that stack (i.e. the user-level code must be explicitly setting its sp since there is no syscall to switch). The signal handler makes a couple more rt_sigprocmask syscalls which we can see are now being handled on the main stack. Then the signal handler calls rt_sigreturn but since our context data is saved on the alt stack, we don't find what we're looking for and we abort.
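
The sequence can be illustrated with a toy model. This is not the lx brand code: memory is a dictionary, the stack addresses are arbitrary, and the class and method names are invented for this sketch. It also simplifies the frame lookup to "the word at the current sp". It only shows why a handler that moves its own stack pointer before calling rt_sigreturn defeats a check that expects the frame where the signal was delivered.

```python
LX_SIGRT_MAGIC = 0xDEADF00D


class ToyThread:
    """Toy model only: 'memory' is a dict and stacks are plain addresses."""

    def __init__(self, main_sp, alt_sp):
        self.memory = {}
        self.main_sp = main_sp
        self.alt_sp = alt_sp
        self.sp = main_sp

    def deliver_signal(self):
        # The emulation builds its frame, tagged with the magic, on the alt stack.
        self.sp = self.alt_sp
        self.memory[self.sp] = LX_SIGRT_MAGIC

    def switch_to_main_stack(self):
        # The application resets its own stack pointer; no syscall is involved.
        self.sp = self.main_sp

    def rt_sigreturn(self):
        # The emulation looks for its frame at the current stack pointer.
        found = self.memory.get(self.sp, 0)
        if found != LX_SIGRT_MAGIC:
            raise RuntimeError(f"expected {LX_SIGRT_MAGIC:#x}, found {found:#x}")
        return "context restored"


# A handler that returns from the stack it was delivered on.
well_behaved = ToyThread(main_sp=0x7000, alt_sp=0x9000)
well_behaved.deliver_signal()
print(well_behaved.rt_sigreturn())

# A handler that switches stacks by hand first, as described above.
ccl_like = ToyThread(main_sp=0x7000, alt_sp=0x9000)
ccl_like.deliver_signal()
ccl_like.switch_to_main_stack()
try:
    ccl_like.rt_sigreturn()
except RuntimeError as err:
    print("abort:", err)
```

In the second case the frame is still intact on the alt stack; it is simply not where the return path looks for it.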


Comment by Former user
Created at 2015-04-20T19:18:47.000Z
Updated at 2017-12-14T17:24:19.144Z

For me the easiest thing to see exactly what was happening here was to set LX_VERBOSE and LX_DEBUG then just run ./lx86cl64. It is then pretty easy to see the behavior with the alt stack.

See also the app src here: switch_to_foreign_stack in lisp-kernel/x86-asmutils64.s and handle_signal_on_foreign_stack in lisp-kernel/x86-exceptions.c


Comment by Former user
Created at 2015-04-22T00:22:31.000Z

The signal handling code in this Lisp implementation is absolutely abysmal; a true crime against the field. The code makes a copy of the ucontext_t and the siginfo_t from the sigaltstack(2) alternative stack (where the signal is delivered) onto some bizarro mirror universe Lisp stack. Later, rt_sigreturn(2) is called with %rsp set to the target of the copying; i.e. to a location on the Lisp stack.

Tragically, copying these two objects independently like this assumes something which is really not part of the signal handler interface: the ordering of the ucontext_t and siginfo_t on the signal delivery stack. The only thing that a signal handler should consume here is the locations of these structures, which is what the kernel passes to the handler as arguments. On 64-bit Linux, the layout is different to the 32-bit Linux layout; our code must be changed to reflect this difference or Clozure Common Lisp cannot be expected to function.

I have never come across a Lisp implementation that seemed like a good idea, and this one is doing the field of functional programming no favours in that respect.


Comment by Former user
Created at 2015-10-21T17:57:06.000Z

This appears to work -- I believe it was addressed by OS-4215. I'm closing this out as a dup of OS-4215 for now (and it would be great to reach back out to the customer to re-test their app).


Comment by Former user
Created at 2015-10-22T11:39:39.000Z

@accountid:62561aa04f1d57006a24d403 - just fyi, the customer writes back:

----

Just to confirm, this fix will be present in the following SmartOS build won’t it: SunOS 0c-c4-7a-6a-3c-86 5.11 joyent_20150709T171818Z i86pc i386 i86pc ?

Assuming this is the case, a while ago I downloaded an updated build and found that the Clozure Common Lisp version we are using did run in it, whereas before it immediately dropped into the lisp kernel debugger, so it seems that it is working. Likewise on the build above it’s working ok.

The only issue that I then found is that when I loaded a large lisp image I had saved and it started doing things which caused lots of memory to be allocated it caused a ‘lisp panic’ again. I narrowed this down to simply allocating memory in a tight loop. HOWEVER, even though this didn’t seem to crop up as much on a ‘normal’ Linux system, there is mention of there being an issue with the GC in that version of CCL (when I Google for the error message which came up), so I tried in a later version and there is no problem, which suggests that it’s just a problem in Clozure Common Lisp 1.6 which happens to show up more quickly in an LX zone than in normal linux. Our CCL does sometimes ‘panic’ in normal linux too, but it’s fairly rare.

I could send more details if required, but I think it probably makes sense to just use the newer release of CCL and assume that it’s 1.6 which is broken and not an issue with the LX stuff. Thank you very much for fixing this issue (to whoever fixed it) - from the bug report it looks like CCL is doing some really funky stuff which was a real challenge for the LX compatibility layer. Sorry I managed to find such awkward software :) Also, so far I’m extremely impressed by how well the LX brand works. I just need to start deploying it in production and see how it shakes down.

----