Sourceware Bugzilla – Bug 11214
tst-getpid2 should be made robust against CLONE_VM race in ld.so.
Last modified: 2012-12-19 19:23:51 UTC
On Intel Core i7, I saw

/var/log/messages-20100110:Jan 8 14:39:35 gnu-6 klogd: gdbserver[20988] trap invalid opcode ip:3df7414959 sp:7fffc77d9808 error:0 in ld-2.11.1.so[3df7400000+1e000]

when I did "make check" in gdb. The corresponding code is

_dl_x86_64_restore_sse:
# ifdef HAVE_AVX_SUPPORT
        cmpl    $0, L(have_avx)(%rip)
        js      L(no_avx6)

        vmovdqa %fs:RTLD_SAVESPACE_SSE+0*YMM_SIZE, %ymm0
        vmovdqa %fs:RTLD_SAVESPACE_SSE+1*YMM_SIZE, %ymm1
        vmovdqa %fs:RTLD_SAVESPACE_SSE+2*YMM_SIZE, %ymm2
        vmovdqa %fs:RTLD_SAVESPACE_SSE+3*YMM_SIZE, %ymm3
        vmovdqa %fs:RTLD_SAVESPACE_SSE+4*YMM_SIZE, %ymm4
        vmovdqa %fs:RTLD_SAVESPACE_SSE+5*YMM_SIZE, %ymm5
        vmovdqa %fs:RTLD_SAVESPACE_SSE+6*YMM_SIZE, %ymm6
        vmovdqa %fs:RTLD_SAVESPACE_SSE+7*YMM_SIZE, %ymm7
        ret

in sysdeps/x86_64/dl-trampoline.S. It seems like L(have_avx) is 0, instead of -1, I don't see how it can happen. Maybe gdbserver is a special case.
To reproduce on Fedora 12/x86-64:

1. Get the current gdb.
2. Build gdb.
3. Run "make check RUNTESTFLAGS=server-run.exp".

It will fail at random:

ERROR: tcl error sourcing /export/gnu/import/git/gdb/gdb/testsuite/gdb.server/server-run.exp.
ERROR: : spawn id exp7 not open
    while executing
"expect_background -nobrace -i exp7 full_buffer { } eof { # The spawn ID is already closed now (but not yet waited for). wait -i $expect_out(..."
    invoked from within
"expect_background { -i $server_spawn_id full_buffer { } eof { # The spawn ID is already closed now (but not yet waited for). wait -i $exp..."
    (procedure "gdbserver_start" line 67)
    invoked from within
"gdbserver_start "" $arguments"
    (procedure "gdbserver_spawn" line 11)
    invoked from within
"gdbserver_spawn $child_args"
    (procedure "gdbserver_run" line 20)
    invoked from within
"gdbserver_run """
    (file "/export/gnu/import/git/gdb/gdb/testsuite/gdb.server/server-run.exp" line 38)
    invoked from within
"source /export/gnu/import/git/gdb/gdb/testsuite/gdb.server/server-run.exp"
    ("uplevel" body line 1)
    invoked from within
"uplevel #0 source /export/gnu/import/git/gdb/gdb/testsuite/gdb.server/server-run.exp"
    invoked from within
"catch "uplevel #0 source $test_file_name""

Kernel message is

gdbserver[27784] trap invalid opcode ip:3df7414959 sp:173a058 error:0 in ld-2.11.1.so[3df7400000+1e000]
_dl_x86_64_save_sse was never called. However, sometimes _dl_x86_64_restore_sse is called via

(gdb) bt
#0 _dl_check_restore (avx=622750216) at ../sysdeps/x86_64/dl-check.c:18
#1 0x00007f1924fe4a3b in _dl_x86_64_restore_sse () at ../sysdeps/x86_64/dl-trampoline.S:222
#2 0x00007f1924fde315 in _dl_fixup (l=<value optimized out>, reloc_arg=<value optimized out>) at ../elf/dl-runtime.c:126
#3 0x00007f1924fe43c5 in _dl_runtime_resolve () at ../sysdeps/x86_64/dl-trampoline.S:41
#4 0x0000000000410f63 in linux_tracefork_child (arg=0x7f19251e8000) at /export/gnu/import/git/gdb/gdb/gdbserver/linux-low.c:2587
#5 0x00007f1924b3524d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb)

gdb uses

  child_pid = clone (linux_tracefork_child, stack + STACK_SIZE,
                     CLONE_VM | SIGCHLD, stack + STACK_SIZE * 2);

static int
linux_tracefork_child (void *arg)
{
  ptrace (PTRACE_TRACEME, 0, 0, 0);
  kill (getpid (), SIGSTOP);
  clone (linux_tracefork_grandchild, arg + STACK_SIZE,
         CLONE_VM | SIGCHLD, NULL);
  exit (0);
}

Since the 2 processes share the TLS and memory space, there is a race condition. Maybe gdb shouldn't use CLONE_VM on x86-64, or should link with "-z now".
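To make the sharing concrete, here is a minimal sketch (not gdb code) of a clone call with the same flags, CLONE_VM | SIGCHLD, and no CLONE_SETTLS. The names demo_child, clone_vm_shares_state, global_value and tls_value are made up for illustration. It assumes Linux with glibc, and that the __thread variable lives in the main executable so the child reaches it without a call through the dynamic linker. The child deliberately does nothing except two stores, since anything more would touch the shared TLS in the way this bug describes.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define DEMO_STACK_SIZE (64 * 1024)

/* Hypothetical demo variables: one ordinary global, one thread-local. */
static volatile int global_value;
static __thread volatile int tls_value;

/* Runs in the cloned child.  It only stores to memory and returns;
   it calls no library functions. */
static int demo_child(void *arg)
{
    (void) arg;
    global_value = 1;
    tls_value = 1;
    return 0;
}

/* Returns global_value + 2 * tls_value as seen by the parent after the
   child has exited, or -1 on error.  3 means the parent saw both of the
   child's stores, i.e. both memory and TLS were shared. */
int clone_vm_shares_state(void)
{
    char *stack = malloc(DEMO_STACK_SIZE);
    if (stack == NULL)
        return -1;

    global_value = 0;
    tls_value = 0;

    /* The stack grows down, so pass the top of the allocation. */
    pid_t pid = clone(demo_child, stack + DEMO_STACK_SIZE,
                      CLONE_VM | SIGCHLD, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited != pid)
        return -1;

    return global_value + 2 * tls_value;
}
```

On such a system the function is expected to return 3: the child's write to the __thread variable lands in the parent's copy, because both processes use the same thread pointer.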
With CLONE_VM, THREAD_GETMEM (THREAD_SELF, header.rtld_must_xmm_save) may be updated by 2 processes at the same time since parent and child share the same TLS.
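One way such a shared flag could lead to a restore without a matching save is sketched below as a deterministic, single-process model. This is an assumption for illustration only, not glibc's code: the struct and function names are hypothetical, and the protocol (remember the old flag and set it on entry to the resolver; on exit, restore registers if the flag is found cleared, then put the old value back) is a simplified guess at how the flag is used.

```c
#include <stdbool.h>

/* State both processes see when the TLS block is shared. */
struct shared_tcb {
    int must_save;         /* stand-in for header.rtld_must_xmm_save */
    bool registers_saved;  /* would be set by a save; never set in this model */
    int bogus_restores;    /* restores that ran although nothing was saved */
};

/* Per-call state; each process keeps this on its own stack. */
struct resolver_call {
    int old_must_save;
};

/* Assumed entry step: remember the old flag, then set it. */
static void resolver_enter(struct shared_tcb *tcb, struct resolver_call *call)
{
    call->old_must_save = tcb->must_save;
    tcb->must_save = 1;
}

/* Assumed exit step: a cleared flag is taken to mean "a save happened,
   so restore"; afterwards the old flag value is written back. */
static void resolver_leave(struct shared_tcb *tcb,
                           const struct resolver_call *call)
{
    if (tcb->must_save == 0 && !tcb->registers_saved)
        tcb->bogus_restores++;
    tcb->must_save = call->old_must_save;
}

/* Parent and child resolve one after the other: no bogus restore. */
int bogus_restores_sequential(void)
{
    struct shared_tcb tcb = { 0, false, 0 };
    struct resolver_call parent, child;

    resolver_enter(&tcb, &parent);
    resolver_leave(&tcb, &parent);
    resolver_enter(&tcb, &child);
    resolver_leave(&tcb, &child);
    return tcb.bogus_restores;
}

/* The child enters while the parent is still inside the resolver. */
int bogus_restores_interleaved(void)
{
    struct shared_tcb tcb = { 0, false, 0 };
    struct resolver_call parent, child;

    resolver_enter(&tcb, &parent);  /* parent remembers 0, flag = 1 */
    resolver_enter(&tcb, &child);   /* child remembers 1, flag = 1  */
    resolver_leave(&tcb, &parent);  /* parent writes back 0         */
    resolver_leave(&tcb, &child);   /* child sees 0 and "restores"  */
    return tcb.bogus_restores;
}
```

In this model the sequential order gives 0 bogus restores and the interleaved order gives 1, which would match the observation that _dl_x86_64_restore_sse ran although _dl_x86_64_save_sse was never called.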
We can put a wrapper for clone in nptl. If clone is called with CLONE_VM, we mark the ld.so TLS as shared.
IMNSHO this is a gdb bug, there is no point hacking up something in this ld.so case when many other things break equally horribly when using CLONE_VM without cloning TLS - everything that uses __thread or other thread local area fields is broken in that case. While the kernel supports all flags for clone, glibc supports only a limited subset of the combinations.
CLONE_VM is used by many applications. What do they have to do to clone TLS?
(In reply to comment #5)
> IMNSHO this is a gdb bug, there is no point hacking up something in this ld.so
> case when many other things break equally horribly when using CLONE_VM without
> cloning TLS - everything that uses __thread or other thread local area fields is
> broken in that case. While the kernel supports all flags for clone, glibc
> supports only a limited subset of the combinations.

Those applications don't use TLS and ld.so uses TLS behind their back. Shouldn't ld.so use TLS only if libpthread is used?
Posted for GDB: http://sourceware.org/ml/gdb-patches/2010-01/msg00599.html
For GDB it is now checked in FSF GDB: http://sourceware.org/ml/gdb-patches/2010-02/msg00028.html
Not a glibc problem.
nptl/tst-getpid2.c has

---
#define TEST_CLONE_FLAGS CLONE_VM
#include "tst-getpid1.c"
---

which calls clone with CLONE_VM and causes nptl/tst-getpid2.c to fail at random with "illegal hardware instruction".