Bug 11214 - tst-getpid2 should be made robust against CLONE_VM race in ld.so.
Status: REOPENED
Product: glibc
Classification: Unclassified
Component: dynamic-link
Version: 2.15
Importance: P2 normal
Target Milestone: ---
Assigned To: Not yet assigned to anyone
Reported: 2010-01-23 16:39 UTC by H.J. Lu
Modified: 2012-12-19 19:23 UTC

Target: x86_64-pc-linux-gnu


Description H.J. Lu 2010-01-23 16:39:09 UTC
On Intel Core i7, I saw

/var/log/messages-20100110:Jan  8 14:39:35 gnu-6 klogd: gdbserver[20988] trap
invalid opcode ip:3df7414959 sp:7fffc77d9808 error:0 in
ld-2.11.1.so[3df7400000+1e000]

when I did "make check" in gdb. The corresponding code is

_dl_x86_64_restore_sse:
# ifdef HAVE_AVX_SUPPORT
        cmpl    $0, L(have_avx)(%rip)
        js      L(no_avx6)

        vmovdqa %fs:RTLD_SAVESPACE_SSE+0*YMM_SIZE, %ymm0
        vmovdqa %fs:RTLD_SAVESPACE_SSE+1*YMM_SIZE, %ymm1
        vmovdqa %fs:RTLD_SAVESPACE_SSE+2*YMM_SIZE, %ymm2
        vmovdqa %fs:RTLD_SAVESPACE_SSE+3*YMM_SIZE, %ymm3
        vmovdqa %fs:RTLD_SAVESPACE_SSE+4*YMM_SIZE, %ymm4
        vmovdqa %fs:RTLD_SAVESPACE_SSE+5*YMM_SIZE, %ymm5
        vmovdqa %fs:RTLD_SAVESPACE_SSE+6*YMM_SIZE, %ymm6
        vmovdqa %fs:RTLD_SAVESPACE_SSE+7*YMM_SIZE, %ymm7
        ret

in sysdeps/x86_64/dl-trampoline.S. It seems that L(have_avx)
is 0 instead of -1, and I don't see how that can happen. Maybe
gdbserver is a special case.
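
For illustration, the flag test at the top of the routine can be modelled
in C. This is only a sketch: it assumes L(have_avx) is a tri-state flag
(0 = not yet determined, 1 = AVX available, -1 = no AVX), and the enum and
function names are hypothetical, not taken from glibc. Since "js" branches
only when the compared value is negative, a value of 0 falls through to the
%ymm loads exactly as 1 does.

```c
#include <assert.h>

/* Hypothetical model of the branch at the top of _dl_x86_64_restore_sse.
   The names here are illustrative only and do not exist in glibc.  */
enum restore_path { RESTORE_SSE, RESTORE_AVX };

enum restore_path
pick_restore_path (int have_avx)
{
  /* Assumed tri-state flag: -1, 0 or 1.  */
  assert (have_avx >= -1 && have_avx <= 1);

  /* cmpl $0, L(have_avx)(%rip); js L(no_avx6):
     the jump is taken only when the flag is negative.  */
  if (have_avx < 0)
    return RESTORE_SSE;

  /* Both 0 (assumed: not yet determined) and 1 reach the vmovdqa
     %ymm loads.  */
  return RESTORE_AVX;
}
```

Under these assumptions, a CPU without AVX that reaches this code with the
flag still at 0 would execute vmovdqa on a %ymm register, which would match
the "invalid opcode" trap in the kernel log.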
Comment 1 H.J. Lu 2010-01-25 14:06:35 UTC
To reproduce on Fedora 12/x86-64:

1. Get the current gdb.
2. Build gdb.
3. Run "make check RUNTESTFLAGS=server-run.exp".  It will
fail at random:

ERROR: tcl error sourcing
/export/gnu/import/git/gdb/gdb/testsuite/gdb.server/server-run.exp.
ERROR: : spawn id exp7 not open
    while executing
"expect_background -nobrace -i exp7 full_buffer { } eof {
	    # The spawn ID is already closed now (but not yet waited for).
	    wait -i $expect_out(..."
    invoked from within
"expect_background {
	-i $server_spawn_id
	full_buffer { }
	eof {
	    # The spawn ID is already closed now (but not yet waited for).
	    wait -i $exp..."
    (procedure "gdbserver_start" line 67)
    invoked from within
"gdbserver_start "" $arguments"
    (procedure "gdbserver_spawn" line 11)
    invoked from within
"gdbserver_spawn $child_args"
    (procedure "gdbserver_run" line 20)
    invoked from within
"gdbserver_run """
    (file "/export/gnu/import/git/gdb/gdb/testsuite/gdb.server/server-run.exp"
line 38)
    invoked from within
"source /export/gnu/import/git/gdb/gdb/testsuite/gdb.server/server-run.exp"
    ("uplevel" body line 1)
    invoked from within
"uplevel #0 source
/export/gnu/import/git/gdb/gdb/testsuite/gdb.server/server-run.exp"
    invoked from within
"catch "uplevel #0 source $test_file_name""

Kernel message is

gdbserver[27784] trap invalid opcode ip:3df7414959 sp:173a058 error:0 in
ld-2.11.1.so[3df7400000+1e000]
Comment 2 H.J. Lu 2010-01-26 23:40:57 UTC
_dl_x86_64_save_sse was never called. However, sometimes
_dl_x86_64_restore_sse is called via

(gdb) bt
#0  _dl_check_restore (avx=622750216) at ../sysdeps/x86_64/dl-check.c:18
#1  0x00007f1924fe4a3b in _dl_x86_64_restore_sse ()
    at ../sysdeps/x86_64/dl-trampoline.S:222
#2  0x00007f1924fde315 in _dl_fixup (l=<value optimized out>, 
    reloc_arg=<value optimized out>) at ../elf/dl-runtime.c:126
#3  0x00007f1924fe43c5 in _dl_runtime_resolve ()
    at ../sysdeps/x86_64/dl-trampoline.S:41
#4  0x0000000000410f63 in linux_tracefork_child (arg=0x7f19251e8000)
    at /export/gnu/import/git/gdb/gdb/gdbserver/linux-low.c:2587
#5  0x00007f1924b3524d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) 

gdb uses

  child_pid = clone (linux_tracefork_child, stack + STACK_SIZE,
                     CLONE_VM | SIGCHLD, stack + STACK_SIZE * 2);


static int
linux_tracefork_child (void *arg)
{
  ptrace (PTRACE_TRACEME, 0, 0, 0);
  kill (getpid (), SIGSTOP);
  clone (linux_tracefork_grandchild, arg + STACK_SIZE,
         CLONE_VM | SIGCHLD, NULL);
  exit (0);
}

Since 2 processes share the TLS and memory space, there is
a race condition. Maybe gdb shouldn't use CLONE_VM for x86-64
or use "-z now" for linking.
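
The sharing can be observed with a small test. This is a minimal sketch, not
code from gdb or glibc: tls_marker, child_fn and tls_shared_after_clone_vm
are hypothetical names, and it assumes Linux with glibc's clone() wrapper.
The child is created with the same flags as above (CLONE_VM | SIGCHLD, no
CLONE_SETTLS) and writes to a __thread variable; the parent then checks
whether it sees that write.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical marker variable; it lives in the caller's TLS block.  */
static __thread int tls_marker;

/* Runs in the clone child.  It makes no library calls, so it does not
   enter the lazy-binding path itself.  */
static int
child_fn (void *arg)
{
  (void) arg;
  tls_marker = 42;
  return 0;
}

/* Returns 1 if the child's store to a __thread variable is visible
   in the parent, 0 otherwise.  */
int
tls_shared_after_clone_vm (void)
{
  const size_t stack_size = 64 * 1024;
  char *stack = malloc (stack_size);
  assert (stack != NULL);

  tls_marker = 0;

  /* Shared address space, no CLONE_SETTLS: the child keeps the
     parent's thread pointer.  The stack grows down, so pass its top.  */
  pid_t pid = clone (child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
  assert (pid > 0);

  int status;
  pid_t waited = waitpid (pid, &status, 0);
  assert (waited == pid);
  free (stack);

  return tls_marker == 42;
}
```

Calling tls_shared_after_clone_vm () from main should return 1 on such a
system, i.e. the parent's "thread-local" variable was modified by another
process.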
Comment 3 H.J. Lu 2010-01-27 00:04:17 UTC
With CLONE_VM, THREAD_GETMEM (THREAD_SELF, header.rtld_must_xmm_save)
may be updated by 2 processes at the same time since parent and
child share the same TLS.
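
One possible interleaving can be replayed deterministically with a toy
model. This is a sketch under stated assumptions, not the ld.so code: it
assumes the resolver saves the old flag value and sets the flag on entry,
and on exit restores the SSE registers only if the flag has been cleared in
the meantime, then puts the old value back. All names (shared_tls,
model_enter, model_leave) are hypothetical.

```c
#include <assert.h>

/* Toy model: one flag shared by two contexts, as happens when CLONE_VM
   is used without a new TLS.  All names are hypothetical.  */
struct shared_tls { int must_xmm_save; };
struct trace { int saves; int restores; };

/* Enter the resolver: remember the old flag value and set the flag.  */
static int
model_enter (struct shared_tls *tls)
{
  int old = tls->must_xmm_save;
  tls->must_xmm_save = 1;
  return old;
}

/* Leave the resolver: restore only if the flag was cleared, which in
   this model is supposed to mean that a save happened in between.  */
static void
model_leave (struct shared_tls *tls, int old, struct trace *t)
{
  if (tls->must_xmm_save == 0)
    t->restores++;
  tls->must_xmm_save = old;
}

/* Interleave contexts A and B.  Returns 1 if a restore happened even
   though no save was ever performed.  */
int
restore_without_save (void)
{
  struct shared_tls tls = { 0 };
  struct trace t = { 0, 0 };

  int old_a = model_enter (&tls);   /* A: old = 0, flag = 1 */
  int old_b = model_enter (&tls);   /* B: old = 1, flag = 1 */
  model_leave (&tls, old_a, &t);    /* A: flag is 1, no restore; flag = 0 */
  model_leave (&tls, old_b, &t);    /* B: flag is 0, restore fires */

  /* Neither context ever performed a save in this replay.  */
  assert (t.saves == 0);
  return t.restores == 1 && t.saves == 0;
}
```

In this model the second context sees the flag value written back by the
first one and takes the restore path without any preceding save, which is
consistent with the backtrace in comment 2.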
Comment 4 H.J. Lu 2010-01-27 04:09:01 UTC
We can put a wrapper for clone in nptl. If clone is
called with CLONE_VM, we mark the ld.so TLS as shared.
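
A rough sketch of the idea, outside of glibc: tls_maybe_shared,
clone_shares_tls and clone_noting_tls are hypothetical names, and treating
"CLONE_VM without CLONE_SETTLS" as the condition is an assumption added
here, not something decided in this report.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical flag that ld.so could consult before relying on TLS.  */
static int tls_maybe_shared;

/* Nonzero if these clone flags share the address space without
   giving the child its own TLS.  */
int
clone_shares_tls (unsigned long flags)
{
  return (flags & CLONE_VM) != 0 && (flags & CLONE_SETTLS) == 0;
}

/* Hypothetical wrapper: record the sharing, then call the real clone.  */
int
clone_noting_tls (int (*fn) (void *), void *stack_top, int flags, void *arg)
{
  assert (fn != NULL && stack_top != NULL);
  if (clone_shares_tls ((unsigned long) flags))
    tls_maybe_shared = 1;
  return clone (fn, stack_top, flags, arg);
}
```

With the flags used by gdbserver, clone_shares_tls (CLONE_VM | SIGCHLD)
is nonzero, so the wrapper would set the mark before the child starts.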
Comment 5 Jakub Jelinek 2010-01-27 06:50:02 UTC
IMNSHO this is a gdb bug, there is no point hacking up something in this ld.so
case when many other things break equally horribly when using CLONE_VM without
cloning TLS - everything that uses __thread or other thread local area fields is
broken in that case.  While the kernel supports all flags for clone, glibc
supports only a limited subset of the combinations.
Comment 6 H.J. Lu 2010-01-27 13:02:43 UTC
CLONE_VM is used by many applications. What do they have to do
to clone TLS?
Comment 7 H.J. Lu 2010-01-27 13:55:20 UTC
(In reply to comment #5)
> IMNSHO this is a gdb bug, there is no point hacking up something in this ld.so
> case when many other things break equally horribly when using CLONE_VM without
> cloning TLS - everything that uses __thread or other thread local area fields is
> broken in that case.  While the kernel supports all flags for clone, glibc
> supports only a limited subset of the combinations.

Those applications don't use TLS and ld.so uses TLS behind their
back. Shouldn't ld.so use TLS only if libpthread is used?
Comment 8 Jan Kratochvil 2010-01-27 22:14:23 UTC
Posted for GDB:
http://sourceware.org/ml/gdb-patches/2010-01/msg00599.html
Comment 9 Jan Kratochvil 2010-02-01 20:24:00 UTC
For GDB it is now checked in FSF GDB:
http://sourceware.org/ml/gdb-patches/2010-02/msg00028.html
Comment 10 Ulrich Drepper 2010-04-04 09:16:09 UTC
Not a glibc problem.
Comment 11 H.J. Lu 2012-01-25 22:42:24 UTC
nptl/tst-getpid2.c has

---
#define TEST_CLONE_FLAGS CLONE_VM
#include "tst-getpid1.c"
---

which calls clone with CLONE_VM and causes nptl/tst-getpid2.c to fail at
random with "illegal hardware instruction".