This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Tests that use clone directly race against SSE register save/restore.
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: "Carlos O'Donell" <carlos at redhat dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>, Arjun Shankar <arjun at redhat dot com>, Roland McGrath <roland at hack dot frob dot com>
- Date: Mon, 20 Jul 2015 11:27:01 -0700
- Subject: Re: Tests that use clone directly race against SSE register save/restore.
- Authentication-results: sourceware.org; auth=none
- References: <55AD37CA dot 9040508 at redhat dot com>
On Mon, Jul 20, 2015 at 11:02 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> H.J.,
>
> On some systems we see random failures in tst-getpid1. Arjun Shankar
> reported this, and I took a quick look and found some problems with our
> tests and the use of TLS with RTLD_*CALL.
>
> The test itself is interesting because it uses clone to create
> a second thread via CLONE_VM which means that on x86_64 we have
> the same $fs for both concurrently running threads.
>
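For illustration only, a minimal sketch of the TLS sharing described above,
assuming x86_64 Linux with glibc. The names tls_flag, child_fn and
tls_flag_after_clone are hypothetical and are not taken from tst-getpid1;
tls_flag merely stands in for a per-thread field such as
header.rtld_must_xmm_save. The child is created with CLONE_VM but without
CLONE_SETTLS, so it should keep the parent's thread pointer. It deliberately
calls no library function, so no symbol lookup happens in the child.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical stand-in for a per-thread field such as
   header.rtld_must_xmm_save.  */
static __thread volatile int tls_flag;

/* Runs in the cloned child.  It only writes to TLS and returns.  */
static int
child_fn (void *arg)
{
  (void) arg;
  tls_flag = 1;
  return 0;
}

/* Returns the parent's view of tls_flag after the child has exited,
   or -1 if clone or waitpid failed.  */
int
tls_flag_after_clone (void)
{
  const size_t stack_size = 64 * 1024;
  char *stack = malloc (stack_size);
  if (stack == NULL)
    return -1;

  tls_flag = 0;

  /* CLONE_VM without CLONE_SETTLS: the child shares the address space
     and inherits the parent's thread pointer.  The stack grows down on
     x86_64, so the top of the buffer is passed.  */
  pid_t pid = clone (child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
  if (pid == -1)
    {
      free (stack);
      return -1;
    }

  int status = 0;
  pid_t waited = waitpid (pid, &status, 0);
  free (stack);
  if (waited != pid)
    return -1;

  /* The child is expected to have exited normally with status 0.  */
  assert (WIFEXITED (status) && WEXITSTATUS (status) == 0);
  return tls_flag;
}
```

Under those assumptions, tls_flag_after_clone () should return 1: the store
made by the child lands in the parent's TLS block, which is the situation
the quoted paragraph describes for $fs.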
> Then the dynamic loader attempts to use TLS header.rtld_must_xmm_save
> to decide if a save/restore of the SSE/AVX/AVX512 registers is
> required. That state is now effectively global, though, and shared by both
> racing threads, which write to and read from that location as they process
> symbol lookups.
>
> The fact that both threads might write and read to the same memory
> makes this a data race and is undefined behaviour. Is the test faulty
> or should the loader implementation have used atomic operations to
> write to thread data?
>
> An example ordering that causes problems on non-AVX-enabled hardware:
>
> T1:
> # define RTLD_ENABLE_FOREIGN_CALL \
>   int old_rtld_must_xmm_save = THREAD_GETMEM (THREAD_SELF, \
>                                               header.rtld_must_xmm_save); \
>   THREAD_SETMEM (THREAD_SELF, header.rtld_must_xmm_save, 1)
>
> T2:
> # define RTLD_ENABLE_FOREIGN_CALL \
>   int old_rtld_must_xmm_save = THREAD_GETMEM (THREAD_SELF, \
>                                               header.rtld_must_xmm_save); \
>   THREAD_SETMEM (THREAD_SELF, header.rtld_must_xmm_save, 1)
>
> fs:header.rtld_must_xmm_save == 1
>
> T2:
>
> result = _dl_lookup_symbol_x (strtab + sym->st_name, l, &sym, l->l_scope,
>                               version, ELF_RTYPE_CLASS_PLT, flags, NULL);
>
> # define RTLD_PREPARE_FOREIGN_CALL \
>   do if (THREAD_GETMEM (THREAD_SELF, header.rtld_must_xmm_save)) \
>     { \
>       _dl_x86_64_save_sse (); \
>       THREAD_SETMEM (THREAD_SELF, header.rtld_must_xmm_save, 0); \
>     } \
>   while (0)
>
> fs:header.rtld_must_xmm_save == 0
> have_avx is initialized on this thread, but not yet visible to T1.
>
> # define RTLD_FINALIZE_FOREIGN_CALL \
>   do { \
>     if (THREAD_GETMEM (THREAD_SELF, header.rtld_must_xmm_save) == 0) \
>       _dl_x86_64_restore_sse (); \
>     THREAD_SETMEM (THREAD_SELF, header.rtld_must_xmm_save, \
>                    old_rtld_must_xmm_save); \
>   } while (0)
> # endif
>
> T1:
>
> Despite never having called RTLD_PREPARE_FOREIGN_CALL, we reach here in T1
> with header.rtld_must_xmm_save == 0, and with T2's initialization of
> have_avx not yet visible to T1.
>
> # define RTLD_FINALIZE_FOREIGN_CALL \
>   do { \
>     if (THREAD_GETMEM (THREAD_SELF, header.rtld_must_xmm_save) == 0) \
>       _dl_x86_64_restore_sse (); \
>     THREAD_SETMEM (THREAD_SELF, header.rtld_must_xmm_save, \
>                    old_rtld_must_xmm_save); \
>   } while (0)
> # endif
>
> This results in a SIGILL, as T1 sees an uninitialized have_avx and attempts to
> issue AVX restore instructions that the hardware doesn't support.
>
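The ordering hypothesized above can be replayed deterministically in a small
single-threaded model. This is only a sketch under stated assumptions: struct
tcb_model, struct call_ctx and replay_interleaving are hypothetical names, the
three helpers imitate the quoted RTLD_*_FOREIGN_CALL macros rather than
reproduce glibc code, and the SSE save/restore calls are replaced by
bookkeeping flags. It does not model have_avx or memory visibility, only the
shared flag.

```c
#include <stdbool.h>

/* Hypothetical model of the thread control block reached through $fs.  */
struct tcb_model
{
  int rtld_must_xmm_save;   /* Models header.rtld_must_xmm_save.  */
};

/* Per-call state, like the macro-local old_rtld_must_xmm_save.  */
struct call_ctx
{
  int old_must_save;
  bool saved;      /* This caller ran the save step.  */
  bool restored;   /* This caller ran the restore step.  */
};

/* Imitates RTLD_ENABLE_FOREIGN_CALL.  */
static void
enable_call (struct tcb_model *tcb, struct call_ctx *c)
{
  c->old_must_save = tcb->rtld_must_xmm_save;
  c->saved = false;
  c->restored = false;
  tcb->rtld_must_xmm_save = 1;
}

/* Imitates RTLD_PREPARE_FOREIGN_CALL.  */
static void
prepare_call (struct tcb_model *tcb, struct call_ctx *c)
{
  if (tcb->rtld_must_xmm_save)
    {
      c->saved = true;   /* Stands in for _dl_x86_64_save_sse ().  */
      tcb->rtld_must_xmm_save = 0;
    }
}

/* Imitates RTLD_FINALIZE_FOREIGN_CALL.  */
static void
finalize_call (struct tcb_model *tcb, struct call_ctx *c)
{
  if (tcb->rtld_must_xmm_save == 0)
    c->restored = true;  /* Stands in for _dl_x86_64_restore_sse ().  */
  tcb->rtld_must_xmm_save = c->old_must_save;
}

/* Replays: T1 enable, T2 enable, T2 prepare, T1 finalize, T2 finalize.
   Returns true if T1 restores registers it never saved.  Passing the
   same pointer twice models two threads sharing one $fs.  */
bool
replay_interleaving (struct tcb_model *t1_tcb, struct tcb_model *t2_tcb)
{
  struct call_ctx t1, t2;

  enable_call (t1_tcb, &t1);
  enable_call (t2_tcb, &t2);
  prepare_call (t2_tcb, &t2);    /* T2 clears the flag.  */
  finalize_call (t1_tcb, &t1);   /* T1 reads whatever its TCB now holds.  */
  finalize_call (t2_tcb, &t2);

  return t1.restored && !t1.saved;
}
```

With one struct tcb_model passed for both arguments, T1 takes the restore
path without a prior save; with two separate objects it does not. That is
consistent with the failure appearing only in the CLONE_VM case.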
> How do we fix this? Atomic accesses to have_avx and header.rtld_must_xmm_save?
> What else isn't safe with two threads using the same memory?
>
> It feels to me like the test case is invalid and should not attempt to clone a
> thread that glibc doesn't know about. However, this is apparently common in some
> low-level tools, so we may wish to try to continue supporting what this test
> case does.
>
> Now, keep in mind the above is merely a hypothesis, but the SIGILLs are real:
>
> [root@intel-d3c4702-01 ~]# ./a.out
> new thread: 10435
> new thread: 10435
> pid = 10435
> Illegal instruction
>
> And reproducible, but only in this CLONE_VM case.
See:
https://sourceware.org/bugzilla/show_bug.cgi?id=11214
--
H.J.