This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
race condition in dl code
- From: Sripathi Kodi <sripathik at in dot ibm dot com>
- To: libc-alpha at sources dot redhat dot com
- Cc: dino at in dot ibm dot com
- Date: Wed, 6 Sep 2006 15:22:06 +0530
- Subject: race condition in dl code
- Reply-to: sripathik at in dot ibm dot com
Hi,
We are seeing a race condition in dl code resulting in our application
crashing with sigsegv. We have seen the problem only while running -rt
(CONFIG_PREEMPT_RT) kernels. We have seen the problem on glibc-2.4.90-17
from FC6 as well as latest glibc from cvs.
Backtrace of the application thread that crashes is as follows. It is a Java
based testcase.
(gdb) bt
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x4f14c546 in __nanosleep_nocancel () from /lib/libc.so.6
#2 0x4f14c37b in sleep () from /lib/libc.so.6
#3 0xb7e65724 in Hack_print_info_before_abort (whichCase=0xb7e71898 "CASE 1: not-SI_USER", signal=11, sigInfo=0xb7ee4e0c,
context=0xb7ee4e8c, portLibType=4, previousSignal=0x0) at j9signal.c:414
#4 0xb7e665dc in masterSynchSignalHandler (signal=11, sigInfo=0xb7ee4e0c, contextInfo=0xb7ee4e8c) at j9signal.c:1134
#5 <signal handler called>
#6 0x4ef694ad in do_lookup_x () from /lib/ld-linux.so.2
#7 0x4ef69883 in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
#8 0x4ef6d8c4 in _dl_fixup () from /lib/ld-linux.so.2
#9 0x4ef72cb0 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#10 0xb7ee9058 in thread_wrapper (arg=0x805bd5c) at j9thread.c:933
#11 0x4b359f9a in start_thread () from /lib/libpthread.so.0
#12 0x4f18abae in clone () from /lib/libc.so.6
Our analysis:
_dl_fixup passes link_map structure (l) as well as l->l_scope to
_dl_lookup_symbol_x. In the link_map structure passed to fixup, the field
l_scope_mem[4] is the statically allocated memory for the first few
'r_scope_elem's. When this gets filled up, more r_scope_elems are allocated
and attached at 'l_scope'. Memory pointed by l_scope too can get exhausted.
When this happens, this memory is doubled using realloc in this part of
dl_open_worker() function:
else
{
newp = (struct r_scope_elem **)
realloc (imap->l_scope,
new_size * sizeof (struct r_scope_elem *));
if (newp == NULL)
_dl_signal_error (ENOMEM, "dlopen", NULL,
N_("cannot create scope list"));
imap->l_scope = newp;
}
imap->l_scope_max = new_size;
}
So this reallocs the memory for l_scope (doubling it's size) and points
l_scope to the new area of memory. Old memory is now free for reuse by
malloc code.
Sequence of events happening in our testcase is:
i) Thread A calls _dl_lookup_symbol_x from _dl_fixup, passing it 'l' and
'l->l_scope' as parameters. It is preempted before _dl_lookup_symbol_x runs.
ii) Thread B does (possibly many) dlopen()s, it exhausts the memory pointed
to by l_scope. Hence dl_open_worker() reallocs this memory.
iii) Memory previously allocated to l_scope is reused. However, thread A
still has a reference to this area of memory through l_scope passed to
_dl_lookup_symbol_x.
iv) Thread A tries to access memory pointed to by it's copy of l_scope,
hence gets a SEGV.
Occasionally, when it doesn't get a SEGV, thread A fails to resolve the
symbol and prints out a message like "undefined symbol: printf, version
GLIBC_2.0"
We feel that l_scope will have to be protected with a lock (preferably
read-write lock) to solve this problem. The idea is to make the thread doing
dlopen to hold a write lock on l_scope before doing realloc. This will
prevent memory being freed from under the feet of reader threads doing
symbol lookup holding read lock on l_scope. I can send such a patch if the
idea is acceptable. If the idea of introducing a lock is not good, could you
please suggest any other way to solve this problem?
Thanks and regards,
Sripathi Kodi.