Hi,
One of our application was getting terminated with the following error
"Inconsistency detected by ld.so: dl-open.c: 610: _dl_open: Assertion
`_dl_debug_initialize (0, args.nsid)->r_state == RT_CONSISTENT' failed!
with glibc-2.4.31. This race seems to be present in the libc I got from
the CVS [at code inspection]. We were able to reproduce this
consistently after running for 4-5hrs.
Upon debugging we found that it is due to a race between two threads
doing a _dl_open().
The scenario is something like this :
In elf/dl-open.c, _dl_open:
/* Make sure we are alone. */
__rtld_lock_lock_recursive (GL(dl_load_lock));
[...]
int errcode = _dl_catch_error (&objname, &errstring, &malloced,
dl_open_worker, &args);
#ifndef MAP_COPY
/* We must munmap() the cache file. */
_dl_unload_cache ();
#endif
/* Release the lock. */
__rtld_lock_unlock_recursive (GL(dl_load_lock));
^^^^^ This would kick any other thread waiting on the lock.
if (__builtin_expect (errstring != NULL, 0))
{
[...]
}
assert (_dl_debug_initialize (0, args.nsid)->r_state == RT_CONSISTENT);
And, if the thread which gets woken up is playing with the same
namespace, and sets the r_state to RT_ADD in _dl_map_object_from_fd even
before we reach here (truly possible in an SMP system), ( due to
getting scheduled out ), we would hit the assert !
In our case we found that the r_state was RT_ADD and which was set by
another thread.
So, it is not safe to believe that the r_state won't get changed once we
release the lock.
We have tried the following solution for this issue, and it works fine.
Solution: Ensure the consistency in case of a successful dl_open_worker,
before releasing the lock ! Our solution ignored the case where
dl_open_worker may fail.
Attached patch addresses all the issues and we think this may be a
complete solution for the problem.
1) For successful load by dl_open_worker or a failure where we don't
have to remove any object from memory ( i.e, we don't have to do a
_dl_close to ensure the consistency ), the assertion is performed before
unlocking.
2) For the latter case, where we have to do a _dl_close(), the assertion
is done just after the _dl_close.
Comments ?
Thanks,
Suzuki K P <suzuki@in.ibm.com>
Linux Technology Center,
IBM Systems & Technology Labs.
------------------------------------------------------------------------
Index: glibc-2.4/elf/dl-open.c
===================================================================
--- glibc-2.4.orig/elf/dl-open.c 2006-10-03 11:10:22.000000000 -0700
+++ glibc-2.4/elf/dl-open.c 2006-10-10 15:28:33.000000000 -0700
@@ -554,6 +554,16 @@
_dl_unload_cache ();
#endif
+ /* Check the consistency state, for a successful load, before we release
+ the lock. Somebody else might come up and modify the states once we
+ release this lock. In case of the failures, where we need a _dl_close,
+ consistency is achieved only after the _dl_close. So for that case
+ we delay the check until we finish _dl_close. */
+ if (__builtin_expect (errstring == NULL, 1)
+ || args.map == NULL)
+ assert (_dl_debug_initialize (0, args.nsid)->r_state == RT_CONSISTENT);
+
+
/* Release the lock. */
__rtld_lock_unlock_recursive (GL(dl_load_lock));
@@ -564,7 +574,8 @@
size_t len_errstring;
/* Remove the object from memory. It may be in an inconsistent
- state if relocation failed, for example. */
+ state if relocation failed, for example. Enusre the conistent
+ state. */
if (args.map)
{
#ifdef USE_TLS
@@ -580,6 +591,7 @@
#endif
_dl_close (args.map);
+ assert (_dl_debug_initialize (0, args.nsid)->r_state == RT_CONSISTENT);
}
/* Make a local copy of the error string so that we can release the
@@ -601,13 +613,9 @@
if (malloced)
free ((char *) errstring);
- assert (_dl_debug_initialize (0, args.nsid)->r_state == RT_CONSISTENT);
-
/* Reraise the error. */
_dl_signal_error (errcode, objname, NULL, local_errstring);
- }
-
- assert (_dl_debug_initialize (0, args.nsid)->r_state == RT_CONSISTENT);
+ }
#ifndef SHARED
DL_STATIC_INIT (args.map);