This is the mail archive of the libc-hacker@sourceware.cygnus.com mailing list for the glibc project.


gdb and linuxthreads (A deadlock in linuxthreads.)


> 
> Hello,
> 
> I agree there is something wrong in the way the thread manager handles
> terminated threads, but I'm not sure I follow your interpretation of
> the bug, nor your patch.
> 
> > The manager is waiting in the loop for
> > 	a. dead children.
> > 	b. request from children.
> > Now
> > 1. At some instance, a child exits.
> > 2. The manager wakes up in the loop and finds a dead child. It calls
> > pthread_reap_children () which in turn calls pthread_exited () which
> > calls __pthread_lock (). In __pthread_lock (), there is
> > 	if (oldstatus != 0) suspend(self);
> > This time "oldstatus" is 1 and suspend(self) is called. Now the manager
> > thinks it has nothing to do and suspends itself. At the same time,
> > another child sends a REQ_CREATE message to the manager and then calls
> > suspend(self). Now both the manager and the child have called
> > suspend(self), and we get a deadlock.
> 
> I don't see it.  The manager suspends itself because there is another
> thread that currently holds the lock for the terminated thread
> (i.e. a third thread doing a join or detach on the thread that has
> just terminated).  However, that third thread is going to release the
> lock eventually.  I mean, it cannot be the same thread that has just
> sent a request to the manager and is suspended.  If it holds the lock,

I have verified that there was no third thread at all. There were only
2 threads: the manager and the thread that had just sent a request to
the manager. It may be a race condition that can only happen on an SMP
machine.


> it's not suspended.  Releasing the lock will restart the manager,
> which then will process the request and restart the requesting thread.
> 
> BUT: there is something very wrong in using __pthread_sig_restart to
> signal dead children, because that signal is also used to restart the
> manager thread when it's waiting on a fastlock.  So, a child that dies
> while the manager is waiting for a fastlock will restart the manager
> prematurely.  In itself that wouldn't be incorrect, just inefficient.
> But the fastlock mechanism relies crucially on __pthread_sig_restart
> being blocked at all times except when waiting on a fastlock.
> (Otherwise, we get a race condition that leads to lost wakeups.)
> And this is not the case in the thread manager, since the "dead
> children" signal must remain unblocked.  I strongly suspect the
> manager deadlock you've observed is due to such a lost wakeup.
> I totally missed this point when turning the spinlocks used in the
> original implementation into fast locks.

That is very possible.
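
For reference, the invariant described above (keep the restart signal
blocked except while actually waiting) can be sketched as below. This
is only a minimal illustration, not the linuxthreads source: SIGUSR1
stands in for __pthread_sig_restart, and the names restart_init,
restart_wait and demo_early_wakeup are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler when the (stand-in) restart signal is delivered. */
static volatile sig_atomic_t restart_seen = 0;

static void restart_handler(int sig)
{
    (void)sig;
    restart_seen = 1;
}

/* Install the handler and block the signal in the calling thread.
   Returns 0 on success, -1 on failure. */
int restart_init(int sig)
{
    struct sigaction sa;
    sigset_t block;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = restart_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(sig, &sa, NULL) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, sig);
    return sigprocmask(SIG_BLOCK, &block, NULL);
}

/* Wait for the restart signal.  The signal is unblocked only for the
   duration of sigsuspend(), so it cannot be delivered between the test
   of restart_seen and the start of the wait. */
void restart_wait(int sig)
{
    sigset_t wait_mask;

    sigprocmask(SIG_SETMASK, NULL, &wait_mask);  /* read current mask */
    sigdelset(&wait_mask, sig);
    while (!restart_seen)
        sigsuspend(&wait_mask);
    restart_seen = 0;
}

/* Send the wakeup BEFORE waiting.  Returns 1 if the wakeup was kept
   pending and then consumed by restart_wait(), 0 otherwise. */
int demo_early_wakeup(void)
{
    if (restart_init(SIGUSR1) != 0)
        return 0;
    raise(SIGUSR1);        /* stays pending: the signal is blocked */
    if (restart_seen)
        return 0;          /* handler must not have run yet */
    restart_wait(SIGUSR1); /* delivered inside sigsuspend() */
    return 1;
}
```

If the signal were left unblocked, as it is in the manager, the handler
could run between the test of restart_seen and the call to sigsuspend(),
and the wait would then never be woken. That is the lost wakeup.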

> 
> The problem would be easy to fix... if it weren't for the gdb
> interface.  The right thing to do is have the children send
> __pthread_sig_cancel instead of __pthread_sig_restart when they die.
> Then, __pthread_manager_sighandler is called from the handler for
> __pthread_sig_cancel.  __pthread_sig_restart remains blocked by
> default in the manager thread just like in any other thread.
> 
> The problem is that the gdb interface uses a __pthread_sig_cancel sent
> to the thread manager as an indication that a new thread is created.
> (See the processing of REQ_DEBUG.)  I really don't know what happens
> if we send a __pthread_sig_cancel also when a thread dies.  Have you
> looked at the OpenGroup patches to gdb?  Not being too familiar with
> gdb myself, I don't fully understand what's going on.

Could you please get my gdb 4.17.0.6 from

ftp://ftp.kernel.org/pub/linux/devel/gcc

It has the OpenGroup linuxthreads patch, but it only works with
glibc 2.0. I am enclosing a patch against 4.17.0.6 that lets it
compile under glibc 2.1, but I ran into a problem. Ulrich, could
you please tell me why you added CLONE_PTRACE to the __clone
call? I think that is one thing that breaks gdb.

> 
> I'll try to send you tomorrow patches that implement the approach
> above (use __pthread_sig_cancel to signal dead children).  Then maybe
> you could test them on a multiprocessor and see whether the problem
> with ex6.c remains.
> 

Thanks. I will try. BTW, I only see this deadlock about once in
40 tries.

Thanks.


-- 
H.J. Lu (hjl@gnu.org)
----
Index: lnx-thread.c
===================================================================
RCS file: /home/work/cvs/gnu/gdb/gdb/lnx-thread.c,v
retrieving revision 1.2
diff -u -p -r1.2 lnx-thread.c
--- lnx-thread.c	1998/12/03 01:07:42	1.2
+++ lnx-thread.c	1998/12/19 18:48:53
@@ -403,6 +403,8 @@ stop_thread (pid)
 	  printf_unfiltered ("[New %s]\n", target_pid_to_str (pid));
 	add_thread (pid);
       }
+    else
+      perror_with_name ("ptrace in stop_thread");
 }
 
 /* Wait for a thread */
@@ -641,6 +643,19 @@ linuxthreads_new_objfile (objfile)
 			  "__pthread_sig_cancel");
       return;
     }
+
+#ifdef SIGRTMIN
+  if (!linuxthreads_sig_restart && !linuxthreads_sig_cancel)
+    {
+      linuxthreads_sig_restart = __libc_allocate_rtsig (1);
+      linuxthreads_sig_cancel = __libc_allocate_rtsig (1);
+      if (linuxthreads_sig_restart < 0 || linuxthreads_sig_cancel < 0)
+	{
+	  linuxthreads_sig_restart = LINUXTHREAD_SIG_EXIT;
+	  linuxthreads_sig_cancel = LINUXTHREAD_SIG_CANCEL;
+	}
+    }
+#endif
 
   if ((ms = lookup_minimal_symbol ("__pthread_threads_max",
 				   NULL, objfile)) == NULL
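
The second hunk above falls back to fixed signal numbers when no
real-time signals can be allocated. A rough sketch of the same idea
using only public interfaces follows. It is an illustration under
stated assumptions, not what gdb or glibc do internally: it uses
SIGRTMIN/SIGRTMAX rather than the internal __libc_allocate_rtsig, it
assumes SIGUSR1/SIGUSR2 as the fallback pair, and the names
thread_signals and pick_thread_signals are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>

/* Hypothetical pair of signal numbers for restart and cancel. */
struct thread_signals {
    int restart;
    int cancel;
};

/* Prefer two distinct real-time signals when the platform has them;
   otherwise fall back to SIGUSR1/SIGUSR2 (assumed fallback pair). */
struct thread_signals pick_thread_signals(void)
{
    struct thread_signals s;

#ifdef SIGRTMIN
    /* SIGRTMIN and SIGRTMAX may be run-time values, so check the range
       at run time rather than in the preprocessor. */
    if (SIGRTMIN + 1 <= SIGRTMAX) {
        s.restart = SIGRTMIN;
        s.cancel = SIGRTMIN + 1;
        return s;
    }
#endif
    s.restart = SIGUSR1;
    s.cancel = SIGUSR2;
    return s;
}
```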

