This is the mail archive of the gdb-patches@sources.redhat.com mailing list for the GDB project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: [PATCH RFC] linux threads: prevent starvation


Hi Mark, 

Have you had a chance to look at this yet?  I am attaching a new
version of the diffs.  Only one line is different from the previous
diff, but it's an important bug fix over the original patch.

Michael

> OK, this deserves some explanation.
> 
> Problem: Debugging with GDB can cause some threads to starve
> in a multi-threaded Linux program.
> 
> Testcase: gdb/testsuite/gdb.threads/pthreads.exp
> FAIL: gdb.threads/pthreads.exp: Some threads didn't run.
> 
> Cause: In lin_lwp_wait, we call waitpid (-1, __WCLONE).
> If more than one LWP is currently stopped at a breakpoint,
> the highest-numbered one will be returned.  All the others
> will have their breakpoints pushed back.  If we then continue,
> and the highest-numbered LWP has time to hit a breakpoint
> again, no one else will have a chance to run.
> 
> Solution: Instead of always accepting the event for the
> first LWP returned by waitpid, use a monte carlo method
> to select among all LWPs that have breakpoint events.
> 
> Implementation: I've made use of the lwp list and saved status
> that Mark has already implemented.  I've saved the SIGTRAP events
> along with any other events accidentally fetched by stop_wait_callback,
> and I've added the event fetched by lin_lwp_wait to the same queue.
> Then I've added code that examines the list of fetched SIGTRAP events
> and chooses one at random, making that LWP the event LWP and returning
> its event to gdb.
> 
> This passes the testsuite (where the old method fails), and I've
> done extensive hand testing as well.  In fact, this has exposed
> a number of bugs in core gdb's thread handling, some of which I've
> already submitted patches for.  Others are still to come.  With
> all of these patches in place, this code seems to be quite robust.
> I can put a breakpoint in code that is shared by several threads,
> and keep them all stepping and nexting indefinitely.
>
2001-06-12  Michael Snyder  <msnyder@redhat.com>

	* lin-lwp.c: Prevent thread starvation by using a monte carlo 
	method to choose which of several event threads to handle next.

	(resume_callback): Disable code that attempts to handle
	step_resume breakpoints.  Let core gdb handle this.
	(stop_wait_callback): Defer pushback of breakpoint events until
	later; add SIGTRAP events to the queue of unhandled events.
	Keep calling waitpid until SIGSTOP retrieved.  If more than one
	non-SIGSTOP event is retrieved, push them back onto the process
	queue using kill.
	(count_events_callback, select_singlestep_lwp_callback, 
	select_event_lwp_callback, cancel_breakpoints_callback, 
	select_event_lwp): New functions.  Implement monte carlo method 
	for selecting which of several SIGTRAP threads to handle next.  
	Push back the breakpoint event for all threads other than the 
	selected one.
	(lin_lwp_wait): Call select_event_lwp to decide which of several
	sigtrapped lwps to handle next.
	
Index: lin-lwp.c
===================================================================
RCS file: /cvs/src/src/gdb/lin-lwp.c,v
retrieving revision 1.24
diff -c -3 -p -r1.24 lin-lwp.c
*** lin-lwp.c	2001/06/07 19:31:10	1.24
--- lin-lwp.c	2001/06/27 02:07:25
*************** resume_callback (struct lwp_info *lp, vo
*** 458,464 ****
      {
        struct thread_info *tp;
  
! #if 1
        /* FIXME: kettenis/2000-08-26: This should really be handled
           properly by core GDB.  */
  
--- 458,464 ----
      {
        struct thread_info *tp;
  
! #if 0
        /* FIXME: kettenis/2000-08-26: This should really be handled
           properly by core GDB.  */
  
*************** stop_wait_callback (struct lwp_info *lp,
*** 585,591 ****
        pid_t pid;
        int status;
  
-     get_another_event:
        gdb_assert (lp->status == 0);
  
        pid = waitpid (GET_LWP (lp->ptid), &status,
--- 585,590 ----
*************** stop_wait_callback (struct lwp_info *lp,
*** 619,631 ****
  	}
  
        gdb_assert (WIFSTOPPED (status));
-       lp->stopped = 1;
  
        if (WSTOPSIG (status) != SIGSTOP)
  	{
! 	  if (WSTOPSIG (status) == SIGTRAP
! 	      && breakpoint_inserted_here_p (read_pc_pid (pid_to_ptid (pid))
! 					     - DECR_PC_AFTER_BREAK))
  	    {
  	      /* If a LWP other than the LWP that we're reporting an
                   event for has hit a GDB breakpoint (as opposed to
--- 618,627 ----
  	}
  
        gdb_assert (WIFSTOPPED (status));
  
        if (WSTOPSIG (status) != SIGSTOP)
  	{
! 	  if (WSTOPSIG (status) == SIGTRAP)
  	    {
  	      /* If a LWP other than the LWP that we're reporting an
                   event for has hit a GDB breakpoint (as opposed to
*************** stop_wait_callback (struct lwp_info *lp,
*** 640,660 ****
  		 user will delete or disable the breakpoint, but the
  		 thread will have already tripped on it.  */
  
- 	      if (debug_lin_lwp)
- 		fprintf_unfiltered (gdb_stdlog,
- 				    "Tripped breakpoint at %lx in LWP %d"
- 				    " while waiting for SIGSTOP.\n",
- 				    (long) read_pc_pid (lp->ptid), pid);
- 
- 	      /* Set the PC to before the trap.  */
- 	      if (DECR_PC_AFTER_BREAK)
- 		write_pc_pid (read_pc_pid (pid_to_ptid (pid)) 
- 		                - DECR_PC_AFTER_BREAK,
- 			      pid_to_ptid (pid));
- 
  	      /* Now resume this LWP and get the SIGSTOP event. */
  	      ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
! 	      goto get_another_event;
  	    }
  	  else if (WSTOPSIG (status) == SIGINT &&
  		   signal_pass_state (SIGINT) == 0)
--- 636,657 ----
  		 user will delete or disable the breakpoint, but the
  		 thread will have already tripped on it.  */
  
  	      /* Now resume this LWP and get the SIGSTOP event. */
  	      ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
! 	      if (debug_lin_lwp)
! 		{
! 		  fprintf_unfiltered (gdb_stderr, 
! 				      "SWC: Candidate SIGTRAP event in %ld\n",
! 				      GET_LWP (lp->ptid));
! 		}
! 	      /* Hold the SIGTRAP for handling by lin_lwp_wait. */
! 	      stop_wait_callback (lp, data);
! 	      /* If there's another event, throw it back into the queue. */
! 	      if (lp->status)
! 		kill (GET_LWP (lp->ptid), WSTOPSIG (lp->status));
! 	      /* Save the sigtrap event. */
! 	      lp->status = status;
! 	      return 0;
  	    }
  	  else if (WSTOPSIG (status) == SIGINT &&
  		   signal_pass_state (SIGINT) == 0)
*************** stop_wait_callback (struct lwp_info *lp,
*** 664,690 ****
  		 just ignore all SIGINT events from all lwp's except for
  		 the one that was caught by lin_lwp_wait.  */
  
! 	      /* Now resume this LWP and get the SIGSTP event. */
  	      ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
! 	      goto get_another_event;
  	    }
  	  else
  	    {
  	      if (debug_lin_lwp)
! 		fprintf_unfiltered (gdb_stdlog, 
! 				    "Received %s in LWP %d while waiting for SIGSTOP.\n",
! 				    strsignal (WSTOPSIG (status)), pid);
  
! 	      /* The thread was stopped with a signal other than
! 		 SIGSTOP, and didn't accidentally trip a breakpoint.
! 		 Record the wait status.  */
! 	      lp->status = status;
  	    }
  	}
        else
  	{
  	  /* We caught the SIGSTOP that we intended to catch, so
               there's no SIGSTOP pending.  */
  	  lp->signalled = 0;
  	}
      }
--- 661,702 ----
  		 just ignore all SIGINT events from all lwp's except for
  		 the one that was caught by lin_lwp_wait.  */
  
! 	      /* Now resume this LWP and get the SIGSTOP event. */
  	      ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
! 	      return stop_wait_callback (lp, data);
  	    }
  	  else
  	    {
+ 	      /* The thread was stopped with a signal other than
+ 		 SIGSTOP, and didn't accidentally trip a breakpoint. */
+ 
  	      if (debug_lin_lwp)
! 		{
! 		  fprintf_unfiltered (gdb_stderr, 
! 				      "SWC: Pending event %d in %ld\n",
! 				      WSTOPSIG (status), GET_LWP (lp->ptid));
! 		}
! 	      /* Now resume this LWP and get the SIGSTOP event. */
! 	      ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
  
! 	      /* Hold this event/waitstatus while we check to see if
! 		 there are any more (we still want to get that SIGSTOP). */
! 	      stop_wait_callback (lp, data);
! 	      /* If the lp->status field is still empty, use it to hold
! 		 this event.  If not, then this event must be returned
! 		 to the event queue of the LWP.  */
! 	      if (lp->status == 0)
! 		lp->status = status;
! 	      else
! 		kill (GET_LWP (lp->ptid), WSTOPSIG (status));
! 	      return 0;
  	    }
  	}
        else
  	{
  	  /* We caught the SIGSTOP that we intended to catch, so
               there's no SIGSTOP pending.  */
+ 	  lp->stopped = 1;
  	  lp->signalled = 0;
  	}
      }
*************** running_callback (struct lwp_info *lp, v
*** 710,715 ****
--- 722,862 ----
    return (lp->stopped == 0);
  }
  
+ /* Count the LWP's that have had events. */
+ 
+ static int
+ count_events_callback (struct lwp_info *lp, void *data)
+ {
+   int *count = data;
+ 
+   /* Count only threads that have a SIGTRAP pending. */
+   if (lp->status != 0 &&
+       WIFSTOPPED (lp->status) && 
+       WSTOPSIG (lp->status) == SIGTRAP &&
+       count != NULL)	/* paranoia */
+     (*count)++;
+ 
+   return 0;
+ }
+ 
+ /* Select the LWP (if any) that is currently being single-stepped. */
+ 
+ static int
+ select_singlestep_lwp_callback (struct lwp_info *lp, void *data)
+ {
+   if (lp->step && lp->status != 0)
+     return 1;
+   else
+     return 0;
+ }
+ 
+ /* Select the Nth LWP that has had a SIGTRAP event. */
+ 
+ static int
+ select_event_lwp_callback (struct lwp_info *lp, void *data)
+ {
+   int *selector = data;
+ 
+   /* Select only threads that have a SIGTRAP event pending. */
+   if (lp->status != 0 &&
+       WIFSTOPPED (lp->status) &&
+       WSTOPSIG (lp->status) == SIGTRAP &&
+       selector != NULL)	/* paranoia */
+     if ((*selector)-- == 0)
+       return 1;
+ 
+   return 0;
+ }
+ 
+ static int
+ cancel_breakpoints_callback (struct lwp_info *lp, void *data)
+ {
+   struct lwp_info *event_lp = data;
+ 
+   if (lp != event_lp &&
+       lp->status != 0 &&
+       WIFSTOPPED (lp->status) &&
+       WSTOPSIG (lp->status) == SIGTRAP &&
+       breakpoint_inserted_here_p (read_pc_pid (lp->ptid) - 
+ 				  DECR_PC_AFTER_BREAK))
+     {
+       if (debug_lin_lwp)
+ 	{
+ 	  fprintf_unfiltered (gdb_stdlog, 
+ 			      "Push back BP for %ld\n",
+ 			      GET_LWP (lp->ptid));
+ 	}
+       /* For each lp except the event lp, if there was a trap,
+ 	 set the PC to before the trap. */
+       if (DECR_PC_AFTER_BREAK)
+ 	{
+ 	  write_pc_pid (read_pc_pid (lp->ptid) - DECR_PC_AFTER_BREAK, 
+ 			lp->ptid);
+ 	}
+       lp->status = 0;
+     }
+   return 0;
+ }
+ 
+ /* Select one LWP out of those that have events to be the event thread. */
+ 
+ static void
+ select_event_lwp (struct lwp_info **orig_lp, int *status)
+ {
+   int num_events = 0;
+   int random_selector;
+   struct lwp_info *event_lp;
+ 
+   /* Give preference to any LWP that is being single-stepped. */
+   event_lp = iterate_over_lwps (select_singlestep_lwp_callback, NULL);
+   if (event_lp != NULL)
+     {
+       if (debug_lin_lwp)
+ 	fprintf_unfiltered (gdb_stdlog, 
+ 			    "Select singlestep lwp %ld\n", 
+ 			    GET_LWP (event_lp->ptid));
+     }
+   else
+     {
+       /* No single-stepping LWP.  Select one at random, out of those
+ 	 which have had SIGTRAP events. */
+ 
+       /* First see how many SIGTRAP events we have. */
+       iterate_over_lwps (count_events_callback, (void *) &num_events);
+ 
+       /* OK, now randomly pick the Nth LWP of those that have had SIGTRAP. */
+       random_selector = (int) 
+ 	((num_events * (double) rand ()) / (RAND_MAX + 1.0));
+ 
+       if (debug_lin_lwp)
+ 	{
+ 	  if (num_events > 1)
+ 	    fprintf_unfiltered (gdb_stdlog, 
+ 				"Found %d SIGTRAP events, selecting #%d\n", 
+ 				num_events, random_selector);
+ 	  else if (num_events <= 0)
+ 	    fprintf_unfiltered (gdb_stdlog, 
+ 				"ERROR select_event_lwp %d events!\n", 
+ 				num_events);
+ 	}
+ 
+       event_lp =  iterate_over_lwps (select_event_lwp_callback, 
+ 				     (void *) &random_selector);
+     }
+ 
+   if (event_lp != NULL)
+     {
+       /* "event_lp" is now the selected event thread.
+ 	 If any other threads have had SIGTRAP events, these events
+ 	 must now be cancelled, so that the respective thread will
+ 	 trip the breakpoint again once it is resumed.  */
+       iterate_over_lwps (cancel_breakpoints_callback, (void *) event_lp);
+       *orig_lp = event_lp;
+       *status  = event_lp->status;
+       event_lp->status = 0;
+     }
+ }
+ 
  /* Return non-zero if LP has been resumed.  */
  
  static int
*************** lin_lwp_wait (ptid_t ptid, struct target
*** 741,757 ****
    /* First check if there is a LWP with a wait status pending.  */
    if (pid == -1)
      {
!       /* Any LWP will do.  */
        lp = iterate_over_lwps (status_callback, NULL);
        if (lp)
  	{
! 	  if (debug_lin_lwp)
  	    fprintf_unfiltered (gdb_stdlog, 
! 				"Using pending wait status for LWP %ld.\n",
  				GET_LWP (lp->ptid));
  
- 	  status = lp->status;
- 	  lp->status = 0;
  	}
  
        /* But if we don't fine one, we'll have to wait, and check both
--- 888,907 ----
    /* First check if there is a LWP with a wait status pending.  */
    if (pid == -1)
      {
!       /* Any LWP that's been resumed will do.  */
        lp = iterate_over_lwps (status_callback, NULL);
        if (lp)
  	{
! 	  status = lp->status;
! 	  lp->status = 0;
! 	  if (debug_lin_lwp && status)
  	    fprintf_unfiltered (gdb_stdlog, 
! 				"Using pending wait status %d for LWP %ld.\n",
! 				WIFSTOPPED (status) ? WSTOPSIG (status) : 
! 				WIFSIGNALED (status) ? WTERMSIG (status) :
! 				WEXITSTATUS (status), 
  				GET_LWP (lp->ptid));
  
  	}
  
        /* But if we don't fine one, we'll have to wait, and check both
*************** lin_lwp_wait (ptid_t ptid, struct target
*** 772,782 ****
        status = lp->status;
        lp->status = 0;
  
!       if (debug_lin_lwp)
! 	if (status)
! 	  fprintf_unfiltered (gdb_stdlog, 
! 			      "Using pending wait status for LWP %ld.\n",
! 			      GET_LWP (lp->ptid));
  
        /* If we have to wait, take into account whether PID is a cloned
           process or not.  And we have to convert it to something that
--- 922,934 ----
        status = lp->status;
        lp->status = 0;
  
!       if (debug_lin_lwp && status)
! 	fprintf_unfiltered (gdb_stdlog, 
! 			    "Using pending wait status %d for LWP %ld.\n",
! 			    WIFSTOPPED (status) ? WSTOPSIG (status) : 
!  			    WIFSIGNALED (status) ? WTERMSIG (status) :
!  			    WEXITSTATUS (status), 
! 			    GET_LWP (lp->ptid));
  
        /* If we have to wait, take into account whether PID is a cloned
           process or not.  And we have to convert it to something that
*************** lin_lwp_wait (ptid_t ptid, struct target
*** 947,952 ****
--- 1099,1111 ----
    /* This LWP is stopped now.  */
    lp->stopped = 1;
  
+   /* MVS: so save its event_status. */
+   lp->status = status;
+   if (debug_lin_lwp)
+     fprintf_unfiltered (gdb_stdlog, 
+ 			"LLW: Candidate event %d in %ld\n",
+ 			WSTOPSIG (status), GET_LWP (lp->ptid));
+ 
    /* Now stop all other LWP's ...  */
    iterate_over_lwps (stop_callback, NULL);
  
*************** lin_lwp_wait (ptid_t ptid, struct target
*** 954,964 ****
       longer running.  */
    iterate_over_lwps (stop_wait_callback, NULL);
  
    /* If we're not running in "threaded" mode, we'll report the bare
       process id.  */
  
    if (WIFSTOPPED (status) && WSTOPSIG (status) == SIGTRAP)
!     trap_ptid = (threaded ? lp->ptid : pid_to_ptid (GET_LWP (lp->ptid)));
    else
      trap_ptid = null_ptid;
  
--- 1113,1135 ----
       longer running.  */
    iterate_over_lwps (stop_wait_callback, NULL);
  
+   /* MVS Now choose an event thread from among those that
+      have had events.  Giving equal priority to all threads
+      that have had events helps prevent starvation. */
+ 
+   select_event_lwp (&lp, &status);
+ 
    /* If we're not running in "threaded" mode, we'll report the bare
       process id.  */
  
    if (WIFSTOPPED (status) && WSTOPSIG (status) == SIGTRAP)
!     {
!       trap_ptid = (threaded ? lp->ptid : pid_to_ptid (GET_LWP (lp->ptid)));
!       if (debug_lin_lwp)
! 	fprintf_unfiltered (gdb_stdlog, 
! 			    "LLW: trap_ptid is %ld\n",
! 			    GET_LWP (trap_ptid));
!     }
    else
      trap_ptid = null_ptid;
  

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]