This is the mail archive of the
gdb-patches@sources.redhat.com
mailing list for the GDB project.
Re: Race condition in sol-thread.c
- From: Michael Snyder <msnyder at redhat dot com>
- To: Hilfinger at otisco dot mckusick dot com
- Cc: brobecker at act-europe dot com, gdb-patches at sources dot redhat dot com
- Date: Thu, 29 Aug 2002 15:06:58 -0700
- Subject: Re: Race condition in sol-thread.c
- Organization: Red Hat, Inc.
- References: <200207051039.DAA19929@otisco.McKusick.COM>
"Paul N. Hilfinger" wrote:
>
> Mike,
>
> We have encountered difficulties with multithreaded programs on Solaris
> when an LWP terminates immediately after a 'continue'. Apparently, the
> problem occurs during a call to p_td_ta_map_id2thr (we can
> nondeterministically encounter either correct behavior, an infinite loop, or
> one of a couple of error messages, such as "no thread can be found to
> satisfy request"). I believe I know why and solicit your comments.
Paul, first of all, please forgive me for taking so long
to get back to you. I've been a little bit snowed under.
> The following code appears in sol_thread_wait:
>
> inferior_ptid = thread_to_lwp (inferior_ptid, PIDGET (main_ph.ptid));
> if (PIDGET (inferior_ptid) == -1)
> inferior_ptid = procfs_first_available ();
>
> if (PIDGET (ptid) != -1)
> {
> ptid_t save_ptid = ptid;
>
> ptid = thread_to_lwp (ptid, -2);
> if (PIDGET (ptid) == -2) /* Inactive thread */
> error ("This version of Solaris can't start inactive threads.");
> if (info_verbose && PIDGET (ptid) == -1)
> warning ("Specified thread %ld seems to have terminated",
> GET_THREAD (save_ptid));
> }
>
> rtnval = procfs_ops.to_wait (ptid, ourstatus);
>
> Now thread_to_lwp eventually calls p_td_ta_map_id2thr, which then
> eventually calls ps_pdread. The problem is (or appears to be) that
> these calls happen before the inferior is known to be stopped (given
> the placement of the to_wait call). Comments in sol-thread suggest
> that the thread_db library expects the inferior to be stopped during its
> queries, but that since GDB is supposed to guarantee this state, the
> routines ps_pstop, etc., are nops.
Yep -- you're right. Calling thread_to_lwp is potentially
(though not certainly) going to read memory, and we're not
allowed to read memory until the target is stopped. This is
a bug.
The weird thing is, it's a bug that's been in there since
version 1.1 of sol-thread.c, written back in 1996. I can't
say it's NEVER bitten anybody, but it must require a special
circumstance.
The problem is that the code you want to remove is there
for a reason. I actually added that code in 1997, and you're
probably right that I cribbed it from sol_thread_resume.
Its purpose was specifically to deal with a thread that had
exited. There's probably a more correct way to do it, but
removing it entirely isn't it.
Michael
> A little instrumentation inserted in sol-thread.c and procfs.c showed that,
> sure enough, as a result of the code above, the thread_db package makes
> a call to ps_pstop and then to ps_pdread before there is any attempt in
> procfs to stop the process. This suggests strongly that the precondition
> on ps_pdread is being violated.
>
> On examining the code, it seems as if all the lines quoted above, except
> for the last, are unnecessary (it is suspicious that they refer to STARTING
> a thread, and are a copy of lines from sol_thread_resume). I am not
> sure, because all these global variables confuse me. In any case, removing
> those lines fixed our problem and apparently introduced no regressions.
>
> There is definitely a race condition that needs fixed. What's YOUR take on
> the right approach?
>
> Paul Hilfinger
>
> ----------------------------------------------------------------------
>
> 2002-07-05 Paul N. Hilfinger <hilfingr@otisco.mckusick.com>
>
> * sol-thread.c (sol_thread_wait): Remove code (apparently duplicated
> from sol_thread_resume) that indirectly causes a race condition,
> because thread_to_lwp reads the inferior's memory while the
> inferior is not guaranteed to be stopped.
>
> Index: gdb5_0.27/gdb/sol-thread.c
> --- gdb5_0.27/gdb/sol-thread.c Thu, 24 Jan 2002 00:35:07 -0800 hilfingr (Gdb/S/50_sol-thread 1.1.1.1.1.1.1.2.1.1.1.3.1.2.1.1.1.2 644)
> +++ ada-gdb5-0-merge.117(w)/gdb/sol-thread.c Fri, 05 Jul 2002 01:34:00 -0700 hilfingr (Gdb/S/50_sol-thread 1.1.1.1.1.1.1.2.1.1.1.3.1.2.1.1.1.5 644)
> @@ -493,22 +504,6 @@ sol_thread_wait (ptid_t ptid, struct tar
>
> save_ptid = inferior_ptid;
> old_chain = save_inferior_ptid ();
> -
> - inferior_ptid = thread_to_lwp (inferior_ptid, PIDGET (main_ph.ptid));
> - if (PIDGET (inferior_ptid) == -1)
> - inferior_ptid = procfs_first_available ();
> -
> - if (PIDGET (ptid) != -1)
> - {
> - ptid_t save_ptid = ptid;
> -
> - ptid = thread_to_lwp (ptid, -2);
> - if (PIDGET (ptid) == -2) /* Inactive thread */
> - error ("This version of Solaris can't start inactive threads.");
> - if (info_verbose && PIDGET (ptid) == -1)
> - warning ("Specified thread %ld seems to have terminated",
> - GET_THREAD (save_ptid));
> - }
>
> rtnval = procfs_ops.to_wait (ptid, ourstatus);
>