This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.



Possible hang in pthread_cond_broadcast()


Hi,

After upgrading from glibc 2.23 to 2.26, we've been seeing what looks like a hang inside pthread_cond_broadcast calls in our application. Based on inspecting core dumps from the hung processes and the nptl source, I have some speculation about where the issue might be. Since I have no prior experience with this code, I'm hoping to get a sanity check on my understanding of the potential problem before jumping straight to filing a bug report.

Ideally, we'd like to have a standalone reproducer, but our efforts to develop one have so far been unsuccessful. The problem shows itself in our application only very rarely (roughly once every 15 to 24 hours) and only under very heavy load.

We are also using a version of glibc with Gentoo patches (glibc-2.26-r7), but the sections of code I'm looking at are identical to the current upstream master branch (unchanged since the condition variable rewrite in 2016). Our libc was built with gcc-7.3.0 and CFLAGS "-march=westmere -pipe -O2 -fno-strict-aliasing", and we're running on a 4.14.49 Linux kernel.



With all that said, here is what I'm observing:

At the time our application's hang detector kicks in and kills the process, these are always the top frames of the hung thread's stack:

#0  0x00007ff8ef32320c in futex_wait (private=<optimized out>, expected=2, futex_word=0x7ff8e7433a6c) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=2, futex_word=0x7ff8e7433a6c) at ../sysdeps/nptl/futex-internal.h:135
#2  __condvar_quiesce_and_switch_g1 (private=<optimized out>, g1index=<synthetic pointer>, wseq=<optimized out>, cond=0x7ff8e7433a58) at pthread_cond_common.c:412
#3  __pthread_cond_broadcast (cond=cond@entry=0x7ff8e7433a58) at pthread_cond_broadcast.c:73


And the actual condition variable state always looks something like this:

{__data = {{__wseq = 212413044, __wseq32 = {__low = 212413044, __high = 0}}, {__g1_start = 212413034, __g1_start32 = {__low = 212413034, __high = 0}}, __g_refs = {2, 0}, __g_size = {0, 0}, __g1_orig_size = 17, __wrefs = 10, __g_signals = {0, 5}}, 
  __size = "t*\251\f\000\000\000\000j*\251\f\000\000\000\000\002", '\000' <repeats 15 times>, "\021\000\000\000\n\000\000\000\000\000\000\000\005\000\000", __align = 212413044}


The hung thread is always blocked in a futex_wait call whose futex_word is the address of one of the elements of __g_refs, with expected=2. The actual value at futex_word is always 0.
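For what it's worth, my reading of the __g_refs encoding, inferred from how the code quoted below manipulates it, is that bit 0 is the wake-request flag and the remaining bits hold the reference count in units of 2. Decoded that way, expected=2 would mean one outstanding waiter reference with no wake request at the time of the futex_wait, and the current value of 0 would mean no references left and the flag still clear:

#include <stdio.h>

int main (void)
{
  /* My own decoding of the observed "expected" value, based on how the
     code quoted below uses __g_refs; not an authoritative description.  */
  unsigned int expected = 2;
  printf ("waiter references: %u\n", expected >> 1);  /* 1 */
  printf ("wake-request flag: %u\n", expected & 1);   /* 0 */
  return 0;
}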



And here is what I think might be happening:

In this loop within __condvar_quiesce_and_switch_g1, the broadcasting thread is supposed to be waiting for all the existing waiters to leave g1:

  while ((r >> 1) > 0)
    {
      for (unsigned int spin = maxspin; ((r >> 1) > 0) && (spin > 0); spin--)
	{
	  /* TODO Back off.  */
	  r = atomic_load_relaxed (cond->__data.__g_refs + g1);
	}
      if ((r >> 1) > 0)
	{
	  /* There is still a waiter after spinning.  Set the wake-request
	     flag and block.  Relaxed MO is fine because this is just about
	     this futex word.  */
	  r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1);

	  if ((r >> 1) > 0)
	    futex_wait_simple (cond->__data.__g_refs + g1, r, private);
	  /* Reload here so we eventually see the most recent value even if we
	     do not spin.   */
	  r = atomic_load_relaxed (cond->__data.__g_refs + g1);
	}
    }


In a normal situation where there is still a waiter holding a reference, I think the flow could go something like this:

1. The atomic-or reads the value 2 and writes 3.
2. futex_wait_simple is called with expected=2 and returns immediately, because the actual value of 3 does not match the expected value.
3. (There is still a waiter, so the loop repeats.)
4. The atomic-or reads the value 3 and writes 3.
5. futex_wait_simple is called with expected=3 and blocks.
6. At some point, the last waiter releases its reference with __condvar_dec_grefs, which sees the value of 3 for the reference count, unsets the wake-request flag, and calls futex_wake to unblock the broadcasting thread:

static void
__condvar_dec_grefs (pthread_cond_t *cond, unsigned int g, int private)
{
  /* Release MO to synchronize-with the acquire load in
     __condvar_quiesce_and_switch_g1.  */
  if (atomic_fetch_add_release (cond->__data.__g_refs + g, -2) == 3)
    {
      /* Clear the wake-up request flag before waking up.  We do not need more
	 than relaxed MO and it doesn't matter if we apply this for an aliased
	 group because we wake all futex waiters right after clearing the
	 flag.  */
      atomic_fetch_and_relaxed (cond->__data.__g_refs + g, ~(unsigned int) 1);
      futex_wake (cond->__data.__g_refs + g, INT_MAX, private);
    }
}


In my situation, the surprising part is that futex_wait ends up blocking on a value that does not have the wake-request flag set. This suggests that something cleared the wake-request flag between the broadcasting thread's atomic-or and its futex_wait call. The only place I can see the wake-request flag being cleared is __condvar_dec_grefs.

The most reasonable explanation I've been able to come up with is this: the last waiter from the _previous_ time this group was g1 decremented the reference count in __condvar_dec_grefs and entered the if-block, but was preempted before it could clear the wake-request flag. Because the reference count had already been decremented, that broadcast was able to complete. Another broadcast then happened, swapping the active group and making the original group g1 again. During the broadcast after that, the old waiter thread resumed and cleared the wake-request flag right between the broadcaster reading the value 2 from the reference count and its call to futex_wait, so the futex_wait blocked. The last legitimate waiter in g1 at that point would then not call futex_wake when releasing its reference, since the wake-request flag was no longer set, leaving the broadcasting thread blocked indefinitely.
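To make the value transitions in that final round explicit, here is a small single-threaded replay of the interleaving I have in mind (just an illustration of the ordering described above; it is not a reproducer and not glibc code):

#include <stdio.h>

int main (void)
{
  /* Stand-in for cond->__data.__g_refs[g1]: one legitimate waiter holds a
     reference (count 1, encoded as 2), wake-request flag (bit 0) clear.  */
  unsigned int g_refs = 2;

  /* Broadcaster: atomic_fetch_or_relaxed (..., 1) reads 2 and writes 3.  */
  unsigned int r = g_refs;
  g_refs |= 1;
  printf ("broadcaster fetch_or: read %u, __g_refs now %u\n", r, g_refs);

  /* Old waiter from the previous round resumes inside the if-block of
     __condvar_dec_grefs: it clears the wake-request flag and calls
     futex_wake, which has nobody to wake yet.  */
  g_refs &= ~(unsigned int) 1;
  printf ("old waiter clears flag: __g_refs now %u\n", g_refs);

  /* Broadcaster: futex_wait (..., expected = r = 2).  The word still holds
     2, so the broadcaster goes to sleep with the flag clear.  */
  printf ("broadcaster futex_wait: expected %u, actual %u -> %s\n",
          r, g_refs, r == g_refs ? "blocks" : "returns EAGAIN");

  /* Last legitimate waiter: atomic_fetch_add_release (..., -2) returns 2,
     not 3, so it never enters the if-block and never calls futex_wake.  */
  unsigned int prev = g_refs;
  g_refs -= 2;
  printf ("last waiter fetch_add(-2): returned %u, __g_refs now %u -> %s\n",
          prev, g_refs, prev == 3 ? "futex_wake" : "no futex_wake");

  /* End state: broadcaster asleep on a word whose value is 0, with
     expected = 2 and the flag clear, which matches the core dumps.  */
  return 0;
}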

I did manage to produce a scenario like this in a manual test involving multiple processes broadcasting and waiting on a condition variable in shared memory, using gdb to pause and resume the processes at the right places. The hang I observed in that experiment looked the same as the one our real system is hitting. Manually controlling the processes makes this less convincing than a true reproducer, but it does add some evidence that a hang in pthread_cond_broadcast is possible purely through unlucky timing.



My feeling is that the futex_wait call in __condvar_quiesce_and_switch_g1 should use as its expected value the one just written to __g_refs (i.e., with the wake-request flag set), rather than the one that was read. This would guarantee that if the futex_wait sleeps, the wake-request flag is set. If another thread comes in and clears the wake-request flag before the futex_wait call, the values will no longer match and futex_wait will simply return.
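Concretely, the change I have in mind would look something like this, relative to the loop quoted above (only lightly tested internally so far, so please treat it as a sketch):

  /* Set the wake-request flag and use the value just written (with the
     flag included) as the expected value, so that the futex_wait below
     only blocks if the flag is still set at that point.  */
  r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;

  if ((r >> 1) > 0)
    futex_wait_simple (cond->__data.__g_refs + g1, r, private);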


Please let me know if there are any obvious misunderstandings here, or if there is any extra information we could provide about the issue. We're going to try the proposed fix internally to see whether it makes a difference - initial tests look somewhat promising.

Thanks,
Martin

