This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.

Re: dead-lock in glibc

From: Torvald Riegel <triegel at redhat dot com>
To: Carlos O'Donell <carlos at systemhalted dot org>
Cc: jkraehemann-guest at users dot alioth dot debian dot org, "libc-help at sourceware dot org" <libc-help at sourceware dot org>
Date: Tue, 28 Mar 2017 10:48:49 +0200
Subject: Re: dead-lock in glibc

On Wed, 2017-03-15 at 21:54 -0400, Carlos O'Donell wrote:
> On Wed, Mar 15, 2017 at 4:35 PM, Joël Krähemann <jkraehemann@gmail.com> wrote:
> > * libc6 2.24-9
> 
> > Might be I was trying to do a recursive lock on a non-recursive mutex?
> > I was playing 64 beats with the notation editor of GSequencer in an infinite
> > loop. Suddenly it aborted after some playback, approximately 3 to 4 minutes.
> 
> No. The asserts are intended to indicate internal consistency is violated.
> 
> Recursively locking a non-recursive mutex should lead to the thread
> getting stuck forever, but not an assert.
> 
> >>> gsequencer: ../nptl/pthread_mutex_lock.c:349:
> >>> __pthread_mutex_lock_full: Assertion `INTERNAL_SYSCALL_ERRNO (e,
> >>> __err) != EDEADLK || (kind != PTHREAD_MUTEX_ERRORCHECK_NP && kind !=
> >>> PTHREAD_MUTEX_RECURSIVE_NP)' failed.
> >>> Aborted
> 
> We've had a failure in the futex syscall, but that should not by
> itself trigger an assert.
> 
> The failure was either "no thread found" or "deadlock".
> 
> The assert triggers when we get "deadlock" from the kernel but the
> mutex was error-checking or recursive. Internally we don't ever expect
> to get "deadlock" from the kernel for these kinds of mutexes, so
> getting it indicates an algorithmic problem.
> 
> It's an algorithmic problem because earlier code should have detected
> we owned the mutex in the recursive case, bumped the ownership
> counter, and returned zero.
> 
> It's an algorithmic problem because earlier code should have detected
> we owned the mutex in the error checking case, and should have
> returned EDEADLK without making any futex syscalls.
> 
> So we didn't own the mutex and an attempt to acquire it determined it
> was locked by someone else (not us), and then the kernel returned
> EDEADLK, which doesn't make sense because we didn't own it to begin
> with!
> 
> It points to a kernel or glibc issue with PI mutexes.

It may, but not necessarily.  For example, the load of __lock that
handles the recursive/error-checking case is a separate access from the
CAS, so something else may have changed __lock in the meantime (e.g., a
bug in the application).

A reproducer would be really helpful.  If we can't get this, we'd at
least need some information about the affected mutex: the kind of the
mutex, how it's used by the program, etc.
