This is the mail archive of the
libc-help@sourceware.org
mailing list for the glibc project.
Re: dead-lock in glibc
- From: Torvald Riegel <triegel at redhat dot com>
- To: Carlos O'Donell <carlos at systemhalted dot org>
- Cc: jkraehemann-guest at users dot alioth dot debian dot org, "libc-help at sourceware dot org" <libc-help at sourceware dot org>
- Date: Tue, 28 Mar 2017 10:48:49 +0200
- Subject: Re: dead-lock in glibc
- References: <CA+Owze40Onq_uZs2wOjY=O5Xv3D75Ce_b7Sf5qEjMZ-bAnW_wA@mail.gmail.com> <CAE2sS1gXkrLAZf2o54QSkE_fqFMrSd987nP=QYRe=GQEdq26_w@mail.gmail.com> <CA+Owze6vtqJ4jURD2H4fouw5izePVaQ9iun2LCLQ+HqwVvkvWw@mail.gmail.com> <CAE2sS1iF1ua0w9379zm-nMToTxQfVJfTxa78uMgs6z=LEqy5GA@mail.gmail.com>
On Wed, 2017-03-15 at 21:54 -0400, Carlos O'Donell wrote:
> On Wed, Mar 15, 2017 at 4:35 PM, Joël Krähemann <jkraehemann@gmail.com> wrote:
> > * libc6 2.24-9
>
> > Might be I was trying to do a recursive lock on a non-recursive mutex?
> > I was playing 64 beats with the notation editor of GSequencer in an infinite
> > loop. Suddenly it aborted after some playback, approximately 3 to 4 minutes.
>
> No. The asserts are intended to indicate internal consistency is violated.
>
> Recursively locking a non-recursive mutex should lead to the thread
> getting stuck forever, but not an assert.
>
> >>> gsequencer: ../nptl/pthread_mutex_lock.c:349:
> >>> __pthread_mutex_lock_full: Assertion `INTERNAL_SYSCALL_ERRNO (e,
> >>> __err) != EDEADLK || (kind != PTHREAD_MUTEX_ERRORCHECK_NP && kind !=
> >>> PTHREAD_MUTEX_RECURSIVE_NP)' failed.
> >>> Aborted
>
> We've had a failure in the futex syscall, but that should not by
> itself trigger an assert.
>
> The failure was either "no thread found" or "deadlock".
>
> The assert triggers when we get "deadlock" from the kernel but the
> mutex was error-checking or recursive. Internally we don't ever expect
> to get "deadlock" from the kernel for these kinds of mutexes, so
> getting it indicates an algorithmic problem.
>
> It's an algorithmic problem because earlier code should have detected
> we owned the mutex in the recursive case, bumped the ownership
> counter, and returned zero.
>
> It's an algorithmic problem because earlier code should have detected
> we owned the mutex in the error checking case, and should have
> returned EDEADLK without making any futex syscalls.
>
> So we didn't own the mutex and an attempt to acquire it determined it
> was locked by someone else (not us), and then the kernel returned
> EDEADLK, which doesn't make sense because we didn't own it to begin
> with!
>
> It points to a kernel or glibc issue with PI mutexes.
It may, but not necessarily. For example, the load of __lock that
handles the recursive/error-checking case is a separate access from the
CAS, so something else may have changed __lock in the meantime (e.g., a
bug in the application that writes to the mutex's memory).
A reproducer would be really helpful. If we can't get this, we'd at
least need some information about the affected mutex: the kind of the
mutex, how it's used by the program, etc.