This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.

[Bug dyninst/15443] New: deal with mutatees that die during our handlers

From: "jistone at redhat dot com" <sourceware-bugzilla at sourceware dot org>
To: systemtap at sourceware dot org
Date: Tue, 07 May 2013 19:40:09 +0000
Subject: [Bug dyninst/15443] New: deal with mutatees that die during our handlers
Auto-submitted: auto-generated

http://sourceware.org/bugzilla/show_bug.cgi?id=15443

             Bug #: 15443
           Summary: deal with mutatees that die during our handlers
           Product: systemtap
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: dyninst
        AssignedTo: systemtap@sourceware.org
        ReportedBy: jistone@redhat.com
    Classification: Unclassified


We can assume for a moment that our runtime is perfect and never causes the
mutatee to die.  But what happens if a threaded mutatee exits (by signal or by
choice) or execs while one of its threads is in one of our probe handlers?  I
expect, at a minimum, that the context mutex will be left locked forever.  It's
possible for much more to be left in an inconsistent state too.

(I've been trying to debug some weird issues during testsuite runs, and while
I'm not certain this is the root cause, it does seem to be a real possibility.)

Maybe we could try to capture all exit/exec paths and "quiesce" other threads
(at least as far as our state is concerned).  I suspect that this would require
heroic effort though, and would probably still be imperfect (e.g. SIGKILL is
absolute).

For mutexes, there is pthread_mutexattr_setrobust(), which we should probably
use.  This will at least report EOWNERDEAD to the next locker, and from there we
can decide whether recovery is possible.  That decision is probably different
for each mutex-locked area we have, e.g. an owner-dead lock on a context struct
can probably be repurposed, but one on the transport seems worse.  But even
handling EOWNERDEAD as a fatal error would be better than just hanging.

For rwlocks, I see no equivalent of setrobust().  These are used for global
variables, so we should probably just add timeouts.  (Not a trylock-wait-retry
loop as in the kernel - I think just a plain pthread_rwlock_timed[rd|wr]lock()
is fine.)

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
