This is the mail archive of the libc-alpha
mailing list for the glibc project.
Re: Unwarranted assumption in tst-waitid, or a kernel bug?
- From: Roland McGrath <roland at redhat dot com>
- To: ppluzhnikov at google dot com (Paul Pluzhnikov)
- Cc: Oleg Nesterov <oleg at redhat dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Tue, 21 Sep 2010 18:44:18 -0700 (PDT)
- Subject: Re: Unwarranted assumption in tst-waitid, or a kernel bug?
- References: <20100921174328.91D2C764F8@ppluzhnikov.mtv.corp.google.com>
Oleg, see http://sourceware.org/ml/libc-alpha/2010-09/msg00030.html
The standard that would specify this is POSIX (http://www.unix.org/version4/).
There is a lot of verbiage about SIGCHLD and wait, but I don't think it
does actually specify what the test case is assuming. There is more
explicit wording that has to do with the inverse assumption: that when
SIGCHLD is delivered, the wait result will be immediately available.
There is some other mention that could be read obliquely to suggest the
expectation that the SIGCHLD would be pending when wait succeeds. That
is for the case when SIGCHLD is blocked before a wait call, where some
systems actually remove the pending SIGCHLD when there is no longer any
process with a wait status ready to report. But all that stuff is for
implementation-defined behavior, and Linux has never done that.
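
The inverse assumption mentioned above (once SIGCHLD has arrived, the wait
result is available right away) can be illustrated with a minimal sketch.
It assumes Linux and a single-threaded process; the helper names
status_ready_after_sigchld and sigchld_noop are hypothetical and do not
come from tst-waitid.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* No-op handler: keeps SIGCHLD from being SIG_IGN, which would make the
   kernel reap children automatically and send no signal at all.  */
static void
sigchld_noop (int sig)
{
  (void) sig;
}

/* Hypothetical helper, not part of tst-waitid.  Returns 1 if the child's
   exit status was already available once SIGCHLD had been accepted,
   0 if it was not, and -1 on error.  */
int
status_ready_after_sigchld (void)
{
  sigset_t chld, old_mask;
  struct sigaction sa, old_sa;
  siginfo_t sig, st;
  int result = -1;

  memset (&sa, 0, sizeof sa);
  sa.sa_handler = sigchld_noop;
  sigemptyset (&sa.sa_mask);
  sigemptyset (&chld);
  sigaddset (&chld, SIGCHLD);

  /* Block SIGCHLD so that it stays pending until sigwaitinfo accepts it.  */
  if (sigprocmask (SIG_BLOCK, &chld, &old_mask) != 0)
    return -1;
  if (sigaction (SIGCHLD, &sa, &old_sa) != 0)
    {
      sigprocmask (SIG_SETMASK, &old_mask, NULL);
      return -1;
    }

  pid_t pid = fork ();
  if (pid == 0)
    _exit (0);

  if (pid > 0)
    {
      /* Accept the signal first, then poll for the status without blocking.  */
      if (sigwaitinfo (&chld, &sig) == SIGCHLD)
        {
          memset (&st, 0, sizeof st);
          if (waitid (P_PID, pid, &st, WEXITED | WNOHANG) == 0)
            result = (st.si_pid == pid);
        }
      /* If the status was not collected above, reap the child now.  */
      if (result != 1)
        waitid (P_PID, pid, &st, WEXITED);
    }

  sigaction (SIGCHLD, &old_sa, NULL);
  sigprocmask (SIG_SETMASK, &old_mask, NULL);
  return result;
}
```

Zeroing the siginfo_t before the WNOHANG call matters here: when no child
is ready, waitid still returns 0, and si_pid is the only way to tell the
two cases apart.
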
If you want to get a firm clarification on the POSIX standard, you can
file an interpretation request with the Austin Group. (You could also
just ask on their mailing list first, <email@example.com>.)
As far as I can tell, Linux has never had a guarantee like this. From a
cursory look at the code in a few versions, I think the differences
you've seen between kernel versions are due to scheduling changes, not
that the actual local constraints in the exit/SIGCHLD/wait code paths
have changed at all.
It is the case that the SIGCHLD is sent before waking up blocked wait
calls. That is, if you were already blocked in wait when the SIGSTOP or
SIGKILL was sent, then SIGCHLD would indeed be delivered before wait
returned. But there is no way to use that in a single-threaded test
(short of also making it a POSIX timers test or something else nutty).
The sequence of events in the kernel is that first the process makes
itself available for wait (i.e. stopped or zombie), then it sends
SIGCHLD, and finally it wakes up any blocked wait calls.
What the test does (twice in the "working" variant, which AFAICT is
no different as far as assuming behavior that is not guaranteed--it just
does the same racy thing twice in a row) is kill and then immediately
wait. If the scheduling of the child is fast enough (or that of the
parent slow enough), then the wait can succeed and return fully back to
the user program before the process has sent its SIGCHLD.
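
One way a test could avoid depending on that ordering is to wait for the
signal explicitly after the wait call returns, instead of requiring that
it already be pending. The following is only a sketch under stated
assumptions: Linux, a single-threaded process, an arbitrary 5-second
limit, and hypothetical helper names (kill_then_wait_for_both,
sigchld_ignore_arg) that are not taken from tst-waitid.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* No-op handler so that SIGCHLD is neither ignored nor auto-reaped.  */
static void
sigchld_ignore_arg (int sig)
{
  (void) sig;
}

/* Hypothetical helper.  Stops a child, collects the stop status, and then
   waits (with a time limit) for the matching SIGCHLD rather than assuming
   it is already pending.  Returns 0 on success, -1 on failure.  */
int
kill_then_wait_for_both (void)
{
  sigset_t chld, old_mask;
  struct sigaction sa, old_sa;
  siginfo_t sig, st;
  struct timespec limit = { .tv_sec = 5, .tv_nsec = 0 };
  int result = -1;

  memset (&sa, 0, sizeof sa);
  sa.sa_handler = sigchld_ignore_arg;   /* no SA_NOCLDSTOP: stops notify */
  sigemptyset (&sa.sa_mask);
  sigemptyset (&chld);
  sigaddset (&chld, SIGCHLD);

  if (sigprocmask (SIG_BLOCK, &chld, &old_mask) != 0)
    return -1;
  if (sigaction (SIGCHLD, &sa, &old_sa) != 0)
    {
      sigprocmask (SIG_SETMASK, &old_mask, NULL);
      return -1;
    }

  pid_t pid = fork ();
  if (pid == 0)
    for (;;)
      pause ();

  if (pid > 0)
    {
      /* Kill and immediately wait, as the test does, but then block until
         SIGCHLD shows up instead of checking whether it is pending.  */
      if (kill (pid, SIGSTOP) == 0
          && waitid (P_PID, pid, &st, WSTOPPED) == 0
          && st.si_code == CLD_STOPPED
          && sigtimedwait (&chld, &sig, &limit) == SIGCHLD
          && sig.si_pid == pid)
        result = 0;

      /* Clean up: terminate and reap the child, then drain the SIGCHLD
         generated by its exit so none is left pending.  */
      kill (pid, SIGKILL);
      waitid (P_PID, pid, &st, WEXITED);
      sigtimedwait (&chld, &sig, &limit);
    }

  sigaction (SIGCHLD, &old_sa, NULL);
  sigprocmask (SIG_SETMASK, &old_mask, NULL);
  return result;
}
```

Because sigtimedwait returns at once when the signal is already pending,
the sketch behaves the same whichever side wins the race.
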