This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH] Remove signal handling for nanosleep (bug 16364)
- From: Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- To: Florian Weimer <fweimer at redhat dot com>
- Cc: libc-alpha at sourceware dot org
- Date: Mon, 9 Nov 2015 14:30:43 -0200
- Subject: Re: [PATCH] Remove signal handling for nanosleep (bug 16364)
- Authentication-results: sourceware.org; auth=none
- References: <1447023171-31542-1-git-send-email-adhemerval dot zanella at linaro dot com> <56405109 dot 9070404 at redhat dot com> <56408F14 dot 1040600 at linaro dot org> <56409561 dot 7050707 at redhat dot com> <5640A89D dot 80804 at linaro dot org> <5640AD98 dot 5030105 at redhat dot com>
On 09-11-2015 12:28, Florian Weimer wrote:
> On 11/09/2015 03:07 PM, Adhemerval Zanella wrote:
>
>>> Could you write a glibc test case for this? I don't feel comfortable
>>> with removing the workaround without a test, and I can't find an
>>> existing one.
>>
>> What about:
>>
>> [BZ #16364]
>> * sysdeps/unix/sysv/linux/sleep.c (__sleep): Simplify code to use
>> nanosleep without requiring to handle spurious wakeups.
>> * posix/tst-nanosleep-signal.c: New file.
>> * posix/Makefile (test): Add tst-nanosleep-signal.
>
> Thanks, but see below for details.
>
>> + int ret;
>> + pid_t f = fork ();
>> + if (f == 0)
>> + {
>> + // child
>> + struct timespec tv = { .tv_sec = 0, .tv_nsec = 250000000UL };
>> + nanosleep (&tv, &tv);
>> + exit (0);
>
>
> This needs to be “_exit (0);”, to avoid running termination handlers twice.
>
I will change it.
>> + }
>> + else
>> + {
>> + // parent
>> + struct timespec tv = { .tv_sec = 0, .tv_nsec = 500000000UL };
>> + ret = nanosleep (&tv, &tv);
>> + }
>> + return ret;
>> +}
>
> The parent needs to check for fork errors.
>
Right, I will change it.
>> +
>> +void
>> +check_nanosleep_sigdfl(void)
>
> Space before paren.
>
>
>> +{
>> + /* For SIG_DFL nanosleep should not be interrupted by SIGCHLD. */
>> + signal(SIGCHLD, SIG_DFL);
>
> Space before paren.
>
>> + if (check_nanosleep_base ())
>> + {
>> + puts ("error: nanosleep was interrupted with SIG_DFL");
>> + errors = 1;
>> + }
>> +}
>
> This should check that errno is EINTR.
I do not think this check is strictly required: the other failures (EFAULT and
EINVAL) should also indicate something very flaky on the system.
>
>> +
>> +void
>> +check_nanosleep_dummy(void)
>
> Space before paren.
>
>> +{
>> + /* With a dummy handler nanosleep should be interrupted. */
>> + signal(SIGCHLD, dummy);
>
> Space before paren.
>
>> + int ret = check_nanosleep_base ();
>> + if (ret == 0)
>> + {
>> + puts ("error: nanosleep was not interrupted with dummy handler");
>> + errors = 1;
>> + }
>> +}
>
> Missing EINTR check.
Also the idea here is to indicate any failure for nanosleep (same as before).
>
>> +
>> +void
>> +check_nanosleep_sigign(void)
>
> Space before paren.
>
>> +{
>> + /* For SIG_IGN nanosleep should not be interrupted by SIGCHLD. */
>> + signal(SIGCHLD, SIG_IGN);
>
> Space before paren.
>
> The test is racy in two ways: the child could exit before nanosleep in
> the parent starts, or the child exit could be delayed after nanosleep in
> the parent ends. I'm not sure if there is way to make this more reliable.
I am aware of that, that's why I try to mitigate this by making the time on
parent twice for the child and some magnitude higher than the syscalls itself.
However I also can't see a way to make this test entirely reliable: even by
doing some synchronization (either by shared semaphores or pthread barrier)
and/or sending the SIGCHLD directly using signal the two race scenarios you
describe will still have a small window to occur. I am not sure which
strategy will be better and I think we should not rely or add hacks to try
to mitigate for such kernel failures.
>
> The larger question is whether the EINTR check is sufficient, or if a
> time-based check is needed as well. That is, if the kernel bug
> consistent of silent early termination of nanosleep.
My understanding is on old kernels the nanosleep calls was not restarted
in a nanosleep call (with the restart_syscall), so it nanosleep will
early terminate.
>
> Florian
>