Pthread Cancellation Race Condition Problems

The current approach to implementing pthread cancellation points is to enable asynchronous cancellation prior to making the syscall, and restore the previous cancellation type once the syscall returns. [NOTE: check if this fulfills POSIX requirements or not] And as described in bz#12683 bug report, current implementation suffers from some issues:

Cancellation can act after the syscall has returned from kernelspace, but before userspace saves the return value. This results is a resource leak if the syscall allocated a resource, and there is no way to patch over it with cancellation handlers. Even if the syscall did not allocate a resource, it may have had an effect (like consuming data from a socket/pipe/terminal buffer) which the application will never see. This is shown by the example:

/* bz12683-test1.c: demonstration of file description leak.
 
   Build with: gcc -Wall bz12683-test1.c -o bz12683-test1 -lpthread  */

#include <pthread.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

void *
writeopener (void *arg)
{
  int fd;
  for (;;)
    {
      fd = open (arg, O_WRONLY);
      close (fd);
    }
}

void *
leaker (void *arg)
{
  int fd = open (arg, O_RDONLY);
  pthread_setcancelstate (PTHREAD_CANCEL_DISABLE, 0);
  close (fd);
  return 0;
}


#define ITER_COUNT 10000

int
main ()
{
  pthread_t td, bg;
  struct stat st;
  char tmp[] = "/tmp/cancel_race_XXXXXX";
  int i, leaks = 0;

  (void)mktemp (tmp);
  mkfifo (tmp, 0600);
  srand (1);
  pthread_create (&bg, 0, writeopener, tmp);
  for (i = 0; i < ITER_COUNT; i++)
    {
      pthread_create (&td, 0, leaker, tmp);
      nanosleep ((&(struct timespec)
                  {
                  .tv_nsec = rand () % 100000}), 0);
      pthread_cancel (td);
      pthread_join (td, 0);
    }
  unlink (tmp);
  for (i = 3; i < 1024; i++)
    {
      if (!fstat (i, &st))
        leaks++, printf ("leaked fd %d\n", i);
    }
  return !!leaks;
}

Running on current GLIBC it shows a result like below. The total number of leaked file descriptors can vary since it is a timing issue with randomness:

$ ./bz12683-test1 leaked fd 3 leaked fd 4 leaked fd 5 leaked fd 6 leaked fd 7 leaked fd 8 leaked fd 9 leaked fd 10 leaked fd 11 ...

If a signal is handled while the thread is blocked at a cancellable syscall, the entire signal handler runs with asynchronous cancellation enabled. This could be extremely dangerous, since the signal handler may call functions which are async-signal-safe but not async-cancel-safe. Even worse, the signal handler may call functions which are not even async-signal-safe (like stdio) if it knows the interrupted code could only be using async-signal-safe functions, and having a thread asynchronously terminated while modifying such functions' internal data structures could lead to serious program malfunction. This is shown by the example:

/* bz12683-test2.c : demonstration of signal handler being executed in
                     asynchronous cancellation enabled.

   build with: gcc -Wall bz12683-test2.c -o bz12683-test2 -pthread  */

#define _XOPEN_SOURCE 700
#include <unistd.h>
#include <stdio.h>
#include <pthread.h>
#include <errno.h>
#include <signal.h>

volatile int x;

static void
handler (int sig)
{
  for (;;)
    x++;
}

static void
bye (void *arg)
{
  printf ("x=%d\n", x);
}

static void *
start (void *foo)
{
  sigset_t set;
  sigemptyset (&set);
  sigaddset (&set, SIGUSR1);
  pthread_cleanup_push (bye, 0);
  pthread_sigmask (SIG_UNBLOCK, &set, 0);
  for (;;)
    sleep (1);
  pthread_cleanup_pop (0);
  return 0;
}

int
main ()
{
  pthread_t t;
  struct sigaction sa = { { 0 } } ;
  sigemptyset (&sa.sa_mask);
  sa.sa_handler = handler;
  sa.sa_flags = SA_RESTART;
  sigaction (SIGUSR1, &sa, 0);
  sigaddset (&sa.sa_mask, SIGUSR1);
  pthread_sigmask (SIG_BLOCK, &sa.sa_mask, 0);

  pthread_create (&t, 0, start, 0);
  nanosleep (&(struct timespec){.tv_nsec = 100000000}, 0);
  kill (getpid (), SIGUSR1);
  nanosleep (&(struct timespec){.tv_nsec = 100000000}, 0);
  pthread_cancel (t);
  pthread_join (t, 0);
  return 0;
}

Running on current GLIBC it yields, when it should hang:

$ ./bz12683-test2 x=38589543 $ ./bz12683-test2 x=39579798

These issues are not covered in any testcase currently.

Current implementation

Current pthread implementation for cancellation resolves around both arch-specific code on wrapper syscalls and also on arch independent NPTL code. The arch-specific code is provided in the sysdep-cancel.h headers through a set of macros to override from default architecture sysdep.h ones. In syscall auto-generation, if the function is a cancellable one (SYSCALL_CANCELLABLE defined to 1) sysdeps/unix/syscall-template.S will include sysdeps-cancel.h instead of sysdep.h

The basic strategy each arch should be provide in 'sysdep-cancel.h' headers is as following:

/* Code em pseudo-algorithm to simplify explanation.  */
PSEUDO(name, syscall, args)
  ENTRY (name)
    if (!SINGLE_THREAD_P)
      goto pseudo_cancel;
  DO_CALL (syscall_name)
  return

pseudo_cancel:
  CANCELLATION_ENABLE
  CALL (syscall_name)
  CANCELLATION_DISABLE

Both CANCELLATION_ENABLE and CANCELLATION_DISABLE will be library internal symbols (pthread_enable_asynccancel, libc_enable_asynccancel, for instance) for each module that provides or uses cancelable syscalls. The SINGLE_THREAD_P checks if program is in single-thread mode (loader and default case) and avoids all the cancellation save/restore altogether.

The cancellation enable/disable symbols are implemented at:

nptl/cancellation.c: default functions used by libpthread build.
nptl/libc-cancellation.c: used on both libc.so and loader builds, current implementation just overrides NPTL's one with different symbol names.
sysdeps/nptl/librt-cancellation.c: used on librt.so, and current implementation just overrides NPTL's one with different symbol names.
sysdeps/unix/sysv/linux/x86_64/cancellation.S: x86_64 assembly implementation of nptl/cancellation.c.
sysdeps/unix/sysv/linux/x86_64/libc-cancellation.S: x86_64 assembly implementation of nptl/libc-cancellation.c.
sysdeps/unix/sysv/linux/x86_64/librt-cancellation.S: x86_64 assembly implementation of sysdeps/nptl/librt-cancellation.c.
nptl/nptl-init.c: GLIBC defines SIGCANCEL as SIG32 and install a handler on for asynchronous cancellation

Proposed solution

As described in comment #16 there are 5 points at which the cancellation signal could arrive:

Before the final testcancel before the syscall is made.
Between the testcancel and the syscall.
While the syscall is blocked and no side effects have yet taken place.
While the syscall is blocked but with some side effects already having taken place (e.g. a partial read or write).
After the syscall has returned.

And basically GLIBC wants to act on cancellation in cases 1, 2, and 3 but not in case 4 or 5. And proposed solution, following Rich Felker suggestion is:

Handling case 1 is trivial: do a conditional branch based on whether the thread has received a cancellation request;
Case 2 can be caught by the signal handler determining that the saved program counter (from the ucontext_t) is in some address range beginning just before the "testcancel" and ending with the syscall instruction.
In this case, except for certain syscalls that ALWAYS fail with EINTR even for non-interrupting signals, the kernel will reset the program counter to point at the syscall instruction during signal handling, so that the syscall is restarted when the signal handler returns. So, from the signal handler's standpoint, this looks the same as case 2, and thus it's taken care of.
In this case, the kernel cannot restart the syscall; when it's interrupted by a signal, the kernel must cause the syscall to return with whatever partial result it obtained (e.g. partial read or write). In this case, the saved program counter points just after the syscall instruction, so the signal handler won't act on cancellation.
This one is equal to 4. since the program counter is past the syscall instruction already.

Another case that needs handling is syscalls that fail with EINTR even when the signal handler is non-interrupting. In this case, the syscall wrapper code can just check the cancellation flag when the errno result is EINTR, and act on cancellation if it's set.

Note: check if an exception needs to be made for close(), where EINTR should be treated as EINPROGRESS and thus not permit cancellation to take place.

GLIBC adjustment to proposed solution

The proposed solution required an overhaul of both current GLIBC cancelable syscall code and pthread cancellation mechanism:

Remove the enable_asynccancel/disable_asynccancel function usage in syscall definition and instead make them call a common symbol that will check if cancellation is enabled, call the arch-specific cancellable entry-point and cancel the thread when required.
Provide a arch-specific symbol that contains global markers. These markers will be used in SIGCANCEL handler to check if the interruption has been called in a valid syscall and if the syscalls has been completed or not.
Rewrite SIGCANCEL asynchronous handler to check for both cancelling type and if current IP from signal handler falls between the global markes and act accordingly.
Adjust nptl/pthread_cancel.c to send an signal instead of acting directly. This avoid synchronization issues about updating the cancellation status and also focus the logic on signal handler and cancellation syscall code.
Adjust pthread code to replace CANCEL_ASYNC/CANCEL_RESET calls to appropriated cancelable futex syscalls.
Adjust libc code to replace LIBC_CANCEL_ASYNC/LIBC_CANCEL_RESET to appropriated cancelable syscalls.
Rewrite sysdeps/unix/sysv/linux/not-cancel.h to calls inline syscalls for non-cancelable versions.
Adjust 'lowlevellock.h' arch-specific implementations to provide cancelable futex calls (used in libpthread code).