This is the mail archive of the cygwin-developers@cygwin.com mailing list for the Cygwin project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Successfully attached to hung "make -j2" process!


Cgf was right -- things have progressed sufficiently since 1.1.8 that
when the "make -j2" hang we've been seeing forever manifested itself
again with a cygwin1.dll built with --enable-debugging out of the
current CVS repository, I was able to successfully attach to it.
Here's some background information about the problem, followed by what
I'm seeing in the current instance of it....

We run our automated builds with "make -j2" on dual-processor
machines.  On a very regular basis, they hang and can't be killed.
When they hang, we see two or more Make processes in a parent-child
chain, separated by sh processes (we call Make recursively, so this is
understandable), and one or more of them is in state "I" (according to
ps).  Until now, I've been unable to attach to the hung processes to
find out what's going on.

I just duplicated this problem running "make -j2" on a Cygwin build
tree.  I now have two hung Make processes, one of which is in state
"I".  I can attach the debugger to both of them.  Both of them are
hung inside read().

I'm going to take a little aside to explain how the Make "job server"
works, in case you don't know, because I'm pretty certain, from what I
see in these stack traces, that the problem is related to the job
server.

When Make is run with "-j<n>" and the job server is enabled, it
creates a pipe which is shared by all Make processes -- the master
Make process run from the command line and all the children it starts
up.  It writes into the pipe <n-1> characters, representing tokens for
the jobs that are allowed to run in parallel at any given time.

When any Make process running as part of the build decides that it has
a command that it wants to execute, it reads a token from the read end
of the pipe.  When the job is finished, it writes the token back to
the write end of the pipe so that another Make process can read it and
start a job.

Both hung Make processes appear to be hung trying to read a token from
the job server pipe.

This is the child Make process, the one which ps shows in state "I":

  (gdb) thread apply all where

  Thread 4 (thread 297.0xe1):
  #0  0x77f762e9 in ?? ()
  #1  0x77f04ede in ?? ()

  Thread 3 (thread 297.0xec):
  #0  0x77f682cb in ?? ()
  #1  0x77f1cd76 in ?? ()
  #2  0x610566f9 in wait_subproc ()
      at /u/jik/cygwin-cvs/src/winsup/cygwin/sigproc.cc:1240
  #3  0x61004e9a in thread_stub (arg=0x6109c9f8)
      at /u/jik/cygwin-cvs/src/winsup/cygwin/debug.cc:97
  #4  0x77f04ede in ?? ()

  Thread 2 (thread 297.0xf0):
  #0  0x77f682cb in ?? ()
  #1  0x77f1cd76 in ?? ()
  #2  0x610563a5 in wait_sig ()
      at /u/jik/cygwin-cvs/src/winsup/cygwin/sigproc.cc:1126
  #3  0x61004e9a in thread_stub (arg=0x6109c9f8)
      at /u/jik/cygwin-cvs/src/winsup/cygwin/debug.cc:97
  #4  0x77f04ede in ?? ()

  Thread 1 (thread 297.0xbf):
  #0  0x77f67f07 in ?? ()
  #1  0x6101158f in fhandler_base::raw_read (this=0x614f3ffc, ptr=0x24ad432,
      ulen=1) at /u/jik/cygwin-cvs/src/winsup/cygwin/fhandler.cc:219
  #2  0x61011e34 in fhandler_base::read (this=0x614f3ffc, in_ptr=0x24ad432,
      in_len=1) at /u/jik/cygwin-cvs/src/winsup/cygwin/fhandler.cc:488
  #3  0x6105c004 in _read (fd=5, ptr=0x24ad432, len=1)
      at /u/jik/cygwin-cvs/src/winsup/cygwin/syscalls.cc:316
  #4  0x61085a6e in read (fd=5, buf=0x24ad432, cnt=1)
      at /u/jik/cygwin-cvs/src/newlib/libc/syscalls/sysread.c:15
  #5  0x6100ff74 in sigdelayed ()
      at /u/jik/cygwin-cvs/src/winsup/cygwin/exceptions.cc:1229
  #6  0x402a14 in execute_file_commands (file=0xa01ec78)
      at /u/jik/systems-cvs-root/cygwin/make/commands.c:362
  #7  0x421abc in remake_file (file=0xa01ec78)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:1008
  #8  0x42096b in update_file_1 (file=0xa01ec78, depth=4)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:659
  #9  0x41ea78 in update_file (file=0xa01ec78, depth=4)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:309
  #10 0x420f9d in check_dep (file=0xa01ec78, depth=4, this_mtime=1,
      must_make_ptr=0x24ad67c)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:813
  #11 0x41fb13 in update_file_1 (file=0xa01b0f0, depth=3)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:450
  #12 0x41ea78 in update_file (file=0xa01b0f0, depth=2)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:309
  #13 0x420f9d in check_dep (file=0xa01b0f0, depth=2, this_mtime=1,
      must_make_ptr=0x24ad7bc)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:813
  #14 0x41fb13 in update_file_1 (file=0xa013ba8, depth=1)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:450
  #15 0x41ea78 in update_file (file=0xa013ba8, depth=0)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:309
  #16 0x41e55d in update_goal_chain (goals=0xa013c08, makefiles=0)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:156
  #17 0x41560c in main (argc=73, argv=0x614f6df4, envp=0xa010008)
      at /u/jik/systems-cvs-root/cygwin/make/main.c:1998
  #18 0x61004294 in dll_crt0_1 ()
      at /u/jik/cygwin-cvs/src/winsup/cygwin/dcrt0.cc:770
  #19 0x61004565 in _dll_crt0 ()
      at /u/jik/cygwin-cvs/src/winsup/cygwin/dcrt0.cc:837
  #20 0x610045a4 in dll_crt0 (uptr=0x0)
      at /u/jik/cygwin-cvs/src/winsup/cygwin/dcrt0.cc:849
  #21 0x42e7d3 in cygwin_crt0 ()
  #0  0x77f762e9 in ?? ()

Line 362 in Make's commands.c is this:

  new_job (file);

I assume that new_job is hung inside the read call in this snippet of
code:

        /* Read a token.  As long as there's no token available we'll block.
           If we get a SIGCHLD we'll return with EINTR.  If one happened
           before we got here we'll return immediately with EBADF because
           the signal handler closes the dup'd file descriptor.  */

        if (read (job_rfd, &token, 1) == 1)
          {
            DB (DB_JOBS, (_("Obtained token for child 0x%08lx (%s).\n"),
                          (unsigned long int) c, c->file->name));
            break;
          }

I'm unsure of how to confirm this.  Since new_job doesn't show up in
the stack trace above (why not?), I can't enter the new_job frame and
ask for the address of "token" to confirm that it matches the address
in the call to read().

There is something odd here -- if it's trying to read from a pipe, why
is it calling into fhandle_base instead of fhandler_pipe?  Here's the
parent Make process, which does *not* show "I" in its ps output, and
which *is* calling into fhandler_pipe:

  (gdb) thread apply all where

  Thread 4 (thread 251.0xaa):
  #0  0x77f762e9 in ?? ()
  #1  0x77f04ede in ?? ()

  Thread 3 (thread 251.0xff):
  #0  0x77f682cb in ?? ()
  #1  0x77f1cd76 in ?? ()
  #2  0x610566f9 in wait_subproc ()
      at /u/jik/cygwin-cvs/src/winsup/cygwin/sigproc.cc:1240
  #3  0x61004e9a in thread_stub (arg=0x6109c9f8)
      at /u/jik/cygwin-cvs/src/winsup/cygwin/debug.cc:97
  #4  0x77f04ede in ?? ()

  Thread 2 (thread 251.0x94):
  #0  0x77f682cb in ?? ()
  #1  0x77f1cd76 in ?? ()
  #2  0x610563a5 in wait_sig ()
      at /u/jik/cygwin-cvs/src/winsup/cygwin/sigproc.cc:1126
  #3  0x61004e9a in thread_stub (arg=0x6109c9f8)
      at /u/jik/cygwin-cvs/src/winsup/cygwin/debug.cc:97
  #4  0x77f04ede in ?? ()

  Thread 1 (thread 251.0x12d):
  #0  0x77f67a5b in ?? ()
  #1  0x6104db43 in peek_pipe (s=0x24aef64, ignra=0)
      at /u/jik/cygwin-cvs/src/winsup/cygwin/select.cc:458
  #2  0x6104de44 in fhandler_pipe::ready_for_read (this=0x614f1be4, fd=5,
      howlong=4294967295, ignra=0)
      at /u/jik/cygwin-cvs/src/winsup/cygwin/select.cc:500
  #3  0x6105bf90 in _read (fd=5, ptr=0x24af052, len=1)
      at /u/jik/cygwin-cvs/src/winsup/cygwin/syscalls.cc:300
  #4  0x61085a6e in read (fd=5, buf=0x24af052, cnt=1)
      at /u/jik/cygwin-cvs/src/newlib/libc/syscalls/sysread.c:15
  #5  0x6100ff74 in sigdelayed ()
      at /u/jik/cygwin-cvs/src/winsup/cygwin/exceptions.cc:1229
  #6  0x402a14 in execute_file_commands (file=0xa02c448)
      at /u/jik/systems-cvs-root/cygwin/make/commands.c:362
  #7  0x421abc in remake_file (file=0xa02c448)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:1008
  #8  0x42096b in update_file_1 (file=0xa02c448, depth=4)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:659
  #9  0x41ea78 in update_file (file=0xa02c448, depth=4)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:309
  #10 0x420f9d in check_dep (file=0xa02c448, depth=4, this_mtime=1,
      must_make_ptr=0x24af29c)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:813
  #11 0x41fb13 in update_file_1 (file=0xa01e898, depth=3)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:450
  #12 0x41ea78 in update_file (file=0xa01e898, depth=2)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:309
  #13 0x420f9d in check_dep (file=0xa01e898, depth=2, this_mtime=1,
      must_make_ptr=0x24af3dc)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:813
  #14 0x41fb13 in update_file_1 (file=0xa019080, depth=1)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:450
  #15 0x41ea78 in update_file (file=0xa019080, depth=0)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:309
  #16 0x41e55d in update_goal_chain (goals=0xa027ca8, makefiles=0)
      at /u/jik/systems-cvs-root/cygwin/make/remake.c:156
  #17 0x41560c in main (argc=2, argv=0x614f621c, envp=0xa010008)
      at /u/jik/systems-cvs-root/cygwin/make/main.c:1998
  #18 0x61004294 in dll_crt0_1 ()
      at /u/jik/cygwin-cvs/src/winsup/cygwin/dcrt0.cc:770
  #19 0x61004565 in _dll_crt0 ()
      at /u/jik/cygwin-cvs/src/winsup/cygwin/dcrt0.cc:837
  #20 0x610045a4 in dll_crt0 (uptr=0x0)
      at /u/jik/cygwin-cvs/src/winsup/cygwin/dcrt0.cc:849
  #21 0x42e7d3 in cygwin_crt0 ()

>From what I've observed, I have some theories about what might be
going wrong, but I'm unsure of how to proceed at investigating them:

1) Somehow Cygwin is getting confused about a file descriptor and
   forgets that it's a pipe, so it can't read a token from it.

2) A token is getting "lost" somewhere inside the Cygwin or Windows
   pipe implementation.

3) If a process writes data into a pipe and then exits before the
   character is read out of the pipe, is the data lost?  If so, then
   this could explain how tokens are getting lost.

I'm probably not going to have any more time to play with this today.
and time will be scarce next week too.  However, absent any
suggestions from anyone on the list, when I have time, I'll test
whether (3) is a problem by writing some programs which write data to
and from pipes and seeing what happens when one end exits before the
other end reads.  (2) will be slightly harder to test -- I'll have to
write programs which do all kinds of forks and execs and writing to
and from pipe file descriptors and see if any of the data written to
the pipe gets lost.  I'm unsure of how to investigate (1) further.

Thoughts?

  jik


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]