This is the mail archive of the gdb-patches@sourceware.org mailing list for the GDB project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
sig != GDB_SIGNAL_0 failed assertion stepping program on GNU/Linux

From: Joel Brobecker <brobecker at adacore dot com>
To: gdb-patches at sourceware dot org
Date: Tue, 4 Aug 2015 11:07:45 -0700
Subject: sig != GDB_SIGNAL_0 failed assertion stepping program on GNU/Linux
Authentication-results: sourceware.org; auth=none
Hello,

I have an issue debugging a program which uses threads, and which
only reproduces on SuSE 10; I'm not sure whether this is thanks
to pure luck, or whether something is specific to that particular
distro. The test is also sufficiently racy that it does not happen
all the time. Unfortunately, I can't share the code, and attempts at
creating another reproducer I can share have failed miserably.

I think I understand what's happening, but I am not sure what GDB ought
to be doing in that scenario.

In our scenario, we are trying to next/step our program after
having stopped at the start of its main subprogram:

    % gdb -q .obj/gprof/main
    (gdb) start
    (gdb) n
    (gdb) step
    [...]/infrun.c:2391: internal-error: resume: Assertion `sig != GDB_SIGNAL_0' failed.

The summary:

    The issue is related to the use of the _Unwind_DebugHook breakpoint.
    More precisely, the "next" happens to stop the inferior at an
    address which is the return address from the _Unwind_DebugHook, and
    then the "step" operation triggers the _Unwind_DebugHook breakpoint
    for another thread, before the initial thread had a chance to
    advance at all.  The handling of the _Unwind_DebugHook breakpoint
    causes GDB to insert a "bp_step_resume" breakpoint at the function's
    return address again, which happens to be the same as in the
    previous _Unwind_DebugHook call.  So, when GDB tries to step the
    initial thread again, it finds that it is stepping from a location
    that has a breakpoint with SIG being GDB_SIGNAL_0.

The detailed analysis so far:

At the end of the "start" command, we are stopped at the start
of function Main in main.adb.


    (gdb) start
    Temporary breakpoint 1 at 0x805451e: file /[...]/main.adb, line 57.
    Starting program: /[...]/main
    [New Thread 0xb7e5eba0 (LWP 28377)]
    [New Thread 0xb7c5aba0 (LWP 28378)]
    [New Thread 0xb7a56ba0 (LWP 28379)]

    Temporary breakpoint 1, main () at /[...]/main.adb:57
    [...]

There are 4 threads in total, and we are in the main thread (which is
thread 1):

    (gdb) info thread
      Id   Target Id         Frame
      4    Thread 0xb7a56ba0 (LWP 28379) 0xffffe410 in __kernel_vsyscall ()
      3    Thread 0xb7c5aba0 (LWP 28378) 0xffffe410 in __kernel_vsyscall ()
      2    Thread 0xb7e5eba0 (LWP 28377) 0xffffe410 in __kernel_vsyscall ()
    * 1    Thread 0xb7ea18c0 (LWP 28370) main ()
        at /glinos.a/users/brobecke/ex/O701-042/Pre_Post_Sub/main.adb:57

All the logs below reference Thread ID/LWP, but I think it'll be easier
to talk about the the thread by thread number. For instance, thread 1
is LWP 28370 while thread 3 is LWP 28378. So, I will translate in my
explanations the LWPs into thread numbers.

Back to our program. At this point, we attempt a "next" (from thread 1),
and here is what happens:

    (gdb) n
    infrun: clear_proceed_status_thread (Thread 0xb7a56ba0 (LWP 28379))
    infrun: clear_proceed_status_thread (Thread 0xb7c5aba0 (LWP 28378))
    infrun: clear_proceed_status_thread (Thread 0xb7e5eba0 (LWP 28377))
    infrun: clear_proceed_status_thread (Thread 0xb7ea18c0 (LWP 28370))
    infrun: proceed (addr=0xffffffff, signal=GDB_SIGNAL_DEFAULT)
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x805451e
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28370.0 [Thread 0xb7ea18c0 (LWP 28370)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x8054523

We've resumed thread 1 (LWP 28370), and received in return a signal
that the same thread stopped slightly further. It's still in the range
of instructions for the line of source we started the "next" from,
as evidenced by the following trace...

    infrun: stepping inside range [0x805451e-0x8054531]

... and thus, we decide to continue stepping the same thread:

    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x8054523
    infrun: prepare_to_wait

That's when we get an event from a different thread (thread 3):

    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x80782d0
    infrun: context switch
    infrun: Switching context from Thread 0xb7ea18c0 (LWP 28370) to Thread 0xb7c5aba0 (LWP 28378)

... which we find to be at the address where we set a breakpoint
on "the unwinder debug hook" (namely "_Unwind_DebugHook"). That's
why GDB reports for this event that this is...

    infrun: BPSTAT_WHAT_SET_LONGJMP_RESUME

... and anlyzing the arguments passed to this function, it finds
that the return address is 0x80542a2. That's what the next trace
means, when it says:

    infrun: exception resume at 80542a2

The trace above also indicates that GDB created an internal
"bp_step_resume" breakpoint at that address. The address seems
innocent right now, but it'll become important later on, during
the step.

And now that the event has been analyzed and we've determmined
that we stopped at an exception breakpoint, GDB needs to step
thread 3 past the breakpoint it just hit. Thus, it temporarily
disables the exception breakpoint, and requests a step of that
thread:

    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=1, current thread [Thread 0xb7c5aba0 (LWP 28378)] at 0x80782d0
    infrun: prepare_to_wait

We then get a notification, still from thread 3, that it's now past
that breakpoint...

    infrun: prepare_to_wait
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x8078424

... so we can resume what we were doing before, which is single-stepping
thread 1 until we get to a new line of code:

    infrun: switching back to stepped thread
    infrun: Switching context from Thread 0xb7c5aba0 (LWP 28378) to Thread 0xb7ea18c0 (LWP 28370)
    infrun: expected thread still hasn't advanced
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x8054523

The "resume" log above shows that we're resuming thread 1 from
where we left off (0x8054523). We get one more stop at 0x8054529,
which is still inside our stepping range so we go again. That's
when we get the following event, from thread 3:

    infrun: prepare_to_wait
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x80542a2

Now the stop_pc adddres is interesting, because it's the address
of "exception resume" breakpoint. When GDB sees this, it knows

    infrun: context switch
    infrun: Switching context from Thread 0xb7ea18c0 (LWP 28370) to Thread 0xb7c5aba0 (LWP 28378)
    infrun: BPSTAT_WHAT_CLEAR_LONGJMP_RESUME

... and since the location is at a different line of code,
this is where it decide the "next" operation should stop:

    infrun: stop_waiting
    [Switching to Thread 0xb7c5aba0 (LWP 28378)]
    0x080542a2 in inte_tache_rt.ttache_rt (
        <_task>=0x80968ec <inte_tache_rt_inst.tache2>)
        at /[...]/inte_tache_rt.adb:54
    54            end loop;

So far, things seem to be all normal. But the important element
to consider, here, is the fact that we stopped at an address that
also happens to be the address of the "exception resume" breakpoint.

Now that the plot has been set, we can try to "step", and see what
happens when things do not go well. First, GDB resumes the current
thread, which is now thread 3:

    (gdb) step
    infrun: clear_proceed_status_thread (Thread 0xb7a56ba0 (LWP 28379))
    infrun: clear_proceed_status_thread (Thread 0xb7c5aba0 (LWP 28378))
    infrun: clear_proceed_status_thread (Thread 0xb7e5eba0 (LWP 28377))
    infrun: clear_proceed_status_thread (Thread 0xb7ea18c0 (LWP 28370))
    infrun: proceed (addr=0xffffffff, signal=GDB_SIGNAL_DEFAULT)
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7c5aba0 (LWP 28378)] at 0x80542a2

Enter thread 4, whom we haven't heard of yet, who now triggers
our exception breakpoint at _Unwind_DebugHook...

    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28379.0 [Thread 0xb7a56ba0 (LWP 28379)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x80782d0
    infrun: context switch
    infrun: Switching context from Thread 0xb7c5aba0 (LWP 28378) to Thread 0xb7a56ba0 (LWP 28379)
    infrun: BPSTAT_WHAT_SET_LONGJMP_RESUME

... and once more, GDB finds that the "exception resume" breakpoint
should be set at 0x80542a2:

    infrun: exception resume at 80542a2

Same dance as before, we need to single-step that thread out of
the breakpoint, so:

    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=1, current thread [Thread 0xb7a56ba0 (LWP 28379)] at 0x80782d0
    infrun: prepare_to_wait
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28379.0 [Thread 0xb7a56ba0 (LWP 28379)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x8078424

... at which point we get back to where we left off again, which is
stepping thread 3 because it hasn't "moved" yet:

    infrun: switching back to stepped thread
    infrun: Switching context from Thread 0xb7a56ba0 (LWP 28379) to Thread 0xb7c5aba0 (LWP 28378)
    infrun: expected thread still hasn't advanced
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7c5aba0 (LWP 28378)] at 0x80542a2

The error happens during the resume, because we're trying to step
from a location which has a breakpoint...

  if (execution_direction != EXEC_REVERSE
      && step && breakpoint_inserted_here_p (aspace, pc))

... and the code has a very long explanation of why this can only
happen if we have a signal to deliver (hence the assertion).
Note that the "exception resume" breakpoint has an additional
condition that limits its effectiveness to thread 4 only, which
is the thread that hit the _Unwind_DebugHook breakpoint. But,
at the target level, breakpoints apply to all thread, and thread-specific
handling is done by GDB, by automatically restarting a program
when a non-matching thread hits the thread-specific breakpoint.

I am not really sure what to think of that, whether the assertion
is just too strict, or whether we need special handling of that
situation. In particular, I'm wondering whether the code in the
"if" requires the assertion to be true in order to be correct.
It does:

      tp->stepped_breakpoint = 1;

      /* Most targets can step a breakpoint instruction, thus
         executing it normally.  But if this one cannot, just
         continue and we will hit it anyway.  */
      if (gdbarch_cannot_step_breakpoint (gdbarch))
        step = 0;

Just for kicks, I tried commenting out the assertion, and after
a couple of tries, I got at the end of the step...

    (gdb) step
    /[...]/infrun.c:4865: internal-error: process_event_stop_test: Assertion `ecs->event_thread->control.exception_resume_breakpoint != NULL' failed.

Any suggestion? Maybe catch this specific case where we're trying
to single-step a thread from the exception_resume_breakpoint by
adding some code that would single-stepp the one thread alone out
of that address, before resuming a more normal single-step operation?

Thanks!
-- 
Joel
Follow-Ups:
- Re: sig != GDB_SIGNAL_0 failed assertion stepping program on GNU/Linux
  - From: Pedro Alves
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]