This is the mail archive of the gdb-patches@sources.redhat.com mailing list for the GDB project.



problem unwinding past pthread_cond_wait() on x86 RedHat 9.0


Hello,

While trying to move from GDB 5.3 to 6.0, we noticed a small
"regression" in the backtrace after switching to a thread blocked
on a pthread_cond_wait() call. This occurs only on RH9 (we also
tried RH8 and RH7, which are not affected).

To reproduce the problem, compile the attached C program with:

        % gcc -D_REENTRANT -g -o pt pt.c -lpthread

(I'll be more than happy to contribute that testcase if you are
interested. Coming from the Ada world, where tasking is really easy,
I am not too familiar with pthreads, especially in terms of portability,
but I welcome criticism :)

Then do the following:

        % gdb pt
        (gdb) break break_me
        (gdb) run
        (gdb) thread 2
        (gdb) bt

The thread that I created should be blocked on a pthread_cond_wait
waiting for a condition to be signaled. But we never signal this
condition, so it should wait there forever. So the backtrace I
expected should look like this:

        #0 in pthread_cond_wait ()
        #1 in cond_wait ()
        #2 in noreturn ()
        #3 in forever_pthread ()
        (more frames follow)

Instead, here is all that we get:

        #0  0xffffe002 in ?? ()
        #1  0x4002d379 in pthread_cond_wait@@GLIBC_2.3.2 ()
           from /lib/tls/libpthread.so.0

With GDB 5.3, we used to get:

        #0  0xffffe002 in ?? ()
        #1  0x4002b2b6 in start_thread () from /lib/tls/libpthread.so.0

Which isn't any better, and that explains why I put "regression" in
quotes in the first paragraph. The change of behavior is much more
noticeable, and for the worse, when we debug an Ada program: instead of
the not-so-correct backtrace we used to get with 5.3 (missing a frame or
two between #0 and #1):

        #0  0xffffe002 in ?? ()
        #1  0x0804fb29 in system__tasking__rendezvous__accept_trivial ()
        #2  0x08049f48 in task_switch.callee (<_task>=0x806e708)
                          at task_switch.adb:29
        #3  0x08053394 in system__tasking__stages__task_wrapper ()
        #4  0x4002b2b6 in start_thread () from /lib/tls/libpthread.so.0

We now get almost nothing:

        #0  0xffffe002 in ?? ()
        #1  0x4002d379 in pthread_cond_wait@@GLIBC_2.3.2 ()
           from /lib/tls/libpthread.so.0

I think I found the source of the problem when looking at the assembly
code for pthread_cond_wait in libpthread.so. Here is what it looks like:

0x4002d2e0 <pthread_cond_wait+0>:          push   %edi
0x4002d2e1 <pthread_cond_wait+1>:          push   %esi
0x4002d2e2 <pthread_cond_wait+2>:          push   %ebx
[a bunch of instructions, including conditional jumps]
0x4002d2fa <pthread_cond_wait+26>:         pushl  0x14(%esp,1)
[...]
0x4002d324 <pthread_cond_wait+68>:         sub    $0x20,%esp
[some other bunch of instructions, and then finally the code where we stopped:]
0x4002d372 <pthread_cond_wait+146>:        call   *%gs:0x10
0x4002d379 <pthread_cond_wait+153>:        sub    $0xc,%ebx

So we are at pthread_cond_wait+146, and the i386 frame code is trying to
unwind past this function. It looks at the function prologue and finds
that the function is frameless, so it falls back to finding the "frame"
base using the SP instead of the base pointer. It then analyzes the
prologue, finds the 3 push instructions saving certain registers, and
therefore determines that the offset between the SP and the BP must be
these 12 bytes. Unfortunately, this misses the pushl and the sub
instructions, which moved the SP by another 36 bytes! So the unwinder
got the wrong frame base, and therefore the wrong address from which to
fetch the saved EIP, which led it to stop because the EIP value fetched
was NULL.

I tried an experiment, running the debugger under debugger
supervision, and I assumed, despite the numerous conditional jumps
everywhere, that the "pushl" and the "sub" instructions were executed
exactly once. So I manually changed the offset to be 12 + 4 + 32 = 48
(so cache->offset was 44), and voila!

    #0  0xffffe002 in ?? ()
    #1  0x4002d379 in pthread_cond_wait@@GLIBC_2.3.2 ()
       from /lib/tls/libpthread.so.0
    #2  0x0804855e in cond_wait (cond=0x4083484c, mut=0x4083487c) at pt.c:9
    #3  0x080485a9 in noreturn () at pt.c:24
    #4  0x080485b9 in forever_pthread (unused=0x0) at pt.c:30
    #5  0x4002b2b6 in start_thread () from /lib/tls/libpthread.so.0
    #6  0x420de407 in clone () from /lib/tls/libc.so.6

The problem I am now trying to solve is the following: How can we fix
the i386 unwinder to be smart enough to handle this wicked function?
Is this even possible? The only possibility I see right now is with
dwarf2 CFI, but then the problem I foresee is that we can not help
the people using the stock RH9. If the only hope is with CFI, then
they will have to update their pthread library...

What do you guys think?

-- 
Joel

Attachment: pt.c
Description: Text document

