This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.



Hangs in realloc/free pthread lock during an Ohphone session


Hi, for many months I have been seeing strange hangs at the beginning
or the end of an Ohphone telephony session, in different places but
always in the same way.
I am working on a PowerPC NPE405L CPU from IBM with an embedded version
of Linux from MontaVista and libc-2.2.3.so.
I use a static pthread library built from
ftp.gnu.org/pub/gnu/glibc/glibc-linuxthreads-2.2.3.tar.gz

I was able to build an Ohphone version with a lot of tracing that hangs
very often in the same way: some threads stop, waiting for a lock at the
beginning of the free or realloc function, but no thread owns this lock.

There are several possible explanations for this situation:
1) an unlock call was missed
2) a restart call did not work
3) something, somewhere, wrote the value 1 into the malloc-related lock
4) something went wrong in the compare_and_swap macro, and the unlock
   was lost.
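
To make possibility 4 concrete, here is a minimal sketch of a lock whose
state is changed only through compare-and-swap. It is not the
linuxthreads code: the names sketch_lock, sketch_trylock and
sketch_unlock are hypothetical, the lock word is simplified to 0 (free)
and 1 (taken), and C11 atomics stand in for the PowerPC
compare_and_swap macro.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified lock word: 0 = free, 1 = taken, no waiters.
   The real linuxthreads lock also encodes a wait queue in its status. */
typedef struct { atomic_long status; } sketch_lock;

/* Try to take the lock with a single compare-and-swap (0 -> 1). */
static bool sketch_trylock(sketch_lock *l)
{
    long expected = 0;
    return atomic_compare_exchange_strong(&l->status, &expected, 1);
}

/* Release the lock with a compare-and-swap (1 -> 0).
   Returns false if the status was not 1, i.e. the release did not
   take effect and must be retried or handled by the caller. */
static bool sketch_unlock(sketch_lock *l)
{
    long expected = 1;
    return atomic_compare_exchange_strong(&l->status, &expected, 0);
}
```

In this model, a release whose compare-and-swap fails without being
retried leaves the status at 1 with no owner, which is the state
observed in the hangs.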

I don't know which of these is the true explanation, because when I
build a static pthread library that traces the lock and unlock
operations on the malloc/free lock and use it with the same Ohphone
version, no hang happens, even if I open and close a telephony session
thirty or forty times.
It is true that the traced pthread library is built from the penguinppc
sources for glibc-2.2.3, and that the untraced build from those sources
does not hang with the same Ohphone version either, but I believe this
is only a timing effect: different code is generated, while the shared
libc library is the same in all cases.
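
Since the hang disappears once tracing is added, a tracer that disturbs
timing as little as possible may be worth trying. The following is a
minimal sketch under that assumption: events go into a fixed in-memory
ring buffer (no I/O, no malloc) that can be inspected from gdb after a
hang. All names here (trace_record, trace_balance, TRACE_SLOTS) are
hypothetical and not part of linuxthreads.

```c
#include <stdatomic.h>

#define TRACE_SLOTS 1024u   /* power of two so the index can be masked */

enum trace_op { TRACE_LOCK = 1, TRACE_UNLOCK = 2 };

struct trace_event {
    unsigned long thread_id;  /* caller-supplied thread identifier */
    const void   *lock;       /* address of the lock being traced */
    int           op;         /* TRACE_LOCK or TRACE_UNLOCK */
};

static struct trace_event trace_buf[TRACE_SLOTS];
static atomic_uint trace_next;  /* total number of events recorded */

/* Record one event: one atomic increment and three stores. */
static void trace_record(unsigned long thread_id, const void *lock, int op)
{
    unsigned slot = atomic_fetch_add(&trace_next, 1u) & (TRACE_SLOTS - 1u);
    trace_buf[slot].thread_id = thread_id;
    trace_buf[slot].lock = lock;
    trace_buf[slot].op = op;
}

/* Lock events minus unlock events for one lock, over the events still
   held in the buffer. Meant to be called once the threads are stopped. */
static long trace_balance(const void *lock)
{
    unsigned total = atomic_load(&trace_next);
    unsigned n = total < TRACE_SLOTS ? total : TRACE_SLOTS;
    long balance = 0;
    for (unsigned i = 0; i < n; i++) {
        if (trace_buf[i].lock != lock)
            continue;
        balance += (trace_buf[i].op == TRACE_LOCK) ? 1 : -1;
    }
    return balance;
}
```

Once the buffer has wrapped, the balance covers only the most recent
TRACE_SLOTS events, so it is a hint rather than a proof of a missed
unlock.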

Looking at each of the many hangs, it seems to me that something goes
wrong in the unlock operation: everything else is clean, the wait_node
queue on the lock is correct, and the first suspended thread waiting
for the lock found a lock status of 1.

I checked in the Linux kernel that on return to user mode from any
interrupt (internal, external, or software) there is a stwcx.
instruction that clears the reservation so that the compare_and_swap
macro works correctly. Each time there is a return to user mode, the
instruction sequence at the <restore> label is executed, the
reservation is cleared, and a sync is done by the following
instructions:

        PPC405_ERR77(0,r1)
        stwcx. r0,0,r1
        ..........
        SYNC
        PPC405_ERR77_SYNC
        RFI

I checked that the signal handlers in the pthread library and in the
Ohphone code do not use the compare_and_swap macro.

I don't believe that some stray code writes a 1 into the malloc lock
status by mistake; it would be very strange for it to always write the
same value to the same location, but anything is possible.

If the compare_and_swap macro works correctly, I don't see any way
for the pthread restart to fail.

The very strange thing in all the hangs is that the application always
hangs in the free and realloc functions, never in the malloc function.
Ohphone is a very heavy user of malloc/realloc/free.
It seems that an unlock is missed, which leaves free and realloc
suspended on the malloc lock, while a call to malloc in the same
situation never waits for the lock.
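
One possible reading of this asymmetry, assuming the malloc in this
libc follows the ptmalloc multi-arena design: the allocation path only
tries an arena's lock and moves on to another arena when it is busy,
while free and realloc must block on the lock of the arena that owns
the chunk. Under that assumption, a lock stuck at 1 would stop free
and realloc but not malloc. The sketch below illustrates the idea with
plain pthread mutexes; the arena_* names are hypothetical and the code
is not taken from glibc.

```c
#include <errno.h>
#include <pthread.h>

#define N_ARENAS 2

/* Hypothetical arenas; a real allocator keeps heap state next to
   each mutex. */
static pthread_mutex_t arena_lock[N_ARENAS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

/* Allocation path: take the first arena whose lock is free and never
   block. Returns the index of the arena now locked by the caller,
   or -1 if every arena is busy. */
static int arena_acquire_for_alloc(void)
{
    for (int i = 0; i < N_ARENAS; i++)
        if (pthread_mutex_trylock(&arena_lock[i]) == 0)
            return i;
    return -1;  /* a real allocator could create a new arena here */
}

/* Release path: the chunk belongs to exactly one arena, so the caller
   has no alternative and blocks on that arena's lock. */
static void arena_acquire_for_free(int owner)
{
    pthread_mutex_lock(&arena_lock[owner]);
}

static void arena_release(int i)
{
    pthread_mutex_unlock(&arena_lock[i]);
}
```

If arena 0 is stuck in the locked state, arena_acquire_for_alloc()
quietly returns arena 1, whereas arena_acquire_for_free(0) would wait
forever, which matches the pattern of hangs in free and realloc only.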

I may be running short of memory (Ohphone is very, very big), but I
would not expect that to cause such a disastrous hang.

gdb session traces of some of the waiting threads are available.

Anyway, does anyone have any idea why these hangs always involve the
malloc/free functions, with a lot of threads and the normal scheduling
policy?

roberto

