Indefinite hang in getaddrinfo / check_pf / make_request


Hello,

We are having issues where our applications (so far reproduced with both Node.js and Mono) end up
completely stuck inside libc while calling into the kernel netlink interface.

When they get stuck, they have a characteristic stack trace.  Taken from a Node process:

#0  0x00007fd7d8d214ad in recvmsg () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fd7d8d3e44d in make_request (fd=fd@entry=13, pid=1) at ../sysdeps/unix/sysv/linux/check_pf.c:177
#2  0x00007fd7d8d3e9a4 in __check_pf (seen_ipv4=seen_ipv4@entry=0x7fd7d37fdd00, seen_ipv6=seen_ipv6@entry=0x7fd7d37fdd10, 
    in6ai=in6ai@entry=0x7fd7d37fdd40, in6ailen=in6ailen@entry=0x7fd7d37fdd50) at ../sysdeps/unix/sysv/linux/check_pf.c:341
#3  0x00007fd7d8cf64e1 in __GI_getaddrinfo (name=0x31216e0 "mesos-slave4-prod-uswest2.otsql.opentable.com", service=0x0, 
    hints=0x31216b0, pai=0x31f09e8) at ../sysdeps/posix/getaddrinfo.c:2355
#4  0x0000000000e101c8 in uv__getaddrinfo_work (w=0x31f09a0) at ../deps/uv/src/unix/getaddrinfo.c:102
#5  0x0000000000e09179 in worker (arg=<optimized out>) at ../deps/uv/src/threadpool.c:91
#6  0x0000000000e16eb1 in uv__thread_start (arg=<optimized out>) at ../deps/uv/src/unix/thread.c:49
#7  0x00007fd7d8ff3182 in start_thread (arg=0x7fd7d37fe700) at pthread_create.c:312
#8  0x00007fd7d8d2047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

The recvmsg call never returns.
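For anyone less familiar with this code path: as far as I understand it, make_request opens a NETLINK_ROUTE socket, sends an RTM_GETADDR dump request, and then loops on a blocking recvmsg until the kernel sends NLMSG_DONE. A rough standalone sketch of that pattern (my own simplification for illustration, not the actual glibc source) looks like this:

/* Rough standalone sketch of the netlink address dump that make_request
 * performs -- my own simplification, not the actual glibc code. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

int main(void)
{
    int fd = socket(PF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK };
    if (bind(fd, (struct sockaddr *) &nladdr, sizeof(nladdr)) < 0) {
        perror("bind");
        return 1;
    }

    /* Dump request for all interface addresses. */
    struct {
        struct nlmsghdr nlh;
        struct rtgenmsg g;
    } req = {
        .nlh = {
            .nlmsg_len   = NLMSG_LENGTH(sizeof(struct rtgenmsg)),
            .nlmsg_type  = RTM_GETADDR,
            .nlmsg_flags = NLM_F_ROOT | NLM_F_MATCH | NLM_F_REQUEST,
            .nlmsg_seq   = time(NULL),
        },
        .g = { .rtgen_family = AF_UNSPEC },
    };

    if (send(fd, &req, req.nlh.nlmsg_len, 0) < 0) {
        perror("send");
        return 1;
    }

    /* Read until the kernel signals the end of the dump.  No receive
       timeout is set, so if NLMSG_DONE never arrives (or the replies get
       discarded, e.g. on a sequence-number mismatch), recvmsg() blocks
       forever -- which is what frame #0 in the trace above shows. */
    char buf[4096];
    int done = 0;
    while (!done) {
        struct iovec iov = { buf, sizeof(buf) };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
        ssize_t n = recvmsg(fd, &msg, 0);
        if (n <= 0)
            break;

        for (struct nlmsghdr *nlh = (struct nlmsghdr *) buf;
             NLMSG_OK(nlh, n);
             nlh = NLMSG_NEXT(nlh, n))
            if (nlh->nlmsg_type == NLMSG_DONE)
                done = 1;
    }

    close(fd);
    puts(done ? "address dump completed" : "address dump did not complete");
    return 0;
}

Because that socket has no receive timeout, a reply that never arrives (or is thrown away) leaves the loop parked in recvmsg indefinitely, which matches exactly what we observe.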

We found the following issue:
https://sourceware.org/bugzilla/show_bug.cgi?id=15946

This bug matches our symptoms perfectly.

However, we are running a libc that has the patch applied!

ii  libc6:amd64  2.19-0ubuntu6.6  amd64  Embedded GNU C Library: Shared libraries

https://bugs.launchpad.net/ubuntu/+source/eglibc/+bug/1328975

So now I'm confused: we are still seeing the symptoms even though the patch is applied.

Once this hang happens, eventually all threads in the process end up blocked trying to take the check_pf lock,
and there is no recourse but to kill the process.

Is it possible there is another race condition or other error here?  How could we have so many processes
getting stuck here?  What diagnostics might I run to get a better fix on the problem?
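
One diagnostic I'm considering (my own idea, so treat it as an assumption rather than anything blessed by glibc): run the same RTM_GETADDR dump as in the sketch above from a standalone binary on an affected host, both inside and outside the container, but with a receive timeout on the socket so it cannot wedge. If that standalone dump also fails to complete, the problem would seem to be on the kernel/netlink side rather than a new glibc race. The only addition needed is something like:

/* Diagnostic helper: give the netlink socket a receive timeout so a missing
 * NLMSG_DONE shows up as EAGAIN instead of an indefinite hang.  (This only
 * helps in a standalone test program; the socket glibc opens inside
 * __check_pf is not reachable from the application.) */
#include <sys/socket.h>
#include <sys/time.h>

static int set_recv_timeout(int fd, int seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
}

Called on the fd right after socket() in the sketch above, a recvmsg() that then fails with EAGAIN would at least confirm that the dump really never completes on these hosts.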

We run a vanilla 4.0.4 kernel. These processes run inside a Docker container (1.7.1), but with networking in "host" mode, which hopefully means there is no separate network namespace that might be interfering.

Thank you for any advice; this issue is driving us crazy!

