This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug libc/17802] New: The DNS resolver gets stuck when one nameserver is down
- From: "luto at mit dot edu" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Tue, 06 Jan 2015 01:56:09 +0000
- Subject: [Bug libc/17802] New: The DNS resolver gets stuck when one nameserver is down
- Auto-submitted: auto-generated
https://sourceware.org/bugzilla/show_bug.cgi?id=17802
Bug ID: 17802
Summary: The DNS resolver gets stuck when one nameserver is down
Product: glibc
Version: 2.19
Status: NEW
Severity: normal
Priority: P2
Component: libc
Assignee: unassigned at sourceware dot org
Reporter: luto at mit dot edu
CC: drepper.fsp at gmail dot com
I have a production deployment. resolv.conf contains:
nameserver A
nameserver B
gai.conf has only comments (default Ubuntu config). nsswitch.conf is the
default, which contains:
hosts: files mdns4_minimal [NOTFOUND=return] dns
I have long-running programs that use getaddrinfo to resolve the same hostname
over and over. They are written in Python and make this call:
socket.getaddrinfo('host.name.com', '22', socket.AF_UNSPEC, socket.SOCK_STREAM)
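For anyone trying to reproduce this, a minimal sketch of such a long-running caller might look like the following. The helper names (timed_lookup, watch), the 30 second interval, and the injectable resolver argument are assumptions made for illustration; the actual services are not shown in this report.

```python
import socket
import time


def timed_lookup(host, port, resolver=socket.getaddrinfo):
    """Resolve once; return (elapsed_seconds, results_or_None, error_or_None)."""
    start = time.monotonic()
    try:
        # Same arguments as the call quoted in the report.
        results = resolver(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
        error = None
    except socket.gaierror as exc:
        results, error = None, exc
    elapsed = time.monotonic() - start
    return elapsed, results, error


def watch(host, port, interval=30.0, rounds=None, resolver=socket.getaddrinfo):
    """Repeat the lookup in one process, printing duration and outcome each time."""
    count = 0
    while rounds is None or count < rounds:
        elapsed, results, error = timed_lookup(host, port, resolver)
        if error is None:
            status = "ok (%d results)" % len(results)
        else:
            status = "FAILED: %s" % error
        print("%.3fs %s" % (elapsed, status))
        count += 1
        if rounds is None or count < rounds:
            time.sleep(interval)
```

Running watch('host.name.com', '22') in a single process and then taking nameserver A down should show whether lookups start failing after roughly 20 seconds each, as in the capture. The resolver parameter only exists so the loop can be exercised without network access.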
When nameserver A goes down, which happens every now and then, one or more of
the long-running services frequently gets stuck: that getaddrinfo call then
fails every time. While this is happening, a packet capture shows this
sequence of events:
t = 0 seconds: query to A
t = 5.005 seconds: query to B
t = 5.053153 seconds: response from B (looks valid to me)
t = 5.053208 seconds: new query to A
t = 10.058251 seconds: new query to B
t = 10.076853 seconds: response from B
t = 10.076908 seconds: query for host.name.com.my.domain.name to A
t = 15.080450 seconds: query for host.name.com.my.domain.name to B
t = 15.099330 seconds: NXDOMAIN from B
t = 15.099387 seconds: query for host.name.com.my.domain.name to A
t = 20.104418 seconds: query for host.name.com.my.domain.name to B
t = 20.123345 seconds: NXDOMAIN from B
At this point, I get "gaierror: [Errno -2] Name or service not known" back from
Python.
From the timing, it looks like glibc is treating the valid responses from B
as failures.
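To make the timing argument concrete, here is a small sketch that computes the gaps between consecutive events in the capture. The timestamps are transcribed from the capture in this report; the tuple layout, the "query"/"reply" labels (NXDOMAIN is counted as a reply), and the deltas helper are choices made for this sketch only.

```python
# (seconds, kind, server) transcribed from the packet capture in this report.
CAPTURE = [
    (0.0, "query", "A"),
    (5.005, "query", "B"),
    (5.053153, "reply", "B"),
    (5.053208, "query", "A"),
    (10.058251, "query", "B"),
    (10.076853, "reply", "B"),
    (10.076908, "query", "A"),   # first query for host.name.com.my.domain.name
    (15.080450, "query", "B"),
    (15.099330, "reply", "B"),   # NXDOMAIN
    (15.099387, "query", "A"),
    (20.104418, "query", "B"),
    (20.123345, "reply", "B"),   # NXDOMAIN
]


def deltas(capture, first, second):
    """Gaps (seconds) between adjacent events where `first` is followed by `second`."""
    out = []
    for (t0, kind0, srv0), (t1, kind1, srv1) in zip(capture, capture[1:]):
        if (kind0, srv0) == first and (kind1, srv1) == second:
            out.append(round(t1 - t0, 6))
    return out


# How long the resolver waited on A before falling back to B.
waits_on_a = deltas(CAPTURE, ("query", "A"), ("query", "B"))
# How quickly B answered each query.
b_latency = deltas(CAPTURE, ("query", "B"), ("reply", "B"))
# How soon after B's reply the resolver went back to A.
requery_delay = deltas(CAPTURE, ("reply", "B"), ("query", "A"))

print("waits on A:   ", waits_on_a)
print("B latency:    ", b_latency)
print("re-query gap: ", requery_delay)
```

Each wait on A is about 5 seconds, which is consistent with the default resolver timeout documented in resolv.conf(5), assuming no "options timeout" override. B answers within tens of milliseconds, and the next query to A follows B's reply in well under a millisecond, so the retry appears to be triggered by the reply itself rather than by a timeout.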
If I restart the service with nameserver A still down, everything works (it
tries A, then tries B 5 seconds later and accepts the answer).
The system in question does not use iptables.
I can't consistently reproduce this, unfortunately. I can try to run different
diagnostics the next time it happens, though.