This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.


[Bug libc/17802] New: The DNS resolver gets stuck when one nameserver is down


https://sourceware.org/bugzilla/show_bug.cgi?id=17802

            Bug ID: 17802
           Summary: The DNS resolver gets stuck when one nameserver is
                    down
           Product: glibc
           Version: 2.19
            Status: NEW
          Severity: normal
          Priority: P2
         Component: libc
          Assignee: unassigned at sourceware dot org
          Reporter: luto at mit dot edu
                CC: drepper.fsp at gmail dot com

I have a production deployment.  resolv.conf contains:

nameserver A
nameserver B

gai.conf has only comments (default Ubuntu config).  nsswitch.conf is the
default, which contains:

hosts:          files mdns4_minimal [NOTFOUND=return] dns

I have long-running programs that call getaddrinfo to resolve the same
hostname over and over.  They are written in Python and make this call:

socket.getaddrinfo('host.name.com', '22', socket.AF_UNSPEC, socket.SOCK_STREAM)
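
To tell a slow-but-successful lookup from one that fails outright, the call
above can be wrapped so that each attempt records how long it took and whether
it raised gaierror.  This is only a minimal sketch: timed_lookup and its
resolve parameter are hypothetical names introduced here, not part of the
affected services.

```python
import socket
import time


def timed_lookup(host, port, resolve=socket.getaddrinfo):
    """Resolve host once and return (elapsed_seconds, result_or_error).

    `resolve` defaults to socket.getaddrinfo; it is a parameter only so the
    wrapper can be exercised without a live nameserver (an assumption of
    this sketch, not something the original services do).
    """
    start = time.monotonic()
    try:
        # Same arguments as the call in the report.
        outcome = resolve(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Keep the error object so the caller can log its errno and message.
        outcome = exc
    return time.monotonic() - start, outcome
```

Calling timed_lookup('host.name.com', '22') in the services' existing retry
loop and logging both values would show whether a process is in the failing
state (about 20 seconds followed by gaierror) or the working state (about 5
seconds followed by a result list).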

When nameserver A goes down, which happens every now and then, one or more of
the long-running services will frequently start failing that getaddrinfo call
on every attempt.  While this is happening, a packet capture shows this
sequence of events:

t = 0 seconds: query to A
t = 5.005 seconds: query to B
t = 5.053153 seconds: response from B (looks valid to me)
t = 5.053208 seconds: new query to A
t = 10.058251 seconds: new query to B
t = 10.076853 seconds: response from B
t = 10.076908 seconds: query for host.name.com.my.domain.name to A
t = 15.080450 seconds: query for host.name.com.my.domain.name to B
t = 15.099330 seconds: NXDOMAIN from B
t = 15.099387 seconds: query for host.name.com.my.domain.name to A
t = 20.104418 seconds: query for host.name.com.my.domain.name to B
t = 20.123345 seconds: NXDOMAIN from B

At this point, I get "gaierror: [Errno -2] Name or service not known" back from
Python.

From the timing, it looks like glibc is considering the actual valid responses
from B to be failures.
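
The timing argument can be checked with a little arithmetic on the capture.
The sketch below copies the timestamps from the trace into a list (the short
event labels and the function names are made up for illustration) and
computes two kinds of gaps: the delay from each query to A until the fallback
query to B, and the delay from each reply from B until the next query to A.

```python
# (timestamp in seconds, event) pairs copied from the packet capture.
# The event labels are shorthand invented for this sketch.
TRACE = [
    (0.0, "query A"),
    (5.005, "query B"),
    (5.053153, "response B"),
    (5.053208, "query A"),
    (10.058251, "query B"),
    (10.076853, "response B"),
    (10.076908, "query A"),
    (15.080450, "query B"),
    (15.099330, "nxdomain B"),
    (15.099387, "query A"),
    (20.104418, "query B"),
    (20.123345, "nxdomain B"),
]


def failover_gaps(trace):
    """Delay between each query to A and the next query to B."""
    gaps = []
    pending_a = None
    for t, event in trace:
        if event == "query A":
            pending_a = t
        elif event == "query B" and pending_a is not None:
            gaps.append(t - pending_a)
            pending_a = None
    return gaps


def reply_to_retry_gaps(trace):
    """Delay between each reply from B and an immediately following query to A."""
    gaps = []
    for (t_prev, prev), (t_next, nxt) in zip(trace, trace[1:]):
        if prev in ("response B", "nxdomain B") and nxt == "query A":
            gaps.append(t_next - t_prev)
    return gaps


for gap in failover_gaps(TRACE):
    print(f"A -> B failover after {gap:.6f} s")
for gap in reply_to_retry_gaps(TRACE):
    print(f"B reply -> next query to A after {gap * 1e6:.0f} us")
```

Every failover gap comes out at roughly 5 seconds, which is consistent with
the resolver's default per-server timeout (the "timeout" option described in
resolv.conf(5)).  Every reply from B except the last is followed by a new
query to A within about 55 to 57 microseconds, which is what suggests the
replies are being discarded rather than lost in transit.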

If I restart the service with nameserver A still down, everything works (it
tries A, then tries B 5 seconds later and accepts the answer).

The system in question does not use iptables.

I can't consistently reproduce this, unfortunately.  I can try to run different
diagnostics the next time it happens, though.


