This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.
- From: Steven Schlansker <stevenschlansker at gmail dot com>
- To: Paul Pluzhnikov <ppluzhnikov at google dot com>, Adhemerval Zanella <adhemerval dot zanella at linaro dot org>
- Cc: "libc-help at sourceware dot org" <libc-help at sourceware dot org>
- Date: Tue, 29 Sep 2015 15:05:02 -0700
- Subject: Re: Indefinite hang in getaddrinfo / check_pf / make_request
- References: <530A3DC3-298A-4D7B-9736-5EF7575B51CB at gmail dot com> <CALoOobPjfRqzzbaZ35ibB1jyZ6yZsG5YGr2YYqOJjADG-oQYXA at mail dot gmail dot com> <BFF025C7-AAFF-401E-A5C2-E40C880DAEC0 at gmail dot com> <35B15C41-3A44-4CE6-BFA2-55B85EB396A4 at gmail dot com>
On Sep 24, 2015, at 11:36 AM, Steven Schlansker <stevenschlansker@gmail.com> wrote:
>
> On Sep 22, 2015, at 9:59 PM, Steven Schlansker <stevenschlansker@gmail.com> wrote:
>
>>
>>> On Sep 22, 2015, at 9:04 PM, Paul Pluzhnikov <ppluzhnikov@google.com> wrote:
>>>
>>> On Tue, Sep 22, 2015 at 8:53 PM, Steven Schlansker
>>> <stevenschlansker@gmail.com> wrote:
>>>
>>>> We found the following issue:
>>>> https://sourceware.org/bugzilla/show_bug.cgi?id=15946
>>>
>>> You may be seeing https://sourceware.org/bugzilla/show_bug.cgi?id=12926 instead.
>>>
>>> See if that patch has been applied to your sources as well.
>>
>> Thanks for finding this. While that fix is not applied to our deployed version,
>> I think the symptoms are slightly different.
>
> Thanks Paul and Adhemerval for the advice. I believe I have evidence that this is
> not the same issue as either 15946 or 12926.
> ...
>
> I am going to spend some time trying to distill down a test case that just exercises the check_pf code and see if I can reproduce in isolation.
> In the meantime, does anyone have any ideas for further diagnostics that would be useful? I'm not sure how to check the kernel side of the netlink socket effectively,
> to see if it actually tried to reply or not...
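One way to check whether the kernel replied at all, without instrumenting the kernel side, is to send the same kind of request from a small standalone program and wait with a timeout instead of blocking. The sketch below assumes, based on glibc's check_pf.c, that make_request sends an RTM_GETADDR dump request over a NETLINK_ROUTE socket and then blocks in recvmsg until the dump completes. The function probe_getaddr_dump and its return codes are hypothetical and are not part of glibc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical diagnostic, not glibc code: send one RTM_GETADDR dump
   request (roughly what check_pf asks the kernel for) and wait for the
   reply with a timeout instead of blocking forever.
   Returns 1 if the dump finished (NLMSG_DONE), 0 if poll() timed out
   while waiting, and -1 on a socket or netlink error. */
int probe_getaddr_dump(int timeout_ms)
{
    /* A negative timeout would make poll() wait forever, defeating the probe. */
    assert(timeout_ms >= 0);

    int fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
    if (fd < 0)
        return -1;

    struct {
        struct nlmsghdr nlh;
        struct rtgenmsg gen;
    } req;
    memset(&req, 0, sizeof req);
    req.nlh.nlmsg_len = sizeof req;
    req.nlh.nlmsg_type = RTM_GETADDR;
    req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    req.nlh.nlmsg_seq = 1;
    req.gen.rtgen_family = AF_UNSPEC;

    struct sockaddr_nl kernel;           /* nl_pid == 0 addresses the kernel */
    memset(&kernel, 0, sizeof kernel);
    kernel.nl_family = AF_NETLINK;
    if (sendto(fd, &req, sizeof req, 0,
               (struct sockaddr *)&kernel, sizeof kernel) < 0) {
        close(fd);
        return -1;
    }

    union {
        struct nlmsghdr hdr;             /* forces nlmsghdr alignment */
        char raw[32768];
    } buf;
    int result = -1;
    int finished = 0;
    while (!finished) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int pr = poll(&pfd, 1, timeout_ms);
        if (pr == 0) {                   /* nothing arrived in time */
            result = 0;
            break;
        }
        if (pr < 0) {
            if (errno == EINTR)
                continue;
            break;                       /* result stays -1 */
        }

        ssize_t n = recv(fd, buf.raw, sizeof buf.raw, 0);
        if (n <= 0)
            break;

        int len = (int)n;
        for (struct nlmsghdr *nh = &buf.hdr; NLMSG_OK(nh, len);
             nh = NLMSG_NEXT(nh, len)) {
            if (nh->nlmsg_type == NLMSG_DONE) {
                result = 1;
                finished = 1;
                break;
            }
            if (nh->nlmsg_type == NLMSG_ERROR) {
                finished = 1;            /* result stays -1 */
                break;
            }
            /* RTM_NEWADDR records are skipped: only completion matters here. */
        }
    }
    close(fd);
    return result;
}
```

Running many copies of this probe in parallel and watching for a 0 result is one way to see whether a dump reply ever went missing, while the process itself stays responsive.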
Hello again, in case anyone stumbles across this in the future:
I wrote a test case and narrowed the problem down further. It seems to
be caused by incorrect kernel handling of netlink sockets, which can
lose messages under contention. See:
https://lkml.org/lkml/2015/9/24/712
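The actual test case is in the LKML thread and is not reproduced here. As a rough illustration of the kind of contention involved, the sketch below calls getaddrinfo from several threads at once with AI_ADDRCONFIG set, which is assumed here (based on glibc's getaddrinfo.c) to make each call query the kernel's address list over netlink via check_pf. The names run_stress, resolve_loop and MAX_THREADS, as well as the thread and iteration counts, are hypothetical, and this sketch is not claimed to trigger the kernel bug reliably.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <netdb.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>

#define MAX_THREADS 64

/* Per-thread state for the hypothetical stress driver below. */
struct worker_args {
    int iterations;
    long completed;       /* getaddrinfo calls that returned, success or not */
};

static void *resolve_loop(void *p)
{
    struct worker_args *args = p;
    struct addrinfo hints;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    /* AI_NUMERICHOST avoids DNS; AI_ADDRCONFIG is assumed to route the call
       through glibc's interface check (check_pf) on every iteration. */
    hints.ai_flags = AI_ADDRCONFIG | AI_NUMERICHOST;

    for (int i = 0; i < args->iterations; i++) {
        struct addrinfo *res = NULL;
        if (getaddrinfo("127.0.0.1", "80", &hints, &res) == 0)
            freeaddrinfo(res);
        /* If a netlink reply were lost, this thread would stay blocked inside
           getaddrinfo and never reach this increment. */
        args->completed++;
    }
    return NULL;
}

/* Runs nthreads concurrent resolver loops and returns how many calls came
   back in total, or -1 for an out-of-range thread count. A hang here, rather
   than a wrong count, is the symptom being hunted. */
long run_stress(int nthreads, int iterations)
{
    pthread_t tids[MAX_THREADS];
    struct worker_args args[MAX_THREADS];
    if (nthreads <= 0 || nthreads > MAX_THREADS)
        return -1;

    for (int t = 0; t < nthreads; t++) {
        args[t].iterations = iterations;
        args[t].completed = 0;
        int rc = pthread_create(&tids[t], NULL, resolve_loop, &args[t]);
        assert(rc == 0);
        (void)rc;
    }

    long total = 0;
    for (int t = 0; t < nthreads; t++) {
        int rc = pthread_join(tids[t], NULL);
        assert(rc == 0);
        (void)rc;
        total += args[t].completed;
    }
    return total;
}
```

Build with `-pthread`. A process that never returns from run_stress, with a thread stuck in getaddrinfo / check_pf / make_request, would match the symptom in the subject line.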
Kernel 4.0.4 is known to be affected. We're testing 4.0.9
in the hope that it is not. So this is in fact a new bug, albeit
not a glibc bug.
Thank you for your time.