FS#47249 : [glibc] GNU DNS resolver fails to resolve for unknown reasons

Attached to Project: Arch Linux
Opened by Steffen Nurpmeso (sdaoden) - Thursday, 03 December 2015, 14:32 GMT
Last edited by Allan McRae (Allan) - Monday, 16 May 2016, 12:36 GMT

Task Type	Bug Report
Category	Packages: Core
Status	Closed
Assigned To	Allan McRae (Allan)
Architecture	All
Severity	Low
Priority	Normal
Reported Version
Due in Version	Undecided
Due Date	Undecided
Percent Complete
Votes	0
Private	No

Details

Description:

I have seen this ever since i started working regularly on Arch. The problem shows up like this:

s-nail: Lookup of "pop.yandex.ru:pop3s" failed: Name or service not known
s-nail: Trying standard protocol port "995"
If that succeeds consider including the port in the URL!
^\
=== Command detached from window (Wed Dec 2 19:17:45 2015) ===

In a freshly started program the first DNS lookup may fail -- if that happens, all further lookups will fail too. If the first lookup succeeds, the problem won't occur at all.
This makes me think the GNU DNS resolver fails to deal with signals properly. I'll try to add a better work-around for S-nail v14.9 (the last one obviously didn't work), but regardless of that it's a problem of the resolver anyway.
For the above i have a stack trace:

Dec 02 19:17:45 wales systemd-coredump[27169]: Process 30549 (s-nail) of user 1000 dumped core.

Stack trace of thread 30549:
#0 0x00007f0d64f3a170 poll (/usr/lib/libc-2.22.so)
#1 0x00007f0d63619c81 n/a (/usr/lib/libresolv-2.22.so)
#2 0x00007f0d63617c22 __libc_res_nquery (/usr/lib/libresolv-2.22.so)
#3 0x00007f0d63618265 n/a (/usr/lib/libresolv-2.22.so)
#4 0x00007f0d63618721 __libc_res_nsearch (/usr/lib/libresolv-2.22.so)
#5 0x00007f0d62bd7ae9 _nss_dns_gethostbyname4_r (/usr/lib/libnss_dns-2.22.so)
#6 0x00007f0d64f2dff1 gaih_inet (/usr/lib/libc-2.22.so)
#7 0x00007f0d64f2ff9e getaddrinfo (/usr/lib/libc-2.22.so)
#8 0x00000000080df29e n/a (n/a)

Additional info:
* package version(s)
* config and/or log files etc.

My system is current.

Steps to reproduce:

I have no idea.
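
Editorial sketch, not a confirmed reproducer: the description says that once the first lookup in a process fails, all later ones fail too. One way to watch for that pattern is to call getaddrinfo() several times from a single process and log each result. The helper name count_lookup_failures and its parameters are made up for illustration; only getaddrinfo(), gai_strerror() and freeaddrinfo() are real library calls.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: resolve host/service `attempts` times within one
 * process and return how many attempts failed. If first_rc is not NULL it
 * receives the getaddrinfo() result of the first attempt. */
int count_lookup_failures(const char *host, const char *service,
                          int attempts, int *first_rc)
{
    struct addrinfo hints, *res;
    int failures = 0;
    int i;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    for (i = 0; i < attempts; ++i) {
        int rc = getaddrinfo(host, service, &hints, &res);

        if (i == 0 && first_rc != NULL)
            *first_rc = rc;
        if (rc != 0) {
            /* Log every failure so a "first fails, all fail" run is visible. */
            fprintf(stderr, "attempt %d: %s\n", i + 1, gai_strerror(rc));
            ++failures;
        } else {
            freeaddrinfo(res);
        }
    }
    return failures;
}
```

Calling it with the failing host and a handful of attempts would show whether failures cluster after a bad first lookup; it does not explain why they do.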

Closed by Allan McRae (Allan)
Monday, 16 May 2016, 12:36 GMT
Reason for closing: None
Additional comments about closing: "fixed"

Comment by Christian Hesse (eworm) - Thursday, 03 December 2015, 14:45 GMT

Any DNS caching software in action? dnsmasq? pdnsd?

Comment by Steffen Nurpmeso (sdaoden) - Thursday, 03 December 2015, 19:36 GMT

No. This is a VM, however, with NAT and port forwarding of SSH only. But i wonder a bit why this should matter? I mean yes, this terrible internet provider really has problems with DNS and i sometimes have to disconnect and reconnect to get proper DNS, but then the host can resolve and the VM can resolve, whereas getaddrinfo(3) via S-nail may behave as above. I suspect that the GNU resolver either does not isolate itself against inherited signal settings, or sets up some first-time state and then can't recover if the resulting state is not proper. E.g., even if there is only one known DNS server and that fails to respond, it should be asked again since there is no other one to ask; QOS doesn't matter there. Just an example. I really have no idea. But the problem does happen. :(

Comment by Christian Hesse (eworm) - Friday, 04 December 2015, 07:05 GMT

I once had similar symptoms with the pdnsd caching daemon in between. So this is probably unrelated.

For completeness, here are the details of my issue:
The client sends A and AAAA queries over UDP. If either response comes back truncated, it sends a TCP query containing both the A *and* the AAAA query. By default pdnsd answers just one query, breaking name resolution. I prepared a patch, but that did not (yet) get applied upstream. Sadly pdnsd is maintained pretty badly, and I switched to dnsmasq.
https://github.com/eworm-de/pdnsd/commit/8511b2a97c04ee235b0799db47fa420bd4a23b87
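
Editorial sketch of the two wire-level facts this comment relies on, assuming the standard RFC 1035 message format: a truncated reply is marked by the TC bit in the third header byte, and each DNS message sent over TCP is prefixed with its length as a 16-bit big-endian integer. The function names dns_is_truncated and dns_tcp_frame are made up for illustration; this is neither glibc nor pdnsd code.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the DNS message has the TC (truncated) flag set, 0 if it
 * does not, and -1 if the buffer is too short to hold the 12-byte header. */
int dns_is_truncated(const unsigned char *msg, size_t len)
{
    if (len < 12)
        return -1;
    return (msg[2] & 0x02) ? 1 : 0;
}

/* Copies msg into out, prefixed with its length as two big-endian bytes,
 * which is how a DNS message is framed on a TCP connection.
 * Returns the number of bytes written, or 0 if it does not fit. */
size_t dns_tcp_frame(const unsigned char *msg, size_t len,
                     unsigned char *out, size_t outlen)
{
    if (len > 0xffff || outlen < len + 2)
        return 0;
    out[0] = (unsigned char)(len >> 8);
    out[1] = (unsigned char)(len & 0xff);
    memcpy(out + 2, msg, len);
    return len + 2;
}
```

A client that sends both the A and the AAAA query over one TCP connection would frame each query separately this way, and the server is expected to answer each of them.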

Comment by Steffen Nurpmeso (sdaoden) - Friday, 04 December 2015, 10:22 GMT

Hi. Hm, i don't think this is related. Though the resolver has to use what it gets?

The referenced commit limits TCP connection reusability? I haven't actually traced the network I/O that causes the above issue, but i don't think that TCP is at all involved. I've added the stack trace only due to frustration, and to show where we are QoSing. I.e., if it happens to happen, _all_ further requests fail. This is what this issue is about. I need to quit and restart S-nail because getaddrinfo(3) will never succeed again, even though other programs in the VM will be able to resolve.

Regarding truncation, our own DNS resolver treats each request by itself.

in PckParse
// if trunc and trunc not allowed, use TCP with same server
// though "truncated datagrams are usually correct if just any entry
// is present in AUTHORITY". we do so because of
// 1) EDNS (OPT always in additional)
// 2) a possibly set security hook (TKEY/TSIG always in additional)
// 3) RFC 1123 6.1.3.2 says that we must not cache RRs of such packets
// in a way that the fact that they came from truncated packet is
// lost. in addition "mailers MUST NOT use a truncated MX response
// at all due to the fact that this could lead to mail loops".
// ('could take care of *where* truncation occurs here, though.
// but this interferes with 2) again..
// in addition RFC 2181, 9, says that TC should not be set for
// truncation in ADDITIONAL.
// well, evolving standards are one thing AND the other one, too...)
ret = pckp_needs_tcp;
if(_qrb->pctx.ph->tc && !(QL(ql)->conf_flags & qf_conf_igntc)) {
    _LOG0(("\t<> Truncated and conf_igntc is not set..%R"));
    goto jout;
}

called by HdlPacket

case pckp_needs_tcp:
    // TCP only
    // - if not already
    // - if server is the current one (otherwise route too loaded)
    // - if TCP succeeds
    // now - what to do if TCP truncates???
    if(!_q && qrb.iscurr && a_NeedsTCP(&qrb))
        break;

and NeedsTCP starts with

_LOG0(("\t- Response is truncated and conf_igntc is not set.%R"
"\t (Or conf_igntc is set but truncation occurred before "
"answer RR.)%R"
"\t Trying a TCP connection with same server...%R"));

so then we enter TCP. But packets are individuals, anyway. :)
All of this is non-blocking and event-based via I/O monitors; the program has to be driven by an event loop.
And all this has nothing to do with this issue, of course :)

Comment by Dave Reisner (falconindy) - Friday, 04 December 2015, 13:59 GMT

> This makes me think the GNU DNS resolver fails to deal with signals properly.
What signal did it fail to handle? What would "properly" be?

Have you modified your /etc/nsswitch.conf? Could you attach it? The initial error you cite sounds more like a services lookup failure, not a DNS failure. The stack trace you include doesn't include the termination of the program, either.

Comment by Steffen Nurpmeso (sdaoden) - Friday, 04 December 2015, 17:37 GMT

I never looked into the GNU resolver code. Does it use child processes? Well, then it could be that S-nail consumes a SIGCHLD that the resolver is waiting for. Just an idea of mine. My thought was that this error is so dramatic that it would have been reported a billion times since September, so what does the old BSD Mail codebase do that could interfere with GNU resolving? Signal handling.

What would properly be? Well, software often takes into account an inherited signal state: if that is SIG_IGN it doesn't do anything. And see first paragraph, please.

Nah, like i said, it really is all standard. And yes it does? I explicitly sent the QUIT signal to force termination at the point where GNU ends up. And no it doesn't, it is just S-nail trying to make some sense from a fuzzy resolver error -- since the /etc/services file on ArchLinux knows about the POP3S protocol the above message looks wrong, and indeed on ArchLinux it doesn't make sense to retry the lookup with an explicit port specification. On other OSs with different /etc/services it does; and S-nail uses standard libraries, so parsing /etc/services to avoid the problem is not an option.

# Begin /etc/nsswitch.conf

passwd: files
group: files
shadow: files

publickey: files

hosts: files dns myhostname
networks: files

protocols: files
services: files
ethers: files
rpc: files

netgroup: files

# End /etc/nsswitch.conf
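
Editorial sketch of the retry described in this comment: look up the host with a service name first and, if that fails, retry with an explicit numeric port. This is not S-nail's actual code; the function name lookup_with_port_fallback and its parameters are made up for illustration, and retrying after any failure is a simplification.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: resolve host with a service name (e.g. "pop3s");
 * if that fails for any reason, retry once with a numeric port string
 * (e.g. "995"). Returns the getaddrinfo() result of the last attempt. */
int lookup_with_port_fallback(const char *host, const char *service,
                              const char *port, struct addrinfo **res)
{
    struct addrinfo hints;
    int rc;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    *res = NULL;
    rc = getaddrinfo(host, service, &hints, res);
    if (rc != 0) {
        /* AI_NUMERICSERV: treat the port as a number, skip the services
         * database entirely on the second attempt. */
        hints.ai_flags = AI_NUMERICSERV;
        *res = NULL;
        rc = getaddrinfo(host, port, &hints, res);
    }
    return rc;
}
```

If the host name itself cannot be resolved, the second attempt fails the same way, which matches the observation that the retry makes no sense on a system whose /etc/services already knows the service.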

Comment by Dave Reisner (falconindy) - Friday, 04 December 2015, 17:59 GMT

> I never looked into GNU resolver code. Does it uses child processes?
No, it doesn't. You can see exactly what it does if you run something like 'strace getent hosts www.google.com'. Essentially:

socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
sendto(3, "\307#\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\34\0\1", 32, MSG_NOSIGNAL, NULL, 0) = 32
poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
ioctl(3, FIONREAD, [60]) = 0
recvfrom(3, "\307#\201\200\0\1\0\1\0\0\0\0\3www\6google\3com\0\0\34\0\1"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, [16]) = 60
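
Editorial sketch that mirrors the system calls in the trace above (non-blocking UDP socket, connect, poll for POLLOUT, send, poll for POLLIN with a timeout, receive). It is not glibc's implementation: build_query and udp_exchange are hypothetical names, there is no retry or TCP fallback, and SOCK_NONBLOCK as a socket() flag is Linux-specific.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Encodes a one-question DNS query (RD flag set, class IN) into buf.
 * Returns the message length, or -1 on a bad name or a short buffer. */
int build_query(const char *name, uint16_t qtype, uint16_t id,
                unsigned char *buf, size_t buflen)
{
    size_t namelen = strlen(name);
    size_t pos = 12;

    /* 12-byte header, length-prefixed labels, root byte, type and class. */
    if (namelen == 0 || buflen < 12 + namelen + 2 + 4)
        return -1;
    memset(buf, 0, 12);
    buf[0] = (unsigned char)(id >> 8);
    buf[1] = (unsigned char)(id & 0xff);
    buf[2] = 0x01;                      /* RD: recursion desired */
    buf[5] = 1;                         /* QDCOUNT = 1 */
    while (*name != '\0') {
        const char *dot = strchr(name, '.');
        size_t len = dot ? (size_t)(dot - name) : strlen(name);

        if (len == 0 || len > 63)
            return -1;
        buf[pos++] = (unsigned char)len;
        memcpy(buf + pos, name, len);
        pos += len;
        name += len + (dot ? 1 : 0);
    }
    buf[pos++] = 0;                     /* root label */
    buf[pos++] = (unsigned char)(qtype >> 8);
    buf[pos++] = (unsigned char)(qtype & 0xff);
    buf[pos++] = 0;
    buf[pos++] = 1;                     /* class IN */
    return (int)pos;
}

/* Sends one query to server over UDP and waits up to timeout_ms for a
 * reply. Returns the reply length, or -1 on error or timeout. */
ssize_t udp_exchange(const struct sockaddr_in *server,
                     const unsigned char *query, size_t qlen,
                     unsigned char *reply, size_t replylen, int timeout_ms)
{
    struct pollfd pfd;
    ssize_t n = -1;
    int fd = socket(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, 0);

    if (fd < 0)
        return -1;
    pfd.fd = fd;
    if (connect(fd, (const struct sockaddr *)server, sizeof *server) == 0) {
        pfd.events = POLLOUT;
        if (poll(&pfd, 1, 0) == 1 &&
            send(fd, query, qlen, MSG_NOSIGNAL) == (ssize_t)qlen) {
            pfd.events = POLLIN;
            /* This is the poll() the stack trace in the report sits in. */
            if (poll(&pfd, 1, timeout_ms) == 1)
                n = recv(fd, reply, replylen, 0);
        }
    }
    close(fd);
    return n;
}
```

For "www.google.com" with query type 28 (AAAA), build_query produces a 32-byte message, the same size as the sendto() in the trace.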

> I explicitly sent the QUIT signal to force termination at the point were GNU ends up
So the point where GNU "ends up" is just waiting in poll(3) for socket connection or DNS response? I'm confused. You keep going back to lack of proper signal handling, but the only signal involved here seems to be the one *you* sent, forcing termination. Almost any response to SIGABRT other than dumping core would be *improper*, in my book.

Comment by Steffen Nurpmeso (sdaoden) - Friday, 04 December 2015, 18:08 GMT

Mr.! So it doesn't use signals. Then this isn't the reason, right.
ABRT seems to be the right thing to do.

Comment by Dave Reisner (falconindy) - Friday, 04 December 2015, 18:13 GMT

So then let's talk about your DNS server, as eworm suggested...

Comment by Allan McRae (Allan) - Saturday, 20 February 2016, 01:57 GMT

Is there still an issue here?

Comment by Steffen Nurpmeso (sdaoden) - Saturday, 20 February 2016, 11:19 GMT

This Monday, yes. (I -Sy/-Su each Saturday.)

Comment by Allan McRae (Allan) - Saturday, 20 February 2016, 11:37 GMT

Do you have DNS issues outside of S-nail?

Comment by Steffen Nurpmeso (sdaoden) - Saturday, 20 February 2016, 12:59 GMT

The other programs i use in my ArchLinux VM mostly don't use the resolver but talk DNS directly, except for curl, which, without having looked, i guess uses c-ares?
So the answer is no. But S-nail really only calls getaddrinfo():

https://git.sdaoden.eu/cgit/s-nail.git/tree/fio.c#n1985

I will debug more for v14.9; unfortunately S-nail will continue using BSD-style signal handling even then.
Interestingly i never ever encountered this error when only sending mail via SMTP, but only in interactive mode, which is why i suspect this direction...

P.S.: thanks for the ntohs(), the !HAVE_GETADDRINFO branch has indeed been broken in S-nail for many months.

Comment by Dave Reisner (falconindy) - Saturday, 20 February 2016, 13:02 GMT

curl does not use c-ares, it calls getaddrinfo.

Comment by Steffen Nurpmeso (sdaoden) - Saturday, 20 February 2016, 13:20 GMT

Well; if it's really only us then i will use a version that temporarily reinstalls the default SIGCHLD handler.. and what.. and see whether this changes anything.
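
Editorial sketch of the experiment proposed here: switch SIGCHLD to its default disposition for the duration of one getaddrinfo() call and restore the previous disposition afterwards. The wrapper name getaddrinfo_with_default_sigchld is made up for illustration; as noted earlier in the thread, the resolver does not use child processes, so this is a diagnostic step and not a known fix.

```c
#include <errno.h>
#include <netdb.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical wrapper: run getaddrinfo() with SIGCHLD set to SIG_DFL,
 * then put the caller's previous SIGCHLD disposition back. */
int getaddrinfo_with_default_sigchld(const char *host, const char *service,
                                     const struct addrinfo *hints,
                                     struct addrinfo **res)
{
    struct sigaction dfl, saved;
    int rc, saved_errno;

    memset(&dfl, 0, sizeof dfl);
    dfl.sa_handler = SIG_DFL;
    sigemptyset(&dfl.sa_mask);
    if (sigaction(SIGCHLD, &dfl, &saved) != 0)
        return EAI_SYSTEM;

    rc = getaddrinfo(host, service, hints, res);

    /* Restore the old disposition; keep errno intact for EAI_SYSTEM. */
    saved_errno = errno;
    sigaction(SIGCHLD, &saved, NULL);
    errno = saved_errno;
    return rc;
}
```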

Comment by Steffen Nurpmeso (sdaoden) - Saturday, 20 February 2016, 13:24 GMT

P.S.: and yes i _do_ have problems with DNS where i live: i'm connected wirelessly and the connection is often weird; worse, the local carrier seems to like to disconnect so that we reconnect again, with an E-Plus net winning (which would cause roaming and make things expensive when used). So yes, i do have problems with DNS. But if i reconnect to the net i can also reconnect, e.g., SSH or the browser, whereas getaddrinfo() in S-nail is a dead-end when this problem occurs.

Comment by Steffen Nurpmeso (sdaoden) - Monday, 22 February 2016, 10:13 GMT

Still happens with glibc 2.23-1 (with unchanged mailx). (Just to mention that i always use the devel version which doesn't use alloca, fwiw.)

Comment by Allan McRae (Allan) - Monday, 22 February 2016, 10:20 GMT

I suggest you try the libc-help mailing list.

Comment by Steffen Nurpmeso (sdaoden) - Monday, 16 May 2016, 12:27 GMT

Hello!
I haven't seen this error since the beginning of May.
Looks like this bug can be closed.
