Please read this before reporting a bug:
https://wiki.archlinux.org/title/Bug_reporting_guidelines
Do NOT report bugs when a package is just outdated, or it is in the AUR. Use the 'flag out of date' link on the package page, or the Mailing List.
REPEAT: Do NOT report bugs for outdated packages!
FS#47249 - [glibc] GNU DNS resolver fails to resolve for unknown reasons
Attached to Project:
Arch Linux
Opened by Steffen Nurpmeso (sdaoden) - Thursday, 03 December 2015, 14:32 GMT
Last edited by Allan McRae (Allan) - Monday, 16 May 2016, 12:36 GMT
Details
Description:
I see this since i am working regularly on Arch. The problem shows up like this:

s-nail: Lookup of "pop.yandex.ru:pop3s" failed: Name or service not known
s-nail: Trying standard protocol port "995"
If that succeeds consider including the port in the URL!
^\
=== Command detached from window (Wed Dec 2 19:17:45 2015) ===

In a virgin program the first DNS lookup may fail -- if that happens all further lookups will fail. If it doesn't happen for the first lookup, it won't happen at all. This makes me think the GNU DNS resolver fails to deal with signals properly. I'll try to add a better work-around for S-nail v14.9 (the last obviously didn't work), but despite this it's a problem of the resolver anyway. For the above i have a stack trace:

Dec 02 19:17:45 wales systemd-coredump[27169]: Process 30549 (s-nail) of user 1000 dumped core.
Stack trace of thread 30549:
#0 0x00007f0d64f3a170 poll (/usr/lib/libc-2.22.so)
#1 0x00007f0d63619c81 n/a (/usr/lib/libresolv-2.22.so)
#2 0x00007f0d63617c22 __libc_res_nquery (/usr/lib/libresolv-2.22.so)
#3 0x00007f0d63618265 n/a (/usr/lib/libresolv-2.22.so)
#4 0x00007f0d63618721 __libc_res_nsearch (/usr/lib/libresolv-2.22.so)
#5 0x00007f0d62bd7ae9 _nss_dns_gethostbyname4_r (/usr/lib/libnss_dns-2.22.so)
#6 0x00007f0d64f2dff1 gaih_inet (/usr/lib/libc-2.22.so)
#7 0x00007f0d64f2ff9e getaddrinfo (/usr/lib/libc-2.22.so)
#8 0x00000000080df29e n/a (n/a)

Additional info:
* package version(s)
* config and/or log files etc.
My system is current.

Steps to reproduce:
I have no idea.
Closed by Allan McRae (Allan)
Monday, 16 May 2016, 12:36 GMT
Reason for closing: None
Additional comments about closing: "fixed"
For completeness, here are the details of my issue:
The client sends A and AAAA UDP queries. If either response comes back truncated, it sends both the A *and* the AAAA query over a single TCP connection. By default pdnsd answers just one of them, which breaks name resolution. I prepared a patch, but it has not (yet) been applied upstream. Sadly pdnsd is pretty badly maintained, so I switched to dnsmasq.
https://github.com/eworm-de/pdnsd/commit/8511b2a97c04ee235b0799db47fa420bd4a23b87
The referenced commit limits TCP connection reusability? I haven't actually traced the network I/O that causes the above issue, but i don't think that TCP is at all involved. I've added the stack trace only due to frustration, and to show where we are QoSing. I.e., if it happens to happen, _all_ further requests fail. This is what this issue is about. I need to quit and restart S-nail because getaddrinfo(3) will never succeed again, even though other programs in the VM will be able to resolve.
Regarding truncation, our own DNS resolver treats each request by itself.
in PckParse
// if trunc and trunc not allowed, use TCP with same server
// though "truncated datagrams are usually correct if just any entry
// is present is AUTHORITY". we do so because of
// 1) EDNS (OPT always in additional)
// 2) a possibly set security hook (TKEY/TSIG always in additional)
// 3) RFC 1123 6.1.3.2 says that we must not cache RRs of such packets
// in a way that the fact that they came from truncated packet is
// lost. in addition "mailers MUST NOT use a truncated MX response
// at all due to the fact that this could lead to mail loops".
// ('could take care of *where* truncation occurs here, though.
// but this interferes with 2) again..
// in addition RFC 2181, 9, says that TC should not be set for
// truncation in ADDITIONAL.
// well, evolving standards are one thing AND the other one, too...)
ret = pckp_needs_tcp;
if(_qrb->pctx.ph->tc && !(QL(ql)->conf_flags & qf_conf_igntc)) {
    _LOG0(("\t<> Truncated and conf_igntc is not set..%R"));
    goto jout;
}
called by HdlPacket
case pckp_needs_tcp:
    // TCP only
    // - if not already
    // - if server is the current one (otherwise route too loaded)
    // - if TCP succeeds
    // now - what to do if TCP truncates???
    if(!_q && qrb.iscurr && a_NeedsTCP(&qrb))
        break;
and NeedsTCP starts with
_LOG0(("\t- Response is truncated and conf_igntc is not set.%R"
"\t (Or conf_igntc is set but truncation occurred before "
"answer RR.)%R"
"\t Trying a TCP connection with same server...%R"));
so then we enter TCP. But packets are individuals, anyway. :)
All of this is non-blocking and event-based via I/O monitors; the program has to be driven by an event loop.
And all this has nothing to do with this issue, of course :)
What signal did it fail to handle? What would "properly" be?
Have you modified your /etc/nsswitch.conf? Could you attach it? The initial error you cite sounds more like a services lookup failure, not a DNS failure. The stack trace you include doesn't include the termination of the program, either.
What would properly be? Well often software takes into account an inherited signal state: if that is SIG_IGN it doesn't do anything. And see first paragraph, please.
Nah, like i said, it really is all standard. And yes it does? I explicitly sent the QUIT signal to force termination at the point where GNU ends up. And no it doesn't, it is just S-nail trying to make some sense of a fuzzy resolver error -- since the /etc/services file on ArchLinux knows about the POP3S protocol the above message looks wrong, and indeed on ArchLinux it doesn't make sense to retry the lookup with an explicit port specification. On other OSs with a different /etc/services it does; and S-nail uses standard libraries, so parsing /etc/services to avoid the problem is not an option.
# Begin /etc/nsswitch.conf
passwd: files
group: files
shadow: files
publickey: files
hosts: files dns myhostname
networks: files
protocols: files
services: files
ethers: files
rpc: files
netgroup: files
# End /etc/nsswitch.conf
No, it doesn't. You can see exactly what it does if you run something like 'strace getent hosts www.google.com'. Essentially:
socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
sendto(3, "\307#\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\34\0\1", 32, MSG_NOSIGNAL, NULL, 0) = 32
poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
ioctl(3, FIONREAD, [60]) = 0
recvfrom(3, "\307#\201\200\0\1\0\1\0\0\0\0\3www\6google\3com\0\0\34\0\1"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, [16]) = 60
> I explicitly sent the QUIT signal to force termination at the point where GNU ends up
So the point where GNU "ends up" is just waiting in poll(3) for socket connection or DNS response? I'm confused. You keep going back to lack of proper signal handling, but the only signal involved here seems to be the one *you* sent, forcing termination. Almost any response to SIGABRT other than dumping core would be *improper*, in my book.
ABRT seems to be the right thing to do.
So the answer is no. But S-nail really only calls getaddrinfo():
https://git.sdaoden.eu/cgit/s-nail.git/tree/fio.c#n1985
I will debug more for v14.9; unfortunately S-nail will continue using BSD-style signal handling even then.
Interestingly i never ever encountered this error when only sending mail via SMTP, but only in interactive mode, so that is why i think this direction...
P.S.: thanks for the ntohs(), the !HAVE_GETADDRINFO branch is indeed broken in S-nail since many months.
I haven't seen this error since the beginning of May.
Looks like this bug can be closed.