FS#23052 - [glibc] getaddrinfo() support for IPv6 DNS servers is broken (huge latency)

Attached to Project: Arch Linux
Opened by Andrej Podzimek (andrej) - Friday, 25 February 2011, 22:40 GMT
Last edited by Allan McRae (Allan) - Monday, 11 April 2011, 04:13 GMT
Task Type Bug Report
Category Packages: Core
Status Closed
Assigned To Allan McRae (Allan)
Architecture All
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Description:

IPv6-only DNS servers (and IPv6-only networks) are currently almost unusable with all programs that rely on the getaddrinfo() functionality. There are huge latencies of >5 seconds with an IPv6 DNS server.

Interesting facts:
* dig works *perfectly* in exactly the same environment; there is no latency
* there is no latency when observing a test application with strace, but a latency always occurs without strace (!)
* there is no latency with an IPv4 DNS server

The fact that strace removes the latency is really *surprising*, to say the least. You can use the attached gaitest.c snippet to observe this. A couple of examples can be found below. Unfortunately, this makes it impossible to diagnose the issue with strace. :-(

Additional info:

* package version(s)

glibc 2.13-4
(The issue has existed for >3 months, AFAIK, so the exact glibc version may not matter.)

* config and/or log files etc.

These are the odd latencies that disappear with strace:

$ time ./gaitest ipv6.google.com
2a00:1450:8004::93

real 0m5.037s
user 0m0.003s
sys 0m0.000s

$ time strace ./gaitest ipv6.google.com 2>/dev/null
2a00:1450:8004::93

real 0m0.011s
user 0m0.007s
sys 0m0.000s

Please note that both of these results are 100% reproducible, so they are not related to any caches. And as already mentioned, this only occurs when an IPv6 DNS server is configured. DNS over IPv4 does not have this issue.
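
The gaitest.c attachment is not reproduced on this page. As a rough stand-in, the sketch below assumes the test does little more than call getaddrinfo() and print the resulting addresses. It is written in Python, whose socket.getaddrinfo() wraps the C library function on Linux; the names resolve() and timed_resolve() are illustrative only. It also times the call in-process, so the delay can be seen without the shell's time builtin.

```python
import socket
import sys
import time


def resolve(host, family=socket.AF_UNSPEC):
    """Return the unique addresses getaddrinfo() reports for host."""
    infos = socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        # sockaddr[0] is the textual address for both IPv4 and IPv6 results.
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


def timed_resolve(host, family=socket.AF_UNSPEC):
    """Resolve host and also return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    addresses = resolve(host, family)
    return addresses, time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical default so the script also runs without an argument.
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    addresses, elapsed = timed_resolve(host)
    for address in addresses:
        print(address)
    print(f"getaddrinfo() took {elapsed:.3f} s", file=sys.stderr)
```

Saved under a hypothetical name such as gaitest.py, it could be run as "python gaitest.py ipv6.google.com", with and without strace, to compare against the timings above.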

My configuration files follow.

$ cat /etc/resolv.conf
nameserver 2002:****:****:1::1

$ cat /etc/host.conf
order hosts,bind
multi on

$ cat /etc/nsswitch.conf
passwd: files
group: files
shadow: files
publickey: files
hosts: files dns
networks: files
protocols: files
services: files
ethers: files
rpc: files
netgroup: files

$ cat /etc/hosts
::1 localhost charonng
127.0.0.1 localhost charonng

$ cat /etc/gai.conf
# This file is empty. I experimented with the default file and with several modifications of it, but the issue remained the same.

Steps to reproduce:

1) Set your /etc/resolv.conf to use an IPv6 DNS server.
2) Try to resolve an address using getaddrinfo(). (You will see huge latencies in Firefox, for instance.)

Closed by Allan McRae (Allan)
Monday, 11 April 2011, 04:13 GMT
Reason for closing: No response
Additional comments about closing: Requires someone with the correct setup to do the git bisect. Request that this be reopened once that is done.
Comment by Allan McRae (Allan) - Saturday, 26 February 2011, 00:24 GMT
Probably a duplicate of FS#20470.
Comment by Andrej Podzimek (andrej) - Saturday, 26 February 2011, 15:11 GMT
I don't think this is a duplicate of FS#20470.

In my case, the DNS server is on the same hardware switch as the machines that generate queries. There is no intermediate DNS proxy. The machine with the DNS server has not been updated for months, and it had always worked fine, without observable latencies, before this issue emerged. So this is probably not a server-side issue. Clients used to work just fine a couple of weeks (months?) ago, but then users started to observe these huge latencies. The problem was not reported immediately, since everybody thought it was just a *temporary* issue on the network or the like.

There are also other major differences. In this case,

1) the latency is always almost exactly 5 seconds, and there are no repeated queries.
2) it does *not* matter whether glibc requests A and/or AAAA records. The latency is still the same.
3) it does matter how the DNS communication is transported. DNS over IPv6 causes the delay, whereas DNS over IPv4 works fine.
4) using strace removes the DNS over IPv6 latency (which is probably the most surprising fact).
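
Point 2 in the list above can be checked from a single process by restricting the address family passed to getaddrinfo(). The sketch below is in Python, whose socket.getaddrinfo() wraps the C library function on Linux. It assumes that AF_INET leads to A lookups only, AF_INET6 to AAAA lookups only, and AF_UNSPEC to both; the function names and labels are illustrative, not taken from gaitest.c.

```python
import socket
import sys
import time

# Labels are descriptive only; the record types in parentheses are the
# assumed mapping from address family to DNS query type.
FAMILIES = {
    "AF_INET (A)": socket.AF_INET,
    "AF_INET6 (AAAA)": socket.AF_INET6,
    "AF_UNSPEC (both)": socket.AF_UNSPEC,
}


def time_lookup(host, family):
    """Return (succeeded, elapsed_seconds) for one getaddrinfo() call."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
        succeeded = True
    except socket.gaierror:
        # A failed lookup is still timed; the delay is what matters here.
        succeeded = False
    return succeeded, time.monotonic() - start


def compare_families(host):
    """Time the same lookup once per address family."""
    return {label: time_lookup(host, family) for label, family in FAMILIES.items()}


if __name__ == "__main__":
    # Hypothetical default so the script also runs without an argument.
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    for label, (succeeded, elapsed) in compare_families(host).items():
        status = "ok" if succeeded else "failed"
        print(f"{label:18s} {status:6s} {elapsed:.3f} s")
```

If the delay is independent of the requested record type, as described, all three rows should show roughly the same latency when the resolver uses an IPv6 name server.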
Comment by Andrej Podzimek (andrej) - Saturday, 26 February 2011, 17:28 GMT
Perhaps I forgot to stress one more important fact: This is *only* a getaddrinfo() issue. Programs that do not use getaddrinfo(), such as Opera (and dig, of course) work flawlessly.
Comment by Rémy Oudompheng (remyoudompheng) - Saturday, 26 February 2011, 17:38 GMT
You do not specify where the latency comes from: is it on the server side or the client side? Can you log the exact timestamps of when the query is sent and the answer is received?
Comment by Allan McRae (Allan) - Saturday, 26 February 2011, 21:19 GMT
Ah - the key information here is that this used to work. Although the original report says it has been broken for >3 months and the comment says it was "fine a couple of weeks (months?) ago" so the timeline is a bit screwy.

Anyway, given this used to work, you can git bisect the issue and find the upstream change that causes it. My guess is this was the glibc-2.12 to 2.13 update, so that should give you starting points. I cannot take this bug much further without that being done.
Comment by Andrej Podzimek (andrej) - Saturday, 26 February 2011, 21:40 GMT
These files can be inspected in Wireshark. They illustrate (quite precisely) what is happening. Obviously, strace dramatically changes the behavior of the observed application. (!) Opera works fine, no serious delays.

And yes, there might be some similarity to FS#20470 ... at least guessing by the Wireshark output. getaddrinfo() really generates a redundant second query when used *without* strace. getaddrinfo() under strace does not do this. Opera generates quite a lot of queries, but there are no problems with delays.

BTW, how can an application find out that it is strace'd? I thought this should not be possible, at least for a non-root program. But obviously, my gaitest program does this...
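
On the general question: on Linux, an unprivileged process can tell that it is being traced by reading the TracerPid field of /proc/self/status, which is 0 when no tracer is attached. The sketch below shows that check in Python; the function names are illustrative. Nothing in this report indicates that glibc or gaitest performs such a check, and a tracer can also change behavior simply by slowing the traced process down.

```python
def parse_tracer_pid(status_text):
    """Extract the TracerPid value from /proc/<pid>/status contents."""
    for line in status_text.splitlines():
        if line.startswith("TracerPid:"):
            return int(line.split(":", 1)[1])
    return None  # field not present


def tracer_pid(path="/proc/self/status"):
    """Return the PID of the attached tracer, 0 if none, None if unknown."""
    try:
        with open(path) as status_file:
            return parse_tracer_pid(status_file.read())
    except OSError:
        # No procfs available (for example on a non-Linux system).
        return None


if __name__ == "__main__":
    pid = tracer_pid()
    if pid is None:
        print("cannot tell (no TracerPid field available)")
    elif pid == 0:
        print("not traced")
    else:
        print(f"traced by PID {pid}")
```

Run directly, this should report "not traced"; run under strace, it should report the PID of the strace process.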
Comment by Allan McRae (Allan) - Monday, 14 March 2011, 13:30 GMT
If this is going to be fixed, someone with the correct setup to test for the issue will need to do the git bisect and find the patch that caused this issue. Until then, I cannot follow up with upstream and get it fixed.
