go 1.4.2 cannot parse some dns responses from consul #854

Closed
fedyakin opened this issue Apr 9, 2015 · 16 comments
Labels
type/bug Feature does not function as expected

Comments

@fedyakin

fedyakin commented Apr 9, 2015

This can be reproduced by calling net.LookupHost("metrics-api.librato.com") with Go 1.4.2, using Consul as the DNS server.

After some investigation, we believe the problem is that the UDP DNS response exceeds the 512 byte limit specified in RFC 1035 without setting the truncation flag. The Go 1.4.2 implementation of the DNS client cannot parse such responses. This can be seen in readDNSResponse, starting on line 41 of net/dnsclient_unix.go: the function uses a fixed-size buffer and so only passes part of the message to the deserialization function. Prior versions of Go used a 2000 byte buffer.
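For illustration, here is a minimal self-contained sketch of that failure mode. It uses the miekg/dns library for packing and is not the actual stdlib code; 127.0.0.1:8600 is Consul's default DNS endpoint.

package main

import (
	"fmt"
	"net"

	"github.com/miekg/dns"
)

// Sketch of the Go 1.4 stub resolver's failure mode (not the stdlib
// source): the UDP reply is read into a fixed 512 byte buffer, so a
// larger, non-truncated response loses its tail and the partial
// message fails to unpack.
func query512(server, name string) error {
	conn, err := net.Dial("udp", server)
	if err != nil {
		return err
	}
	defer conn.Close()

	q := new(dns.Msg)
	q.SetQuestion(dns.Fqdn(name), dns.TypeA)
	wire, err := q.Pack()
	if err != nil {
		return err
	}
	if _, err := conn.Write(wire); err != nil {
		return err
	}

	buf := make([]byte, 512) // the RFC 1035 UDP limit, as in readDNSResponse
	n, err := conn.Read(buf) // an oversized Consul reply is cut off here
	if err != nil {
		return err
	}
	resp := new(dns.Msg)
	return resp.Unpack(buf[:n]) // fails, like "cannot unmarshal DNS message"
}

func main() {
	fmt.Println(query512("127.0.0.1:8600", "metrics-api.librato.com"))
}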

Setting the dns_config.enable_truncate flag in the Consul configuration did not resolve this problem. It appears that the flag is only used when sending SRV record responses.
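For reference, this is how we set that flag in the agent configuration:

{
  "dns_config": {
    "enable_truncate": true
  }
}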

The frequency of this problem could be lowered by using DNS message compression (described in section 4.1.4 of RFC 1035). The response from 8.8.8.8 for this lookup is close to 500 bytes smaller than the one from Consul, and well under the 512 byte limit.
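To make the size difference concrete, here is a rough sketch using the miekg/dns library that packs the same answer set with and without name compression (the record values are from the lookup above, trimmed for brevity):

package main

import (
	"fmt"
	"net"

	"github.com/miekg/dns"
)

// Sketch: pack one answer set twice, with and without RFC 1035
// section 4.1.4 name compression. The repeated owner names and the
// shared CNAME target compress well, which is why the 8.8.8.8 reply
// stays comfortably under 512 bytes.
func main() {
	target := "metrics-prod-1711139259.us-east-1.elb.amazonaws.com."
	m := new(dns.Msg)
	m.SetQuestion("metrics-api.librato.com.", dns.TypeA)
	m.Answer = append(m.Answer, &dns.CNAME{
		Hdr:    dns.RR_Header{Name: "metrics-api.librato.com.", Rrtype: dns.TypeCNAME, Class: dns.ClassINET, Ttl: 60},
		Target: target,
	})
	for _, ip := range []string{"23.23.110.30", "54.225.85.31", "23.21.91.141"} {
		m.Answer = append(m.Answer, &dns.A{
			Hdr: dns.RR_Header{Name: target, Rrtype: dns.TypeA, Class: dns.ClassINET, Ttl: 60},
			A:   net.ParseIP(ip),
		})
	}

	m.Compress = false
	plain, _ := m.Pack()
	m.Compress = true
	packed, _ := m.Pack()
	fmt.Printf("uncompressed: %d bytes, compressed: %d bytes\n", len(plain), len(packed))
}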

@cjhubert

This is a problem for people compiling with CGO_ENABLED=0 and using Consul for DNS in Go 1.4+.

I've created a gist with the exact code we're running; the output below comes from it.

In Go 1.3.3, that code outputs the following. Note that the 401 response is expected; I just wanted to show that the DNS name resolves.

2015/04/10 17:08:56 (go1.3.3)
2015/04/10 17:08:56 CNAME: metrics-prod-1711139259.us-east-1.elb.amazonaws.com.
2015/04/10 17:08:56 Hosts: [23.23.110.30 54.225.85.31 23.21.91.141 54.235.144.213 50.19.125.98 54.235.110.130 23.23.89.169 54.243.232.250]
2015/04/10 17:08:56 Response: &{401 Unauthorized 401 HTTP/1.1 1 1 map[Content-Type:[application/json;charset=utf-8] Date:[Fri, 10 Apr 2015 17:08:56 GMT] Server:[nginx] Status:[401 Unauthorized] Www-Authenticate:[Basic realm="Librato API"] Content-Length:[49] Connection:[keep-alive]] 0xc2081693c0 49 [] false map[] 0xc20803a4e0 0xc208071f80}

In Go 1.4.2, the exact same code outputs:

2015/04/10 17:09:11 (go1.4.2)
2015/04/10 17:09:11 CNAME: metrics-prod-1711139259.us-east-1.elb.amazonaws.com.
2015/04/10 17:09:11 Host Error: lookup metrics-api.librato.com on 10.1.42.1:53: cannot unmarshal DNS message
2015/04/10 17:09:11 Hosts: []
2015/04/10 17:09:11 HTTP Error: Get https://metrics-api.librato.com/v1/snapshots/1: dial tcp: lookup metrics-api.librato.com on 10.1.42.1:53: cannot unmarshal DNS message
2015/04/10 17:09:11 Response: <nil>

Note the error: cannot unmarshal DNS message. According to tcpdump, the response we're getting back from Consul DNS is ~700 bytes. After the change to readDNSResponse() that @fedyakin linked, responses must be under 512 bytes to parse.

The same DNS response from 8.8.8.8 is only ~200 bytes, since it uses message compression.

We're using CGO_ENABLED=0 since we run in FROM scratch docker containers.

For now, we've found a workaround. It involves setting the Consul config to not recurse (before, the recursor was set to 8.8.8.8):

{
  "recursor": ""
}

And then, when running the docker container, setting the first DNS server to Consul (which is advertised on the docker0 interface) with a fallback of 8.8.8.8:

docker run --dns $(ip -o -4 addr show docker0 | awk -F '[ /]+' '{print $4}') --dns 8.8.8.8

Edit: For anyone using the workaround, note that it's only viable if you don't actually need Consul to recurse through DNS. Look at #971 if you do need recursion (though, while using #971, Consul seems to panic from #1023).

@armon
Member

armon commented Apr 10, 2015

Tagging as a bug; it may just have to do with us sending back a non-compressed DNS response.

@discordianfish
Contributor

@armon Just FYI, we (Docker) just changed the registry dns name to a CNAME, which means this bug (or the netgo bug, see my linked issue) is breaking everyone using Docker and Consul with a recursor right now.

@discordianfish
Contributor

Okay, I think I found the actual problem: miekg/dns#216
Consul takes the message it gets back from Exchange() and just writes it straight back to the client. Since the upstream message was compressed, it fit into 512 bytes, but since Consul sends it back uncompressed it's bigger, and strict stub resolvers (like netgo) reject it.
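In miekg/dns terms, the fix would amount to something like this in the recursor handler (a sketch, not Consul's actual code; the upstream address is a placeholder):

// Sketch of a recursor handler that re-enables compression on the
// upstream reply before writing it back; not Consul's actual code.
// Without resp.Compress = true, WriteMsg re-packs the message
// uncompressed, which can push it past 512 bytes even though the
// compressed upstream reply fit.
func handleRecurse(w dns.ResponseWriter, req *dns.Msg) {
	c := new(dns.Client)
	resp, _, err := c.Exchange(req, "8.8.8.8:53") // placeholder upstream
	if err != nil {
		dns.HandleFailed(w, req)
		return
	}
	resp.Compress = true // compress on the way back out
	w.WriteMsg(resp)
}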

@slackpad
Contributor

Realized that this issue is now being discussed over on #971. I'll link this for context but close it in favor of that issue, which has the latest discussion.

@slackpad
Contributor

Actually since that's a PR I'll leave this open until we merge a fix.

@slackpad slackpad reopened this Mar 11, 2016
@hermansc

hermansc commented May 13, 2016

Having this problem too. Using a recursor and trying to ping a host whose DNS record is ~2500 bytes, I get:

[ERR] dns: recurse failed: dns: failed to unpack truncated message
[ERR] dns: all resolvers failed for {sd.example.com. 1 1} from client 172.17.xxx.xxx:50313 (udp)

Setting enable_truncate makes no difference; it seems that flag is only applied when the answer is not fetched through the recursor. (This is just from glancing at dns.go.)

I think either of these would work (see the sketch after this list):

  • Adding an EDNS0 OPT RR that advertises a larger buffer (via SetEdns0)
  • Falling back to TCP if the UDP query fails (or being able to force recursed queries over TCP; currently UDP is hard-coded)
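A rough sketch of both ideas, assuming the miekg/dns API (not Consul's code; the upstream address is a placeholder):

// Sketch of both suggestions: advertise a larger UDP buffer via
// EDNS0, and retry over TCP if the reply still comes back truncated.
func recurse(req *dns.Msg, upstream string) (*dns.Msg, error) {
	q := req.Copy()
	q.SetEdns0(4096, false) // OPT RR advertising a 4096 byte UDP buffer

	c := new(dns.Client)
	resp, _, err := c.Exchange(q, upstream)
	if resp != nil && resp.Truncated {
		c.Net = "tcp" // fall back to TCP for the full answer
		resp, _, err = c.Exchange(q, upstream)
	}
	return resp, err
}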

It seems development on #971 has stopped, and it's probably not the right solution anyway, as very large responses (such as the ~2.5 kB one I have) will still potentially go over the 512 byte limit.

@beneshed

Is there a workaround?

@hermansc

@thebenwaters it's a horrible one, but I think you can run your container in host network mode.

@discordianfish
Contributor

@hermansc I don't think that makes any difference. But there might be unrelated UDP issues with Docker.

@tmichaud314

@slackpad Hi. Is there a resolution for this one in the works?

@tkambler

tkambler commented Aug 17, 2016

I just spent the last 8 hours trying to track this crap down, then finally found this. Damnit.

@slackpad
Contributor

Sorry, I forgot to close this - it was fixed via #971 and #2096, in addition to the response-size trimming added in Consul 0.6.4. These fixes will go out in Consul 0.7.

@tkambler

@slackpad Are these fixes in the master branch?

@slackpad
Contributor

slackpad commented Aug 17, 2016

@tkambler yes they are in master - please let me know if you still see any issues.

DNS compression is on by default (opt-out), so you shouldn't need to do anything special to enable it with a later build. Earlier today we announced a release candidate build of 0.7 that's not ready for production yet but includes these fixes, if you'd like to test with it: https://groups.google.com/d/msg/consul-tool/7KDuvdwNpi0/LSY5LiPnCwAJ
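For completeness, my understanding is that the opt-out lives under dns_config if you ever need the old uncompressed behavior (compression stays on by default):

{
  "dns_config": {
    "disable_compression": true
  }
}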

@tkambler

Sure enough - swapping out Consul for v0.7rc immediately resolves the problem.
