Jostle logic seems to randomly stop working

W.C.A. Wijngaards wouter at nlnetlabs.nl
Wed May 23 06:11:09 UTC 2018


Hi Dmitry,

On 19/05/18 03:59, Dmitri Kourennyi via Unbound-users wrote:
> More investigation results:
> 
> I think the issue appears when unbound is trying to probe the master
> servers for
> the auth_zone section. The logs show unbound doing lookups on all the
> auth_zone
> domain names in my config, and I think unbound is answering them from
> its own
> cache. After the lookups, I get the following in the logs:

Can you show the work that it does for looking up one of the root
servers?  Not getting an address must be causing it to not have content.
 But it does work (eventually), you say, once the long list appears,
that means the AXFR has completed.  Meanwhile you should have normal
service, because the zone is loaded (the file that is configured has
content, right?)?  When a normal query arrives, it should just be
answered with the auth-zone?

The bug that was fixed in 1.7.1 (causes problem now?), supposedly fixes
behaviour with respect to the forward-zones configured.  Is that still
not right somehow?  Note that having a forward zone for "." and also an
auth-zone 7706 for the root, in 1.7.1 answers only queries for the root
itself from the root (only domain ".") and other queries from the
forward-zone.  Where in 1.7.0 it would pick the auth-zone referral and
go make authoritative queries (and that was a bug and fixed).  So, if
1.7.1 does not work, perhaps authoritative queries work, but the
forward-zone does not work so well.  And if you remove that, then
unbound starts making authoritative queries again.

That the root zone is downloaded every half hour is normal, that is
exactly the AXFR of the root zone that the auth-zone is supposed to do.
So that seems to be working fine and is keeping the root zone up to date.

Best regards, Wouter

> 
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end
> 13 recursion states (462 with reply, 13 detached), 13 waiting replies,
> 1444 recursion replies sent, 1 replies dropped, 0 states jostled out
> [*snip* histogram]
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 0RDd mod0 
> b.root-servers.net. A IN
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 1RDd mod0 
> c.root-servers.net. A IN
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 2RDd mod0 
> f.root-servers.net. A IN
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 3RDd mod0 
> g.root-servers.net. A IN
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 4RDd mod0 
> k.root-servers.net. A IN
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 5RDd mod0 
> xfr.cjr.dns.icann.org. A IN
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 6RDd mod0 
> xfr.lax.dns.icann.org. A IN
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 7RDd mod0 
> b.root-servers.net. AAAA IN
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 8RDd mod0 
> c.root-servers.net. AAAA IN
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 9RDd mod0 
> g.root-servers.net. AAAA IN
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 10RDd mod0 
> k.root-servers.net. AAAA IN
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 11RDd mod0 
> xfr.cjr.dns.icann.org. AAAA IN
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 12RDd mod0 
> xfr.lax.dns.icann.org. AAAA IN
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end
> 12 recursion states (462 with reply, 12 detached), 12 waiting replies,
> 1444 recursion replies sent, 1 replies dropped, 0 states jostled out
>> 
> The above pattern repeats, with the last entry in the RDd list dropping
> off, and
> the `detached` and `waiting reply` counters going down by one each time.
> Once
> the list is emptied I get:
> 
>> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end 1
> recursion states (462 with reply, 1 detached), 1 waiting replies, 1444
> recursion replies sent, 1 replies dropped, 0 states jostled out
> [*snip* histogram]
> 0RDd mod0  b.root-servers.net. A IN
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end 1
> recursion states (462 with reply, 0 detached), 0 waiting replies, 1444
> recursion replies sent, 1 replies dropped, 0 states jostled out
> [*snip* histogram]
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: auth zone .:
> soa probe serial is 2018051800
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: auth_zone
> unchanged, new lease, wait
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: auth zone .
> timeout in 1800 seconds
> May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: close fd 15
> 
> From this point onward, queries are rejected. Half an hour later, the
> auth-zone
> lookup attempt repeats, and it all looks like above. 4 hours later, I get a
> bunch of lines like this:
> 
> May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: comm point
> start listening 15
> May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: Reading tcp
> query of length 17540
> May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: comm point
> stop listening 15
> May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: comm point
> start listening 15
>> May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: Reading tcp
> query of length 6523
> May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: comm point
> stop listening 15
> May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: xfr
> b.root-servers.net: last AXFR packet
> 
> Followed by a big list:
> 
> May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: apply_axfr
> at[0] . SOA
> May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: apply_axfr
> at[1] . RRSIG
>> May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: apply_axfr
> at[3028] camera. DS
> May 18 16:59:57 homebrew systemd-journald[199]: Suppressed 19535
> messages from unbound.service
> 
> Which looks like it got new records for the auth_zone? I'm not sure. It
> keeps
> looping the auth_zone lookups every half hour still.
> 
> I added some logging to mesh_make_new_space(), which showed that no
> jostle is
> attempted because mesh->jostle_first is a nullptr. I don't know if that's
> normal.
> 
> Since it seems to coincide with the auth-zone lookups, I will try
> disabling it
> to see if the issue resolves itself.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <http://lists.nlnetlabs.nl/pipermail/unbound-users/attachments/20180523/7d866831/attachment.bin>


More information about the Unbound-users mailing list