Jostle logic seems to randomly stop working

Dmitri Kourennyi dkour at mykolab.com
Sat May 19 01:59:34 UTC 2018


More investigation results:

I think the issue appears when unbound is trying to probe the master 
servers for
the auth_zone section. The logs show unbound doing lookups on all the 
auth_zone
domain names in my config, and I think unbound is answering them from 
its own
cache. After the lookups, I get the following in the logs:

May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end 
13 recursion states (462 with reply, 13 detached), 13 waiting replies, 
1444 recursion replies sent, 1 replies dropped, 0 states jostled out
[*snip* histogram]
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 0RDd mod0  
b.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 1RDd mod0  
c.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 2RDd mod0  
f.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 3RDd mod0  
g.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 4RDd mod0  
k.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 5RDd mod0  
xfr.cjr.dns.icann.org. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 6RDd mod0  
xfr.lax.dns.icann.org. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 7RDd mod0  
b.root-servers.net. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 8RDd mod0  
c.root-servers.net. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 9RDd mod0  
g.root-servers.net. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 10RDd mod0  
k.root-servers.net. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 11RDd mod0  
xfr.cjr.dns.icann.org. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 12RDd mod0  
xfr.lax.dns.icann.org. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end 
12 recursion states (462 with reply, 12 detached), 12 waiting replies, 
1444 recursion replies sent, 1 replies dropped, 0 states jostled out
⋮

The above pattern repeats, with the last entry in the RDd list dropping 
off, and
the `detached` and `waiting reply` counters going down by one each time. 
Once
the list is emptied I get:

⋮
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end 1 
recursion states (462 with reply, 1 detached), 1 waiting replies, 1444 
recursion replies sent, 1 replies dropped, 0 states jostled out
[*snip* histogram]
0RDd mod0  b.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end 1 
recursion states (462 with reply, 0 detached), 0 waiting replies, 1444 
recursion replies sent, 1 replies dropped, 0 states jostled out
[*snip* histogram]
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: auth zone .: 
soa probe serial is 2018051800
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: auth_zone 
unchanged, new lease, wait
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: auth zone . 
timeout in 1800 seconds
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: close fd 15

 From this point onward, queries are rejected. Half an hour later, the 
auth-zone
lookup attempt repeats, and it all looks like above. 4 hours later, I 
get a
bunch of lines like this:

May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: comm point 
start listening 15
May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: Reading tcp 
query of length 17540
May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: comm point 
stop listening 15
May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: comm point 
start listening 15
⋮
May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: Reading tcp 
query of length 6523
May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: comm point 
stop listening 15
May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: xfr 
b.root-servers.net: last AXFR packet

Followed by a big list:

May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: apply_axfr 
at[0] . SOA
May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: apply_axfr 
at[1] . RRSIG
⋮
May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: apply_axfr 
at[3028] camera. DS
May 18 16:59:57 homebrew systemd-journald[199]: Suppressed 19535 
messages from unbound.service

Which looks like it got new records for the auth_zone? I'm not sure. It 
keeps
looping the auth_zone lookups every half hour still.

I added some logging to mesh_make_new_space(), which showed that no 
jostle is
attempted because mesh->jostle_first is a nullptr. I don't know if 
that's
normal.

Since it seems to coincide with the auth-zone lookups, I will try 
disabling it
to see if the issue resolves itself.



More information about the Unbound-users mailing list