[Unbound-users] Unbound 1.4.1 cache problems

Sat Jan 9 09:24:20 UTC 2010

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Jakub,

I cannot know for sure since the log seems to be shortened by syslog,
but I think you may have the same problem as Haw Loeung reported.  I
have since corresponded with him and I think have resolved the issue.  A
patch is in svn for unbound.

The issue seems to be caused by the parent-child disagreement fix in
1.4.1, worsened by no IPv6 available (either not in the kernel or
do-ip6:no statement) - just turning ipv6 back on (also if you have no
ipv6 connectivity) may stop this bug.

Sorry for the inconvenience,
   Wouter

On 01/08/2010 10:36 PM, Jakub Heichman wrote:
> Hello there,
> 
> I have upgraded from Unbound 1.3.0 to 1.4.1 recently on a couple of
> servers. The servers are doing about 3-7k requests per second.
> I have noticed that after some time of operation, servers
> intermittently stop responding to queries to certain domains from
> cache (return SERVFAIL). It particularly doesn't like webex.com,
> sometimes vmware.com, dell.com.
> 
> # unbound-host webex.com
> webex.com has address 64.68.120.99
> webex.com mail is handled by 50 mx3.webex.com.
> webex.com mail is handled by 10 mx1.webex2.iphmx.com.
> webex.com mail is handled by 10 mx2.webex2.iphmx.com.
> webex.com mail is handled by 50 mx2.webex.com.
> 
> But the A record doesnt seem to get cached:
> 
> # dig webex.com @localhost
> 
> ; <<>> DiG 9.3.3rc2 <<>> webex.com @localhost
> ; (1 server found)
> ;; global options:  printcmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49004
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
> 
> ;; QUESTION SECTION:
> ;webex.com.			IN	A
> 
> ;; Query time: 0 msec
> ;; SERVER: 127.0.0.1#53(127.0.0.1)
> ;; WHEN: Fri Jan  8 19:16:01 2010
> ;; MSG SIZE  rcvd: 27
> 
> # unbound-control dump_cache | grep ^webex.com
> webex.com.	4589	IN	NS	ns01.webex.com.
> webex.com.	4589	IN	NS	ns05.webex.com.
> webex.com.	4589	IN	NS	ns07.webex.com.
> webex.com.	4589	IN	NS	ns03.webex.com.
> webex.com.	4589	IN	NS	ns08.webex.com.
> webex.com.	4589	IN	NS	ns06.webex.com.
> 
> It doesnt respond to NS queries either (No A records for
> nsxx.webex.com in cache)
> 
> # dig ns webex.com @localhost
> 
> ; <<>> DiG 9.3.3rc2 <<>> ns webex.com @localhost
> ; (1 server found)
> ;; global options:  printcmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 54946
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
> 
> ;; QUESTION SECTION:
> ;webex.com.			IN	NS
> 
> ;; Query time: 0 msec
> ;; SERVER: 127.0.0.1#53(127.0.0.1)
> ;; WHEN: Fri Jan  8 20:53:18 2010
> ;; MSG SIZE  rcvd: 27
> 
> 
> 
> After flushing webex.com I can query the cache again:
> 
> # unbound-control flush webex.com
> ok
> # dig +short webex.com @localhost
> 64.68.120.99
> 
> Log with increased verbosity:
> 
> an  8 21:05:32 cns01j unbound: [25052:3] info: validator operate:
> query <webex.com. A IN>
> Jan  8 21:05:32 cns01j unbound: [25052:3] info: resolving <webex.com. A IN>
> Jan  8 21:05:32 cns01j unbound: [25052:3] info:
> DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6
> avail)
> Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns08.webex.com.  A
> Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns05.webex.com.  A
> Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns01.webex.com.  A
> Jan  8 21:05:32 cns01j unbound: [25052:3] info:   ns06.webex.com.  A
> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns07.webex.com.  A
> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns03.webex.com.  A
> Jan  8 21:05:32 cns unbound: [25052:3] info: resolving (init part 2):
> <webex.com. A IN>
> Jan  8 21:05:32 cns unbound: [25052:3] info: resolving (init part 3):
> <webex.com. A IN>
> Jan  8 21:05:32 cns unbound: [25052:3] info: processQueryTargets:
> <webex.com. A IN>
> Jan  8 21:05:32 cns unbound: [25052:3] info:
> DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6
> avail)
> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns08.webex.com.  A
> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns05.webex.com.  A
> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns01.webex.com.  A
> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns06.webex.com.  A
> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns07.webex.com.  A
> Jan  8 21:05:32 cns unbound: [25052:3] info:   ns03.webex.com.  A
> Jan  8 21:05:32 cns unbound: [25052:3] info: validator operate: query
> <webex.com. A IN>
> Jan  8 21:05:32 cns unbound: [25052:3] info: mesh_run: end 8 recursion
> states (8 with reply, 0 detached), 28 waiting replies, 84 recursion
> replies sent, 0 replies dropped, 0 states jostled out
> Jan  8 21:05:32 cns unbound: [25052:3] info: average recursion
> processing time 4.172319 sec
> Jan  8 21:05:32 cns unbound: [25052:3] info: histogram of recursion
> processing times
> Jan  8 21:05:32 cns unbound: [25052:3] info: [25%]=0.00107789
> median[50%]=0.00744727 [75%]=0.0524288
> Jan  8 21:05:32 cns unbound: [25052:3] info: lower(secs) upper(secs) recursions
> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.000000    0.000001 9
> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.000256    0.000512 1
> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.001024    0.002048 1
> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.002048    0.004096 19
> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.004096    0.008192 2
> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.008192    0.016384 11
> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.016384    0.032768 3
> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.032768    0.065536 5
> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.065536    0.131072 15
> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.131072    0.262144 10
> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.262144    0.524288 5
> Jan  8 21:05:32 cns unbound: [25052:3] info:    0.524288    1.000000 2
> Jan  8 21:05:32 cns unbound: [25052:3] info:  256.000000  512.000000 1
> 
> 
> 
> I have downgraded one of the servers back to 1.3.0 (libev 3.53), where
> I don't see this problem.
> 
> Unbound 1.4.1 compiled with libev-3.9, without threads - 4 processes running
> 
> num-threads: 4
> outgoing-range: 10000
> msg-cache-size: 500m
> msg-cache-slabs: 1
> num-queries-per-thread: 10000
> jostle-timeout: 4000
> so-rcvbuf: 8m
> rrset-cache-size: 1024m
> rrset-cache-slabs: 1
> cache-max-ttl: 86400
> 
> Any help would be greatly appreciated :-)
> 
> Thanks,
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/

iEYEARECAAYFAktIS0QACgkQkDLqNwOhpPjQ2ACfeCESH6t5ezR94AnasM6DP16f
EsIAoJbV90YIEyovcdz2VWTRmEk4/tMY
=Zfvf
-----END PGP SIGNATURE-----