A study of our DNS full-resolvers

Matsuzaki ‘maz’ Yoshinobu
<maz@iij.ad.jp>

Topics for today
• Lessons learned from an outage of our full-resolvers
• An interesting behavior of clients

Users and DNS full-resolver
full-resolver
(cache nameserver)
Most users use their ISP’s full-resolvers, because that information is provided automatically

DNS cache nameserver
• Usually ISPs provide 2 nameservers for customers
• Just in case
• Our assumptions here:
• Even if a single server fails, the other server can handle DNS queries
• Users somehow automatically pick a usable one
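The second assumption can be illustrated with a minimal sketch of how a client might fall back to the other nameserver. This is an assumption about typical client behavior, not something taken from the measurements in this talk. The names `resolve_with_failover` and `send_query`, the retry count, and the addresses are all hypothetical and exist only for illustration.

```python
from typing import Callable, Optional, Sequence


def resolve_with_failover(
    qname: str,
    servers: Sequence[str],
    send_query: Callable[[str, str], str],
    attempts: int = 2,
) -> Optional[str]:
    """Try each configured nameserver in order; return the first answer.

    send_query(server, qname) is a hypothetical transport function that
    returns an answer or raises TimeoutError when the server stays silent.
    """
    for _ in range(attempts):
        for server in servers:
            try:
                return send_query(server, qname)
            except TimeoutError:
                continue  # this server did not answer; try the next one
    return None  # every server timed out on every attempt


def fake_send_query(server: str, qname: str) -> str:
    # Simulated network: the first server is down, the second one answers.
    if server == "192.0.2.1":
        raise TimeoutError(server)
    return f"{qname} answered by {server}"


answer = resolve_with_failover(
    "www.example.com", ["192.0.2.1", "192.0.2.2"], fake_send_query
)
print(answer)
```

In this sketch the user still gets an answer, but only after waiting for the first server to time out, which is one way a single failure could show up as delay rather than as an outage.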

In 2009, we had trouble
• Trouble on cache nameservers for consumers
• April 2009
• On two (all) nameservers
• 1st failure happened on a server (ns01)
• Then 2nd failure happened on another server (ns11)
• About 12min blackout

Failures on our cache nameservers
• ns01: 17:14:26 - 17:48:07 (33min41sec)
• ns11: 17:35:51 - 17:48:52 (13min01sec)
• While both servers were in trouble (12min16sec), our users couldn’t resolve hostnames
• During this trouble, the servers couldn’t answer 14,005,644 DNS queries
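The length of the blackout follows from the two failure windows: it is the time during which both servers were down at once. A minimal sketch of that calculation, using the timestamps stated above (the helper names `parse` and `overlap` are illustrative):

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"


def parse(t: str) -> datetime:
    # Only the time of day matters here; the date part is a placeholder.
    return datetime.strptime(t, FMT)


# Failure windows as stated on the slide: (start, end).
ns01 = (parse("17:14:26"), parse("17:48:07"))
ns11 = (parse("17:35:51"), parse("17:48:52"))


def overlap(a, b) -> timedelta:
    """Length of time during which both intervals are active."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(end - start, timedelta(0))


blackout = overlap(ns01, ns11)
print(blackout)  # 0:12:16
```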

The query graph

Before failure
• Clients prefer to use ns01
• Order of configuration?
• Clients also sent DNS queries to the other server (ns11)
• Measuring delays?
• Just in case?

During single failure
• DNS queries to ns01 were discarded during this period
• It seems users could still resolve hostnames because ns11 was alive
• No strange traffic pattern here
• Users might have felt some delays
• A slightly higher query rate in the first 3min, and then a ‘stable’ state

During single failure (cont.)
• The combined query rate (ns01+ns11) looks almost the same as before
• Even though ns01 was discarding queries during this period
• Probably most clients usually send DNS queries to both nameservers

During double failure (outage)
• Users couldn’t resolve hostnames at all during this period
• Query rate suddenly increased on both nameservers
• Those were all discarded though
• Mostly because of ‘retries’
• We observed multiple queries that had the same QNAME
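One way to spot such retries in a query log is to count how often the same client asks for the same QNAME. The sketch below assumes a log already reduced to (client, QNAME) pairs; that format, the function name `repeated_queries`, and the sample entries are invented for illustration and are not the data from this incident.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def repeated_queries(
    log: Iterable[Tuple[str, str]], threshold: int = 2
) -> Dict[Tuple[str, str], int]:
    """Return {(client, qname): count} for pairs seen at least `threshold` times."""
    counts = Counter(log)
    return {key: n for key, n in counts.items() if n >= threshold}


# Invented sample: (client address, QNAME) pairs, not real measurement data.
sample_log = [
    ("198.51.100.7", "www.example.com."),
    ("198.51.100.7", "www.example.com."),
    ("198.51.100.7", "www.example.com."),
    ("198.51.100.9", "mail.example.net."),
]

retries = repeated_queries(sample_log)
print(retries)
```

A real analysis would also restrict the count to a short time window, since a client legitimately repeats a query once its cached answer expires.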

Restoration
• Once ns01 was restored, it got about 7 times more queries than usual for several seconds
• Web pages are composed of many “modules”
• A single web page sometimes triggers several or more DNS queries
• Browsers’ prefetch function
• 12min was long enough for client-side DNS caches to be flushed

Restoration (cont.)
• Then ns11 was also restored
• It also got a higher rate of queries for several seconds
• The query rate gradually returned to the same ‘normal’ state as before

Lessons learned
• A single server failure will not cause a disaster when users configure multiple DNS cache servers on their devices
• The impact is probably negligible
• During double failure (full outage), nameservers got more queries
• Once a server is restored from a full outage, it gets a higher query rate for a while
• In our case, 7 times more than usual for several seconds

Redundancy is important
• DNS resolution keeps working as long as one server is functional in each part of the DNS
• Full-resolvers (caching nameservers)
• Authoritative nameservers
• Do have redundancy, avoid outages
• A multiple-server deployment works well
• IP anycast would also be useful
• my bdnog7 talk - https://www.slideshare.net/bdnog/ip-anycasting
• In an outage, we should expect a large number of queries during and just after the outage
• A warning for those who have a security device in front of their nameservers
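The warning about security devices can be made concrete with a back-of-the-envelope check: does a device in front of the nameservers still have headroom when the query rate jumps after an outage? In the sketch below only the 7x factor comes from the incident described earlier. The baseline rate, the device limit, and the function name `burst_headroom` are hypothetical values chosen for illustration.

```python
def burst_headroom(
    baseline_qps: float, burst_factor: float, device_limit_qps: float
) -> float:
    """Return spare capacity (qps) during the burst; negative means drops."""
    return device_limit_qps - baseline_qps * burst_factor


# Hypothetical numbers; only the 7x burst factor comes from the incident.
BASELINE_QPS = 10_000
DEVICE_LIMIT_QPS = 50_000
BURST_FACTOR = 7

headroom = burst_headroom(BASELINE_QPS, BURST_FACTOR, DEVICE_LIMIT_QPS)
print(headroom)  # -20000: this device would drop queries during the burst
```

A limit that looks generous against the usual 5-minute average can still be too low for a burst that lasts only a few seconds.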

Users and DNS full-resolver
full-resolver
(cache nameserver)
Many devices
on a home network

Usual graph - 5min average
measurement date: 2018/05/16

A bit different view - 1sec average
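To see why a 1-second view can reveal what a 5-minute average hides, here is a small sketch with synthetic data. The numbers are invented and are not the measurements shown in the graphs; `bucket_average` is an illustrative helper.

```python
def bucket_average(per_second, bucket_size):
    """Average per-second counts over consecutive buckets of bucket_size seconds."""
    averages = []
    for i in range(0, len(per_second), bucket_size):
        bucket = per_second[i:i + bucket_size]
        averages.append(sum(bucket) / len(bucket))
    return averages


# Synthetic data: a flat 1,000 qps for 10 minutes with one 1-second spike.
per_second = [1000] * 600
per_second[300] = 5000

five_min = bucket_average(per_second, 300)
print(max(per_second))  # 5000: the spike is obvious at 1-second resolution
print(max(five_min))    # only slightly above 1000 once averaged over 5 minutes
```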

Peaks on the hour
• Minor peaks on the hour and the half hour (like 07:30)
• An “alarm clock” wakes up the phone itself
• Some applications are also started by the wakeup
• I guess those mostly come from smartphones
• The QNAMEs also hint at this

A spike at 15sec before the hour

We have something different now

Summary
• It’s reasonable for us to provide 2 full-resolvers (caching nameservers) to customers
• Clients seem to have the ability to use a functional one
• In an outage, we should expect a large number of queries during and just after the outage
• A warning to those who have a security device in front of their nameservers
• Clients are synced up unintentionally
• ‘Alarm clock’ or scheduled tasks
• This particular case is not an issue at this moment, but it’s worth paying attention to those behaviors
