Found at: gopher.blog.benjojo.co.uk:70/speed-of-bgp-network-propagation
The speed of BGP network propagation
===
At the end of March 2019 I gave a talk at the Annual General
Meeting of INEX (Ireland's biggest internet exchange point). I
was supposed to record it, but in a brief panic over HDMI not
working on my laptop I forgot to start the recording. Since
people found it interesting, I figured I would turn it into a
blog post instead:
Intro slide that is a rehash of a Fast and Furious poster
with JunOS placed on a car
I spent so much time on this opening slide that I feel
the need to include it even if I'm not doing an introduction
this time.
a long time ago... in a job a year back
So a long time ago... in a job fa- well a year back. We
were dealing with the lovely routing design of anycast.
Anycast
For those who need a quick primer, it's a routing design
that allows you to do more natural region-based load
balancing, letting you put server clusters in different regions
and serve traffic local to those regions without having to play
tricks with DNS:
Many nodes on the map
This works by having all participating nodes announce the
same IP prefixes globally, and with some careful routing tuning
(mainly careful selection of upstream transit/peering providers) you
can get good load balancing and latency results due to traffic
being served more local to the visitor's region.
Many nodes announcing the same IP prefix
Sadly, a lot of networks struggle to get consistent network
announcements to work right, often resulting in totally
backwards-from-logic routing:
Cat bowl based anycast routing
However, even for networks that do get most regions right,
regions like Asia are much harder to get to route correctly,
partly due to local ISPs either dealing with overloaded links or
having links that do not always follow logical geographic points.
Two cats eating from the same bowl, when one cat really
should be eating from the other
The crux of the problem is ensuring your routing
announcements are consistent over all regions and over almost all major
interconnection ISPs (Tier 1s).
Routing providers changing, but the AS_PATH staying the same
length
In simple setups, this really just means you need to keep
your AS_PATHs as close to identical as possible across all the
regions and carriers you want to have routing control over.
all countries announcing the same AS_PATH
This basically just means you should be attempting to use
the same providers and traffic engineering parameters in all
regions, AS_PATH prepending being one of the more basic ones.
one country announcing a shorter path than the rest, then
bursting into flames
However, as systems get larger and more complex, eventually
a mistake is going to be made. In the case of the job, a
configuration misunderstanding during maintenance of a router caused it to
drop a traffic engineering prepend. This caused a huge traffic
shift globally towards this router, which almost instantly
overloaded the site.
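To make the effect of a dropped prepend concrete, here is a minimal sketch. It assumes a heavily simplified best-path choice where the shortest AS_PATH wins and a made-up "distance" only breaks ties; real BGP selection has more steps than this. The ASNs come from the documentation range, and the site names and distances are hypothetical.

```python
from collections import Counter

ORIGIN_ASN = 64500   # documentation-range ASN, hypothetical
TRANSIT_ASN = 64496  # documentation-range ASN, hypothetical


def announce(prepends):
    """AS_PATH as seen past the transit: transit, then origin plus prepends."""
    return [TRANSIT_ASN] + [ORIGIN_ASN] * (1 + prepends)


def pick_site(paths, distance):
    """Shortest AS_PATH wins; distance only breaks ties between equal lengths."""
    return min(paths, key=lambda site: (len(paths[site]), distance[site]))


# Hypothetical viewer regions and their "distance" to each anycast site.
VIEWERS = {
    "eu":   {"lhr": 1,  "iad": 8,  "sin": 12},
    "us":   {"lhr": 8,  "iad": 1,  "sin": 14},
    "asia": {"lhr": 12, "iad": 14, "sin": 1},
}


def traffic_split(paths):
    """Count how many viewer regions land on each site."""
    return Counter(pick_site(paths, distance) for distance in VIEWERS.values())


# Every site prepends twice, so all paths have the same length.
healthy = {site: announce(2) for site in ("lhr", "iad", "sin")}
# One site loses its prepend during maintenance.
broken = dict(healthy, sin=announce(0))

print(traffic_split(healthy))
print(traffic_split(broken))
```

With equal path lengths each region stays on its nearest site. Once one site announces a shorter path, every region in this toy model moves to it, which is the overload described above.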
This was a regrettable incident, and it became clear that
while the traffic engineering prepend was useful in the past, at
that point in the network it was more of a liability than a
useful tool. So it was time to remove it.
But what if we were to make the same mistake again? This
time we are changing a lot more router configuration at once.
It's worth thinking about failure modes here. There are two ways
the change could fail:
one country announcing a longer AS_PATH than the rest,
causing traffic to move away
The first one being that the change applies to a large
percentage of the routers over the network, but some of them fail.
This would cause traffic to mostly shift away from those
locations and head to other nearby sites. As long as not too many
sites have this issue, this is the best way it can fail.
one country on fire due to announcing too short an
AS_PATH
The nastier way it can fail is that most routers *don't
end up being changed* but a small percentage of them do.
This will be a repeat of the first incident, however more
routers will be involved, and we would be dealing with a lot more
routers that would need a rescue configuration rollback or hotfix.
The story of them changing this in a sane way is not
mine to tell, however someone at the front of this change did
a talk at RIPE NCC's bi-annual event about how it was done:
[click here to go to that talk](https://ripe77.ripe.net/archives/video/2222/)
The good news is, the change went through fine, and no
router got left behind! A small amount of traffic churn happened
while routers globally had to update their routing tables and
inform the other internal routers they were connected to about
that.
During this time, it was observed that not all of the
providers accepted this change at the same time. Some providers
seemed to reconverge almost instantly, but others were noticeably
slow.

This raises the question: how long is this sort of thing
generally supposed to take? Are some providers better than others?
Who is the fastest? Who is the slowest?
However, for this we need to define what it means for a
route to be propagated. There are two valid ways (in my eyes)
this could be:
First Announcement wins diagram
The "First Announcement Wins" method is quite literally what
it says on the tin. When we see a BGP update message for
our prefix, that provider+location combo wins (or if they are
late, loses).
This could be slightly flawed, since some networks might
have hard to observe mechanisms for quickly sending routing
information inside their network, but those initial route updates
internally may not be sensible network paths.
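As a rough illustration of "First Announcement Wins", the sketch below keeps the earliest timestamp per provider+location combo and ranks the combos by it. The tuple layout, the ASNs (documentation range) and the timestamps are assumptions for the example, not the data from the talk.

```python
def first_announcement(updates):
    """Rank (provider ASN, location) combos by their earliest update.

    updates: iterable of (timestamp_seconds, provider_asn, location).
    Later updates for the same combo are ignored entirely.
    """
    first = {}
    for ts, asn, loc in updates:
        key = (asn, loc)
        if key not in first or ts < first[key]:
            first[key] = ts
    return sorted(first.items(), key=lambda item: item[1])


# Hypothetical updates, deliberately out of order.
updates = [
    (0.93, 64496, "dfw"),
    (0.59, 64496, "cdg"),
    (1.40, 64496, "cdg"),  # a later path change, ignored by this method
    (0.71, 64497, "cdg"),
]

ranking = first_announcement(updates)
for (asn, loc), ts in ranking:
    print(f"{ts:.3f}\tAS{asn}\t{loc}")
```

Note how the 1.40 second update is thrown away: this method says nothing about whether the first path seen was a sensible one.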
First Stable Announcement wins diagram
In "First Stable Announcement Wins", testing is done to
ensure that whatever route becomes "stable" (stops changing its
internal routing in the provider backbone) is declared the winner.
In my eyes this is what most network engineers are looking
for, however it also has a large issue attached to it:
Internals of networks are unknown and often magic
Figuring out what counts as a stable route involves a
non-trivial amount of complexity, and no matter which way I do it,
I do not think it is measurable to within tens of milliseconds.
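One way to see the difficulty is to try writing a stability rule down. The sketch below uses a hypothetical heuristic: a path is "stable" once it has gone unchanged for a quiet window. It only sees the AS_PATH visible from outside, not the provider's internal routing, and the window length is an arbitrary assumption.

```python
def first_stable(events, quiet_window, end_of_capture):
    """Return (timestamp, path) of the first path left unchanged for
    at least quiet_window seconds, or None if nothing settles.

    events: time-sorted list of (timestamp_seconds, as_path_tuple)
    seen by one collector. Repeated updates with the same path are
    not counted as changes.
    """
    changes = []
    for ts, path in events:
        if not changes or changes[-1][1] != path:
            changes.append((ts, path))
    for i, (ts, path) in enumerate(changes):
        if i + 1 < len(changes):
            next_change = changes[i + 1][0]
        else:
            next_change = end_of_capture
        if next_change - ts >= quiet_window:
            return ts, path
    return None


# Hypothetical path churn for one collector (documentation ASNs).
SHORT = (64496, 64500)
LONG = (64496, 64497, 64500)
events = [(0.6, SHORT), (0.9, LONG), (1.0, LONG), (1.4, SHORT)]

print(first_stable(events, quiet_window=5.0, end_of_capture=30.0))
print(first_stable(events, quiet_window=0.4, end_of_capture=30.0))
```

With a 5 second window the winner is the path seen at 1.4 seconds. With a 0.4 second window it is a different path, seen at 0.9 seconds. The answer depends on a number somebody has to pick.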
For this reason, the experiment we are doing is using
"First Announcement Wins."
The propagation race works like so:
slide containing the setup of the bgp race
The high precision timestamps are important here, and a
detail that actually ended up being slightly devastating for the
first few runs due to the inaccuracy of system clocks.
Juniper supervisor card with the 10 MHz and PPS ports
highlighted
You see, I now have a stronger respect (in that I now
actually believe they have worth) for the PPS and 10 MHz clock
inputs on a lot of high-end carrier routers, since time syncing
is actually incredibly hard once you go above 2 systems.
Locking all systems to a stable clock source is immensely
nice, and before you ask, NTP does not really get that close in
real-life situations with a wide range of geographically
separated targets.
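A short sketch of one reason NTP struggles here: the standard NTP offset calculation assumes the network delay is the same in both directions. The simulation below is a toy model with made-up delays and a made-up true offset. It shows the estimate being wrong by half of the path asymmetry, which is easy to hit on long-haul routes.

```python
def ntp_offset(t0, t1, t2, t3):
    """Standard NTP clock offset estimate from the four timestamps:
    t0 client send, t1 server receive, t2 server send, t3 client receive.
    """
    return ((t1 - t0) + (t2 - t3)) / 2


def simulate(true_offset, delay_out, delay_back, t0=100.0, processing=0.001):
    """Build the four timestamps for a server clock that is true_offset
    seconds ahead of the client, then return what NTP would estimate."""
    t1 = t0 + delay_out + true_offset       # stamped by the server clock
    t2 = t1 + processing                    # stamped by the server clock
    t3 = (t2 - true_offset) + delay_back    # stamped by the client clock
    return ntp_offset(t0, t1, t2, t3)


# Hypothetical numbers: the server is 250 ms ahead of the client.
symmetric = simulate(0.250, delay_out=0.040, delay_back=0.040)
asymmetric = simulate(0.250, delay_out=0.070, delay_back=0.010)

print(f"symmetric path estimate:  {symmetric:.3f} s")
print(f"asymmetric path estimate: {asymmetric:.3f} s")
```

In this model the asymmetric case is off by 30 ms, which is already larger than the gaps between many of the announcements being raced.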
After a lot of time syncing and timestamp offset
correction, I ended up with a linear list of announcements by server
location (airport codes to signify where they are, since that's
generally what the networking industry seems to use)
a listing of all announcements in tsv format
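The offset correction step can be sketched roughly like this. It assumes each collector's clock error has already been measured somehow. The host names, offsets and timestamps are hypothetical, and the real tooling used for the talk may well have worked differently.

```python
def to_timeline(observations, clock_offsets, origin_time):
    """Turn per-host local timestamps into one list ordered by
    seconds since the origin announcement.

    observations: (host, local_timestamp, location) tuples.
    clock_offsets: seconds each host's clock runs ahead of the
    reference clock (negative if it runs behind).
    """
    rows = []
    for host, local_ts, location in observations:
        corrected = local_ts - clock_offsets[host]
        rows.append((round(corrected - origin_time, 3), location))
    return sorted(rows)


def as_tsv(rows):
    """Render the timeline as tab-separated lines."""
    return "\n".join(f"{t:.3f}\t{location}" for t, location in rows)


# Hypothetical collectors with badly skewed clocks.
clock_offsets = {"collector-a": 0.400, "collector-b": -0.100}
observations = [
    ("collector-b", 1000.700, "dfw"),
    ("collector-a", 1000.950, "cdg"),
]

rows = to_timeline(observations, clock_offsets, origin_time=1000.0)
print(as_tsv(rows))
```

Going by the raw timestamps, dfw appears to see the route first. After correction cdg is ahead, which is why the clock work mattered so much for the first few runs.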
For AS6453 (Tata Communications) times looked decent to
start with. Given that none of the BGP route update collector
nodes had Tata as a direct provider, this is basically racing
how fast Tata's peers can send routes around.
It's interesting that it seems to have a 500 ms-ish
minimum, but after that routes start to move around the globe very
fast, with the exception of EWR (New York area), which is likely
an outlier.
Time (s) | Location | ASN | Direct
0        | Origin   |     |
0.523    |          |     |
0.586    | cdg      |     |
0.593    |          |     |
0.618    |          |     |
0.725    |          |     |
0.818    |          |     |
0.831    |          |     |
0.84     |          |     |
0.845    |          |     |
0.903    |          |     |
0.926    | dfw      |     |
0.994    | ord      |     |
1.021    |          |     |
1.066    |          |     |
1.521    |          |     |
1.712    | syd      |     |
21.477   | dfw      |     |
193.425  |          |     |
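To put numbers on the "500 ms-ish minimum" and the outlier, here is a small sketch over the Tata times listed above. The rule for calling something an outlier (more than 10 times the median) is an arbitrary assumption for illustration, not something from the talk.

```python
from statistics import median

# First-seen times in seconds for AS6453, taken from the table above
# (the origin row at 0 is left out).
tata_first_seen = [
    0.523, 0.586, 0.593, 0.618, 0.725, 0.818, 0.831, 0.84, 0.845,
    0.903, 0.926, 0.994, 1.021, 1.066, 1.521, 1.712, 21.477, 193.425,
]


def summarise(times, outlier_factor=10):
    """Fastest time, median, and anything slower than
    outlier_factor times the median (an arbitrary threshold)."""
    mid = median(times)
    return {
        "fastest": min(times),
        "median": mid,
        "outliers": [t for t in times if t > outlier_factor * mid],
    }


summary = summarise(tata_first_seen)
print(summary)
```

The median sits under a second, so both the 21.477 and 193.425 second rows stand out clearly from the rest of the list.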
For AS174 (Cogent Communications) propagation seems to take
a little longer. Due to policy on the upstream ISP used for
route collection, Cogent routes were only imported from other
carriers, so there is a similar effect to Tata here. However, it
is odd that Toronto (YYZ) sees the route first, since the
announcement is made in London (LHR).
This is likely the impact of a route reflector or
something similar inside the network.
Time (s) | Location | ASN | Direct
         | Origin   | 174 |
         |          | 174 |
         |          | 174 |
         |          | 174 |
         |          | 174 |
         |          | 174 |
         | syd      | 174 |
         | dfw      | 174 |
         |          | 174 |
         |          | 174 |
         |          | 174 |
         |          | 174 |
         |          | 174 |
         | cdg      | 174 |
         |          | 174 |
         |          | 174 |
         |          | 174 |
         | ord      | 174 |
         | dfw      | 174 |
For AS3257 (GTT) we are finally seeing some timing data
that is based on providers we are locally connected to. GTT
does seem to send things around the world reasonably fast, at a
shiny 1.9 seconds (apart from EWR, which supports the idea that
this is a data point error rather than anything else).
Time (s) | Location | ASN  | Direct
         | Origin   | 3257 |
         |          | 3257 |
         |          | 3257 | Yes
         | dfw      | 3257 | Yes
         |          | 3257 | Yes
         |          | 3257 | Yes
         |          | 3257 | Yes
         |          | 3257 |
         |          | 3257 |
         | cdg      | 3257 | Yes
         |          | 3257 |
         |          | 3257 |
         |          | 3257 |
         | syd      | 3257 |
         | ord      | 3257 | Yes
         |          | 3257 |
         |          | 3257 | Yes
         |          | 3257 |
         |          | 3257 |
AS1299 (Telia) has a more logical timing: 0.6 seconds after
we announce in London the route appears in Paris and Frankfurt
directly, and it is fully propagated to all nodes less than 2
seconds after that. However, other carriers beat Telia to their own
route! If you look at ORD (Chicago) and MIA (Miami) you can see
other carriers pick up the route from Telia at another location
and hand it over to our provider, before it arrives as a direct
route 20 seconds later.
Time (s) | Location | ASN  | Direct
         | Origin   | 1299 |
         | cdg      | 1299 | Yes
         |          | 1299 | Yes
         |          | 1299 |
         |          | 1299 | Yes
         |          | 1299 |
         |          | 1299 |
         |          | 1299 |
         |          | 1299 |
         | dfw      | 1299 | Yes
         |          | 1299 |
         |          | 1299 | Yes
         |          | 1299 | Yes
         |          | 1299 | Yes
         | ord      | 1299 |
         |          | 1299 |
         |          | 1299 |
         |          | 1299 |
         | syd      | 1299 |
         |          | 1299 | Yes
         | ord      | 1299 | Yes
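Spotting a carrier being beaten to its own route can be sketched as a comparison of the first indirect sighting against the first direct one at each location. The timestamps below are hypothetical and only loosely shaped like the Telia description above, and the direct flag is assumed to be known for each observation.

```python
def beaten_to_own_route(observations):
    """Find locations where the route arrived via another carrier
    before it arrived directly from the provider under test.

    observations: (timestamp_seconds, location, is_direct) tuples.
    Returns {location: seconds by which the indirect path was ahead}.
    """
    first_direct, first_indirect = {}, {}
    for ts, location, is_direct in observations:
        bucket = first_direct if is_direct else first_indirect
        bucket[location] = min(ts, bucket.get(location, ts))
    return {
        location: round(first_direct[location] - first_indirect[location], 3)
        for location in first_direct
        if location in first_indirect
        and first_indirect[location] < first_direct[location]
    }


# Hypothetical observations, not the measured values.
observations = [
    (0.6, "cdg", True),
    (0.9, "cdg", False),
    (1.8, "ord", False),
    (21.5, "ord", True),
]

print(beaten_to_own_route(observations))
```

In this made-up data only ord is reported, with the indirect path ahead by 19.7 seconds. At cdg the direct route arrived first, so it is left out.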
Level 3 (AS3356) does by far the worst in this test,
taking 18 seconds from the test prefix being announced to it until
the route appears anywhere on the internet. It appears in SEA
(Seattle) of all places, and from there other carriers pick up
that route and propagate it faster than Level 3. Some 30
seconds later Level 3 has caught up and the route is seen
in all places with Level 3 peering.
Time (s) | Location | ASN  | Direct
         | Origin   | 3356 |
         |          | 3356 | Yes
         |          | 3356 | Yes
         |          | 3356 |
         | cdg      | 3356 |
         |          | 3356 |
         |          | 3356 |
         |          | 3356 |
         |          | 3356 |
         |          | 3356 |
         | dfw      | 3356 |
         |          | 3356 |
         |          | 3356 |
         |          | 3356 |
         | syd      | 3356 |
         |          | 3356 | Yes
         |          | 3356 | Yes
         | ord      | 3356 |
         |          | 3356 | Yes
         |          | 3356 | Yes
         | dfw      | 3356 | Yes
         | ord      | 3356 |
         |          | 3356 | Yes
         | ord      | 3356 | Yes
         |          | 3356 | Yes
         | cdg      | 3356 | Yes
Last but not least is AS2914 (NTT Communications). While
not the fastest at sending routes globally, they did appear to
be the smoothest and most consistent.
Time (s) | Location | ASN | Direct
         | Origin   |     |
         |          |     | Yes
         |          |     | Yes
         |          |     | Yes
         | cdg      |     | Yes
         |          |     |
         |          |     | Yes
         |          |     | Yes
         |          |     |
         | dfw      |     |
         |          |     |
         |          |     |
         |          |     | Yes
         | ord      |     |
         |          |     |
         |          |     | Yes
         |          |     | Yes
         |          |     | Yes
         | syd      |     | Yes
         | dfw      |     | Yes
         | ord      |     | Yes
         |          |     | Yes
Now that we have done all the carriers you may think that
is it, however there is a different kind of propagation we
can observe:
the dying breaths of a route
Just as we can race the networks in sending out BGP routes,
we can also race them in withdrawing them!
This is a test that is harder to see on the routing
table itself, so it's easier (and much more fun) to observe it
by simply doing a traceroute to a prefix and then withdrawing
it from all providers:
a gif showing a route slowly draining out of the routing
table of many carriers
Here you can see the route slowly being released out of
all of the carriers, then the carrier backbones, and then
the carrier inter-peering relationships. It also exposes some
interestingly strange routing as the options to route a prefix
begin to run out!
the ending/questions slide
Anyway, as I said to the audience, we have had the fast
bit, now we can have the furious part! If you generally like
this kind of post, I aim to post once a month on various
(mostly networking related) matters. If you want to stay up to
date with that, you can either use my blog's RSS or
follow me on Twitter for updates when the next post happens.
I would like to thank AS57782 / Cynthia Revstrom for
lending some IPv4 space for this post, and helping out on the
traceroute demo you see above.
If you do have any questions about this talk, please feel
free to reach out on the email that is on the slide above!
Until next time!