Found at: gopher.blog.benjojo.co.uk:70/speedtest-congestion-feedback-loop
Stressing the network when it's already down
===
A rough demo of what it feels like when your customers
make more load as a result of you being overloaded
Impact by Eric Wienke, edits by Ben Cartwright-Cox ;)/ CC BY-NC-SA
A few days ago, shortly after 17:16 BST, a handful of
networks owned by Liberty Global (though the biggest impact was to
their UK network, known as Virgin Media) started having issues
reaching the rest of the internet. The exact official cause is
**currently only rumor**. However, the outage had an
interesting pattern attached to it: their network would fail
almost every hour at 17 minutes past (17:17 / 18:17 / 19:17 /
21:17 / 23:17 / 00:17). This produced an interesting graph of
systems going offline, generated from a system I use to monitor my
network's reachability to the rest of the internet:
Grafana graph showing dips in reachability
Other providers have also posted their point of view of
the outages, but with bandwidth drops being shown instead:
Twitter screenshot: Yesterday doesn't look like it was much
fun for network, operations and support folks at Virgin
Broadband.
One interesting thing about the outages is that they all
started at the same minute past the hour, and took a similar
amount of time to resolve.
A possible cause might be a destructive cron job firing,
since on Debian-based systems the hourly cron folder runs at exactly
17 minutes past the hour by default:
```
[20:38:43] ben@metropolis:~$ grep hourly /etc/crontab
17 *	* * *	root    cd / && run-parts --report /etc/cron.hourly
```
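As a rough sanity check of that hunch, the outage times listed earlier can be compared against the minute field of the hourly cron entry. This is only a sketch: the times are the ones quoted in this post, and the names `OUTAGE_TIMES`, `CRON_HOURLY_MINUTE` and both helper functions are made up for illustration. A match shows a coincidence in timing, not a cause.

```python
from datetime import datetime

# Outage start times as listed in the post (BST, minute resolution).
OUTAGE_TIMES = ["17:17", "18:17", "19:17", "21:17", "23:17", "00:17"]

# Assumption: the minute at which the hourly cron folder runs,
# as stated in the text (17 minutes past the hour).
CRON_HOURLY_MINUTE = 17


def minutes_past_hour(times):
    """Return the set of minute-of-hour values seen in HH:MM strings."""
    return {datetime.strptime(t, "%H:%M").minute for t in times}


def matches_cron_minute(times, cron_minute=CRON_HOURLY_MINUTE):
    """True when every event falls on the same minute as the cron entry."""
    return minutes_past_hour(times) == {cron_minute}


if __name__ == "__main__":
    print(minutes_past_hour(OUTAGE_TIMES))
    print(matches_cron_minute(OUTAGE_TIMES))
```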
I guess we will have to wait for a Reason For Outage
(RFO) report from them to know for sure.
---
Meanwhile, during the outage, people in backchannels
noticed traffic to Virgin Media picking up, which led to
speculation that the outage was initially driven by attack traffic.
On deeper inspection, however, the traffic turned out to involve
their networks' Speedtest.net servers! These graphs, generated from
data given to me by Jump Networks, show speedtest server traffic
going to/from Virgin Media with the following profile:
Jump net graph
It would appear that every time that Virgin Media dropped
off, people en masse flocked to speedtest services to confirm
that their internet connection was having problems.
If you go down to the NetFlow level, you can even see
very clearly when services were restored for customers:
Netflow connection graph
Time | SpeedTest.net flows |
---|---|
2020-04-27T16:16:55 | 2 |
2020-04-27T16:16:56 | 5 |
2020-04-27T16:16:57 | 2 |
2020-04-27T16:16:58 | 1 |
2020-04-27T16:17:00 | 2 |
2020-04-27T16:17:01 | 0 |
2020-04-27T16:17:02 | 0 |
… | No traffic continues |
2020-04-27T16:20:16 | 0 |
2020-04-27T16:20:17 | 47 |
2020-04-27T16:20:18 | 62 |
2020-04-27T16:20:19 | 46 |
2020-04-27T16:20:20 | 70 |
2020-04-27T16:20:21 | 137 |
2020-04-27T16:20:22 | 111 |
2020-04-27T16:20:23 | 99 |
2020-04-27T16:20:24 | 73 |
2020-04-27T16:20:25 | 79 |
2020-04-27T16:20:26 | 122 |
2020-04-27T16:20:27 | 78 |
2020-04-27T16:20:28 | 115 |
2020-04-27T16:20:29 | 82 |
2020-04-27T16:20:30 | 56 |
2020-04-27T16:20:31 | 122 |
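
The moment of restoration can also be picked out of those per-second counts programmatically. The sketch below uses only the rows shown in the table (the elided "…" rows are left out rather than guessed), and the function names are hypothetical. It looks for the first non-zero sample that follows a zero sample, then compares the average flow rate before the drop with the rate after recovery.

```python
from datetime import datetime

# Per-second Speedtest.net flow counts copied from the table above.
# The elided rows between 16:17:02 and 16:20:16 are not filled in.
FLOWS = [
    ("2020-04-27T16:16:55", 2), ("2020-04-27T16:16:56", 5),
    ("2020-04-27T16:16:57", 2), ("2020-04-27T16:16:58", 1),
    ("2020-04-27T16:17:00", 2), ("2020-04-27T16:17:01", 0),
    ("2020-04-27T16:17:02", 0), ("2020-04-27T16:20:16", 0),
    ("2020-04-27T16:20:17", 47), ("2020-04-27T16:20:18", 62),
    ("2020-04-27T16:20:19", 46), ("2020-04-27T16:20:20", 70),
    ("2020-04-27T16:20:21", 137), ("2020-04-27T16:20:22", 111),
    ("2020-04-27T16:20:23", 99), ("2020-04-27T16:20:24", 73),
    ("2020-04-27T16:20:25", 79), ("2020-04-27T16:20:26", 122),
    ("2020-04-27T16:20:27", 78), ("2020-04-27T16:20:28", 115),
    ("2020-04-27T16:20:29", 82), ("2020-04-27T16:20:30", 56),
    ("2020-04-27T16:20:31", 122),
]


def first_recovery(samples):
    """Timestamp of the first non-zero sample right after a zero sample."""
    previous = None
    for ts, count in samples:
        if previous == 0 and count > 0:
            return ts
        previous = count
    return None


def gap_seconds(samples, recovery_ts):
    """Seconds between the first zero sample and the recovery sample."""
    first_zero = next(ts for ts, count in samples if count == 0)
    delta = datetime.fromisoformat(recovery_ts) - datetime.fromisoformat(first_zero)
    return int(delta.total_seconds())


def mean_before_after(samples, recovery_ts):
    """Mean flows/second before the drop (non-zero rows) and after recovery."""
    # ISO 8601 timestamps in the same format compare correctly as strings.
    before = [c for ts, c in samples if ts < recovery_ts and c > 0]
    after = [c for ts, c in samples if ts >= recovery_ts]
    return sum(before) / len(before), sum(after) / len(after)


if __name__ == "__main__":
    recovery = first_recovery(FLOWS)
    print(recovery, gap_seconds(FLOWS, recovery))
    print(mean_before_after(FLOWS, recovery))
```

On these rows the post-recovery average is well over an order of magnitude above the pre-drop average, which is the stampede visible in the graph.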
This kind of collective behaviour is fascinating to me. It
also presents an interesting customer-driven positive feedback
loop for networks that might be having temporary congestion
problems: people who are verifying that the network is
congested are themselves adding more congestion to the network.
Or, put in the form of a drawing:
The speed test feedback loop
This is not to say that the feedback loop caused
the Virgin Media outages: the speedtest spikes started after
connectivity was restored, not during the outage window.
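For a network that is merely congested rather than down, the loop described above can be sketched as a toy model. Everything here is an assumption for illustration: the capacity, the demand series, the number of users, the fraction who run a speed test once they notice congestion, and the load each test adds are all invented numbers in arbitrary units, and real traffic dynamics are far messier.

```python
def simulate(capacity, demand, users, test_rate, test_load):
    """Toy model of the speed test feedback loop.

    capacity  -- link capacity (hypothetical units)
    demand    -- baseline offered load for each time step
    users     -- number of customers behind the link
    test_rate -- fraction of users who run a test after seeing congestion
    test_load -- extra load added by one running speed test
    Returns a list of (offered_load, congested) tuples, one per step.
    """
    history = []
    testers = 0
    for base in demand:
        offered = base + testers * test_load
        congested = offered > capacity
        history.append((offered, congested))
        # People only reach for a speed test once they notice a problem,
        # so the extra load shows up on the following step.
        testers = round(users * test_rate) if congested else 0
    return history


if __name__ == "__main__":
    demand = [900, 1050, 900, 900, 900]  # one brief spike over capacity
    with_tests = simulate(1000, demand, users=10000, test_rate=0.01, test_load=2)
    without_tests = simulate(1000, demand, users=10000, test_rate=0.0, test_load=2)
    print([congested for _, congested in with_tests])
    print([congested for _, congested in without_tests])
```

In this made-up scenario a single step of excess demand clears on its own when nobody tests, but stays congested once one percent of users start testing, because the tests alone are enough to keep the offered load above capacity.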
Some of this behaviour reminds me of that one time Apple's
captive.apple.com had reachability issues, causing cell networks almost
instantly to run out of capacity due to iPhones collectively
concluding that their local WiFi connections were broken and switching
to cellular data to mitigate imaginary connection problems. Or
when someone accidentally caused some devices out in the field
to all call home at exactly the same time, overwhelming the
local cellular network.
Actions that may seem harmless on their own can quickly become
harmful if automated or done in sync by a large group of
people. While it's easy to fix the automated ones, it's much
harder to fix people.
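For the automated case, one common fix is to spread the work out so that devices do not all act in the same second. The sketch below is an assumption about how that could look, not a description of how any of the incidents above were fixed: it derives a stable per-device offset from a hash of a hypothetical device ID, so each device calls home at its own point inside a window.

```python
import hashlib


def call_home_offset(device_id: str, window_seconds: int) -> int:
    """Map a device ID to a stable offset in [0, window_seconds)."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer, then fold it
    # into the window. The same ID always yields the same offset.
    return int.from_bytes(digest[:8], "big") % window_seconds


if __name__ == "__main__":
    # Hypothetical fleet of 1000 devices spread over a one hour window.
    offsets = [call_home_offset(f"device-{n}", 3600) for n in range(1000)]
    busiest = max(offsets.count(o) for o in set(offsets))
    print(f"distinct seconds used: {len(set(offsets))}, busiest second: {busiest}")
```

A hash-based offset keeps each device's schedule predictable across reboots. A random delay picked at runtime would spread the load just as well if predictability does not matter.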
---
I would like to thank James Rice from Jump Networks for
the data that backed this blog post.