Found at: gopher.blog.benjojo.co.uk:70/speedtest-congestion-feedback-loop

 Stressing the network when it's already down
 ===

A rough demo of what it feels like when your customers make more load as a result of you being overloaded

Impact by Eric Wienke, edits by Ben Cartwright-Cox ;)/ CC BY-NC-SA

 A few days ago, shortly after 17:16 BST, a handful of
networks owned by Liberty Global (though the biggest impact was to
their UK network, known as Virgin Media) started having issues
reaching the rest of the internet. The exact official cause of
this is **currently only rumor**. However, the outage had an
interesting pattern attached to it: their network would fail
almost every hour at 17 mins past (17:17 / 18:17 / 19:17 /
21:17 / 23:17 / 00:17). This produced an interesting graph of
systems going offline, generated from a system I use to monitor my
network's reachability to the rest of the internet:

Grafana graph showing dips in reachability

 Other providers have also posted their point of view of
the outages, but with bandwidth drops being shown instead:

Twitter screenshot: Yesterday doesn't look like it was much fun for network, operations and support folks at Virgin Broadband.

 One interesting thing about the outages is that they all
started at the same minute past the hour, and took a similar
amount of time to resolve.
 A possible cause might be a destructive crontab firing,
since on a stock Debian-style system the hourly crontab folder runs
at exactly 17 mins past the hour:
 ```
 [20:38:43] ben@metropolis:~$ grep hourly /etc/crontab
 17 *    * * *   root    cd / && run-parts --report /etc/cron.hourly
 ```
 I guess we will have to wait for a Reason For Outage
(RFO) report from them to know for sure.
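 As a rough sanity check of the "same minute every hour" observation, here is a minimal Python sketch that compares the outage start times quoted above against the minute field of the hourly cron entry. The helper names are made up for illustration, and lining up with a cron minute is only a hint, not evidence of the actual cause.

```python
from collections import Counter
from datetime import datetime

# Outage start times quoted in the post (clock times only, BST).
OUTAGE_STARTS = ["17:17", "18:17", "19:17", "21:17", "23:17", "00:17"]

# Minute field of the cron.hourly entry mentioned in the post.
CRON_HOURLY_MINUTE = 17


def minute_histogram(times):
    """Count how many events fall on each minute of the hour."""
    return Counter(datetime.strptime(t, "%H:%M").minute for t in times)


def matches_cron_minute(times, cron_minute):
    """Return True if every event starts on the given minute of the hour."""
    return all(
        datetime.strptime(t, "%H:%M").minute == cron_minute for t in times
    )


histogram = minute_histogram(OUTAGE_STARTS)
all_on_cron_minute = matches_cron_minute(OUTAGE_STARTS, CRON_HOURLY_MINUTE)
print(histogram, all_on_cron_minute)
```

 A histogram with a single bucket means every observed outage began on the same minute of the hour, which is the pattern that makes a scheduled job a plausible suspect.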
 ---

 Meanwhile, during the outage, people in backchannels were
noticing that they were seeing traffic spikes to Virgin Media,
causing initial speculation that it was attack traffic.
However, on deeper inspection the traffic turned out to involve their
networks' Speedtest.net servers! These graphs, generated from data given to me by
Jump Networks, show speedtest server traffic going to/from Virgin
Media with the following profile:

Jump net graph

 It would appear that every time that Virgin Media dropped
off, people en masse flocked to speedtest services to confirm
that their internet connection was having problems.

 If you go down to the NetFlow level, you can even see
very clearly when services were restored for customers:
 
Netflow connection graph

Time                 SpeedTest.net flows
2020-04-27T16:16:55  2
2020-04-27T16:16:56  5
2020-04-27T16:16:57  2
2020-04-27T16:16:58  1
2020-04-27T16:17:00  2
2020-04-27T16:17:01  0
2020-04-27T16:17:02  0
(no traffic continues)
2020-04-27T16:20:16  0
2020-04-27T16:20:17  47
2020-04-27T16:20:18  62
2020-04-27T16:20:19  46
2020-04-27T16:20:20  70
2020-04-27T16:20:21  137
2020-04-27T16:20:22  111
2020-04-27T16:20:23  99
2020-04-27T16:20:24  73
2020-04-27T16:20:25  79
2020-04-27T16:20:26  122
2020-04-27T16:20:27  78
2020-04-27T16:20:28  115
2020-04-27T16:20:29  82
2020-04-27T16:20:30  56
2020-04-27T16:20:31  122
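
 To make "you can see when services were restored" concrete, here is a minimal Python sketch that scans per-second flow counts like the ones above for the first zero-flow sample and the first sample afterwards that crosses a threshold. The function name and the threshold of 10 flows are assumptions for illustration, not anything taken from Jump Networks' tooling, and only a subset of the rows is copied in.

```python
from datetime import datetime

# (timestamp, flow count) samples copied from the post's flow table.
# The long run of zero-flow seconds in the middle is elided, as in the post.
SAMPLES = [
    ("2020-04-27T16:16:55", 2),
    ("2020-04-27T16:16:56", 5),
    ("2020-04-27T16:16:57", 2),
    ("2020-04-27T16:16:58", 1),
    ("2020-04-27T16:17:00", 2),
    ("2020-04-27T16:17:01", 0),
    ("2020-04-27T16:17:02", 0),
    ("2020-04-27T16:20:16", 0),
    ("2020-04-27T16:20:17", 47),
    ("2020-04-27T16:20:18", 62),
    ("2020-04-27T16:20:19", 46),
    ("2020-04-27T16:20:20", 70),
]

# Hypothetical cut-off: flows per second that we treat as "users are back".
RETURN_THRESHOLD = 10


def outage_window(samples, threshold):
    """Return (first zero-flow time, first later time with flows >= threshold).

    Either element is None if the corresponding event is never seen.
    """
    first_zero = None
    for stamp, flows in samples:
        when = datetime.fromisoformat(stamp)
        if flows == 0 and first_zero is None:
            first_zero = when
        elif first_zero is not None and flows >= threshold:
            return first_zero, when
    return first_zero, None


went_quiet, came_back = outage_window(SAMPLES, RETURN_THRESHOLD)
gap_seconds = (came_back - went_quiet).total_seconds()
print(went_quiet, came_back, gap_seconds)
```

 On these samples the quiet period runs from 16:17:01 to 16:20:17, and the flow count on return is already far above anything seen before the drop.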
 
 This kind of collective behaviour is fascinating to me, and
also presents an interesting customer-driven positive feedback
loop for networks that might be having temporary congestion
problems: people who are verifying that the network is
congested are themselves adding more congestion to the network.
 Or, put in the form of a drawing:

The speed test feedback loop

 This is not to say that this was the issue that caused
the Virgin Media outages, since these spikes started after
connectivity was restored, not during the outage window.
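
 The speed test feedback loop can also be sketched as a toy model. The Python below is purely illustrative: the load figures, the capacity, and the `test_gain` factor (how much speed test traffic each unit of overload provokes) are all made-up assumptions, not measurements from Virgin Media or Jump Networks.

```python
def simulate(base_load, capacity, test_gain, steps):
    """Toy feedback loop: overload in one step triggers speed tests in the next.

    base_load: normal traffic, in arbitrary units (hypothetical).
    capacity:  what the network can carry, in the same units (hypothetical).
    test_gain: extra speed test load caused per unit of overload (hypothetical).
    Returns the total offered load at each step.
    """
    loads = []
    extra = 0.0
    for _ in range(steps):
        total = base_load + extra
        loads.append(total)
        overload = max(0.0, total - capacity)
        # Frustrated users react to the overload by running speed tests.
        extra = test_gain * overload
    return loads


healthy = simulate(base_load=95, capacity=100, test_gain=1.5, steps=5)
damped = simulate(base_load=105, capacity=100, test_gain=0.5, steps=5)
runaway = simulate(base_load=105, capacity=100, test_gain=1.5, steps=5)
print(healthy)
print(damped)
print(runaway)
```

 In this sketch a network below capacity never triggers any extra load, a small gain settles at a somewhat worse level, and a gain above 1 keeps growing at every step, which is the runaway case the drawing describes.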

 Some of this behaviour reminds me of that one time Apple's
captive.apple.com had reachability issues, causing cell networks to
run out of capacity almost instantly as iPhones collectively
concluded that their local WiFi connections were broken and switched
to cellular data to mitigate imaginary connection problems. Or
the time someone accidentally caused some devices out in the field
to all call home at exactly the same time, overwhelming the
local cellular network.
 Actions that seem harmless when done by one person can quickly
become harmful if automated or done in sync by a large group of
people, and while it's easy to fix the automated ones, it's much
harder to fix people.
 ---
 I would like to thank James Rice from Jump Networks for
the data that backed this blog post.

 If this is your kind of stuff, you may find other bits
you like on the rest of the blog. If you want to stay up to
date with my ramblings or projects you can use my blog’s RSS
Feed or you can follow me on twitter.

