So Pete and I have resolved the situation from last Sunday and, I hope, explained it in detail. It’s a shame that we will start to treat our "carrier grade" routers like crabby Windows 98 PCs, and reboot them at the first sign of trouble. But I challenge any other network engineer to say they could come up with a design that would have withstood this fault, or a sensible method of diagnosing and fixing it.
The broken supervisor is now replaced, and we’re moving on to the larger problem – that our support systems went dark during the outage. We’d not planned for a core network outage lasting longer than a few minutes, so I resorted to Twitter to post updates. Embarrassing, as Twitter itself presented its famous fail whale for a few minutes around 7pm. But it was definitely preferable to twiddling my thumbs while Pete worked on the fault itself – I went from 4 to 250 "followers" in a couple of hours, with links quickly making their way around the internet on other forums. Tim Anderson wrote a thoughtful piece about how useful the mechanism was, but Twitter-using customers are in the minority – it took Google a couple of hours into the outage to pick up on other discussion threads and index them.
I’m not trying to gloss over it – it was a terrible mechanism, if only because I had to describe the problem in haiku, and most of our customers didn’t know where to look.
Expecting the unexpected
So what are we doing to fix this? The starting point for my disaster planning is that our core network never goes down. That’s the design. It’s intended to be incredibly unlikely, and we go to great expense to ensure that it won’t. While our entire network wasn’t down on Sunday (London was still alive), the lesson to learn is that in the face of treacherous network equipment, all bets are off and this would have taken down anyone else’s core network. If this is the kind of fault we might expect, we need to work around it.
To be clear, Bytemark has one core network, with one business goal: reliable hosting for a group of customers with similar needs. If that core breaks, all possible resources are diverted to fixing it. When that happens, we’re not in a position to answer emails, phone calls or do anything other than fix the fault – but we will frequently update customers on our progress and give an ETA. As soon as our core network is functioning again, support emails can queue up, operators can take messages and so on. But I am planning a "red alert" network state where communication can only be one way – there is only one question customers will want to ask in these situations, and only one answer we will be able to give. Everything else should wait.
At least say you’re sorry – no more timeouts
I’m still planning the details, but a common component is what I’m calling our "sorry server", to help us, our customers, and our customers’ customers in the event of small or large outages.
At the moment, packets can come into our network for routing at one of four core routers – two in London, and two in Manchester. They are the front doors to our network, and have an enormous routing capacity, planned to take any kind of abusive traffic from the outside. We can manage if only one of them is working in each site.
When something is going wrong for a web site owner – whether it’s their server being rebooted for a few minutes, or a major external event – that incoming traffic won’t make it to their server. The visitor just sees a spinning hourglass and a "connecting to host…" message that eventually turns into an inexplicable "timeout error". That’s not what we want.
I’m intending to place 2 or 3 "sorry servers" on the edge of the network, plugged directly into all four core routers. These servers will have no other function than to be ready to say "sorry" if something goes wrong. Just a simple web page, a URL to point visitors at for more information. Email can get delayed or turned away while the "sorry" server is answering.
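To make the idea concrete, here’s a minimal sketch of what a sorry server amounts to – not our actual implementation, and the hostnames and wording are invented. It answers every request with an HTTP 503 ("Service Unavailable"), a Retry-After hint, and a short page pointing visitors somewhere useful – which is exactly what a browser needs to show something better than a timeout error:

```python
# Minimal "sorry server" sketch (illustrative only): every request gets
# a 503 and a short explanation, instead of an eventual timeout.
from http.server import BaseHTTPRequestHandler, HTTPServer

SORRY_PAGE = b"""<html><body>
<h1>Sorry - temporary outage</h1>
<p>We are working on the problem. Updates at
<a href="http://status.example.com/">status.example.com</a>.</p>
</body></html>"""

class SorryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(503)                 # "Service Unavailable"
        self.send_header("Retry-After", "300")  # hint: try again in 5 minutes
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(SORRY_PAGE)))
        self.end_headers()
        self.wfile.write(SORRY_PAGE)

# To run it:  HTTPServer(("", 80), SorryHandler).serve_forever()
```

The 503 status matters: search engines treat it as "come back later" rather than indexing the sorry page or dropping the site.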
For our own support services, we can make sure that these point at an off-site status page. All we need for this to function is one working core router out of four, one transit connection in, and the decision to partially shut down the network to turn on this facility if we have to. In the past six years we’ve never been without this option, so I’m confident that it can work.
I’d like to also open the service to all customers, so you can upload and selectively switch on your own "sorry page" when you’re performing maintenance, moving between hosts, or for any other reason. I’ll document this service on our main web site when it becomes available.
Why not DNS updates?
A couple of customers asked us why they couldn’t update their Bytemark-hosted DNS quickly in the event of an outage. Answer: because it won’t work fast enough – DNS changes really need days to percolate through the internet’s various DNS caches. People will see your old IP even after you’ve made a change, but worse, when you want to switch it back again, you might be stuck pointing at your backup hosting for much longer than you intended.
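To see why, here’s a toy model of a resolver cache – the names, IPs and TTLs are invented. Once a resolver has cached your record, it keeps answering with the old IP until the TTL runs out, regardless of what the authoritative server now says:

```python
import time

class ToyResolverCache:
    """Toy DNS resolver cache: serves a cached answer until its TTL expires."""
    def __init__(self):
        self._cache = {}  # name -> (ip, expiry time)

    def resolve(self, name, authoritative_lookup):
        entry = self._cache.get(name)
        if entry and time.monotonic() < entry[1]:
            return entry[0]  # within TTL: the old answer, even if now stale
        ip, ttl = authoritative_lookup(name)
        self._cache[name] = (ip, time.monotonic() + ttl)
        return ip

# Hypothetical zone: the record changes mid-outage, but caches don't notice.
records = {"www.example.com": ("192.0.2.1", 86400)}  # one-day TTL

resolver = ToyResolverCache()
resolver.resolve("www.example.com", records.get)      # caches 192.0.2.1
records["www.example.com"] = ("198.51.100.9", 86400)  # emergency DNS change
# Visitors behind this cache still get the old IP for up to a day:
assert resolver.resolve("www.example.com", records.get) == "192.0.2.1"
```

And the same delay applies in reverse when you switch back – which is how you end up stuck on your backup hosting.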
The sorry server will be a smarter way of doing this: it will let you send HTTP redirects or SMTP responses that can be switched off and on instantly.
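As a sketch of the redirect case (again illustrative, with an invented backup URL): a temporary HTTP 302 sends visitors to backup hosting the moment it’s switched on, and stops the moment it’s switched off – no caches involved, so no waiting for TTLs in either direction:

```python
# Sketch of an instantly-switchable redirect (hypothetical backup URL).
# Set REDIRECT_TARGET to None to fall back to a plain sorry page.
from http.server import BaseHTTPRequestHandler, HTTPServer

REDIRECT_TARGET = "http://backup.example.com/"

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if REDIRECT_TARGET:
            self.send_response(302)  # temporary redirect: trivial to undo
            self.send_header("Location", REDIRECT_TARGET)
            self.end_headers()
        else:
            body = b"<h1>Sorry - back shortly</h1>"
            self.send_response(503)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
```

A 302 (rather than a permanent 301) is the right choice here precisely because browsers and search engines won’t remember it once the outage is over.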
Getting our house in order
The sorry server is only a starting point, but a flexible one. I will still duplicate our forums, main web site and phone answering service onto an off-site server, and be able to flick a switch to take those services off our network if necessary. So Bytemark will always have the bones of our support operation present, even if a similar situation were to crop up again. I’ll let you know how we’re doing on this.
From this and other recent hosting war stories and scares, I’ve been intending to complete our failsafe systems in creative ways, and fix niggling bugs, before striding ahead with new developments. So now is an excellent time for any customers to email me with your least favourite Bytemark bugs, and I’ll let you know what I’m doing about them.