Dealing with our first major outage

As a hosting company, high availability is crucial. Bytemark now has a 99% uptime guarantee and a 100% core network guarantee. But back in 2009, we experienced a major networking issue which, although quickly resolved, has had a lasting impact on the company through what we learnt from it.

What was the issue?

The technical issue, as far as we have diagnosed it, was that our core switching infrastructure in Manchester was spontaneously swamped with traffic. While we were trying to find its source, one of our core routers crashed and reloaded itself. When the router was operational again, the obstructive traffic disappeared, which brought the network back online.

If you’re interested, I explained in more detail on our forum: power outage and network outage.

In the end, a broken supervisor was located and replaced. Since then, we have built our own data centre rather than renting space. This gave us greater control over our infrastructure, hardware and monitoring, which means another outage like this one is highly unlikely.

But that wasn’t the biggest problem

We knew that, in the future, just resolving technical issues wouldn’t be good enough. There was a bigger problem: our support systems went dark during the outage.

When the servers went down, our phone line was also unreachable. We’d not planned for a core network outage of longer than a few minutes, so I resorted to Twitter to post updates. But Twitter-using customers were a minority back then. Links were shared around the internet, but it took Google a couple of hours to pick up on other discussion threads and index them. So whilst it was a solution, it certainly wasn’t ideal.

Making sure it didn’t happen again

Since this outage, we’ve made many changes to make sure our customers can always stay informed about the network.

1) Sorry Servers

To help you, and your own clients, in the event of an outage, we introduced what I call a “sorry server”.

At the time, packets could come into our network for routing at one of four core routers: two in London, and two in Manchester. They were the front doors to our network, and had an enormous routing capacity, planned to take any kind of abusive traffic from the outside. We could manage with only one of them working at each site.

When something went wrong for a website owner, whether it was their server being rebooted for a few minutes or a major external event, incoming traffic didn’t make it to their server. The visitor just saw a spinning hourglass and a “connecting to host…” message that eventually turned into an inexplicable “timeout error”. That wasn’t good enough.

I placed “sorry servers” on the edge of the network, plugged directly into all four core routers. These servers had no other function than to be ready to say “sorry” if something went wrong. They provided a simple web page giving visitors more information about the problem. Emails could be delayed or turned away while the “sorry” server was answering, allowing our team to focus on fixing any problems.

We extended the service to all customers, so you can upload and selectively switch on your own “sorry page” when you’re performing maintenance, moving between hosts, or for any other reason.
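As a rough illustration of the idea, here is a minimal sketch of a sorry page in Python. It is not the software we ran: the page text, the port, the choice of a 503 status with a Retry-After header, and the names `sorry_response`, `SorryHandler` and `serve` are all assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page content; a real sorry page would describe the actual problem.
SORRY_BODY = (
    b"<html><body><h1>Sorry</h1>"
    b"<p>This site is temporarily unavailable. Please try again shortly.</p>"
    b"</body></html>"
)


def sorry_response(retry_after_seconds=300):
    """Build the status code, headers and body for a sorry page."""
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(SORRY_BODY)),
        # Hint to browsers and crawlers that the outage is temporary.
        "Retry-After": str(retry_after_seconds),
    }
    return 503, headers, SORRY_BODY


class SorryHandler(BaseHTTPRequestHandler):
    """Answer every GET with the same sorry page, whatever the path."""

    def do_GET(self):
        status, headers, body = sorry_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the sorry server locally; the port is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), SorryHandler).serve_forever()
```

Calling `serve()` and visiting http://127.0.0.1:8080/ shows the page. A 503 response tells visitors and search engines that the problem is temporary, which is gentler than a timeout.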

2) Service Status Report

We created status.bytemark.org – an off-site status page for our own support services. All we need for this to function is one working core router out of four, one transit connection in, and the decision to partially shut down the network to turn on this facility if we have to.  This provides a source of information about issues past and present, giving greater transparency to our users. Find out more about how and why we built this.

Why not DNS updates?

A couple of customers asked us why they couldn’t update their Bytemark-hosted DNS quickly in the event of an outage. Answer: because it won’t work fast enough. Resolvers across the internet cache DNS records for as long as each record’s TTL allows, so a change can take hours or even days to percolate through. People will see your old IP even after you’ve made a change but, worse, when you want to switch back again, you might be stuck pointing at your backup hosting for much longer than you intended.
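To put rough numbers on that, here is a small sketch of the worst case, assuming every resolver honours the record’s TTL (some cache for longer). The one-day TTL, the dates and the function name `worst_case_stale_until` are made up for illustration.

```python
from datetime import datetime, timedelta


def worst_case_stale_until(change_time, ttl_seconds):
    """Latest moment a TTL-honouring resolver could still serve the old record.

    Worst case: the resolver cached the old record just before the change.
    """
    return change_time + timedelta(seconds=ttl_seconds)


# Hypothetical timeline: switch to backup hosting at 09:00, switch back at 11:00.
ttl = 86400  # one day, an assumed TTL
switch_to_backup = datetime(2020, 1, 1, 9, 0)
switch_back = datetime(2020, 1, 1, 11, 0)

# Some visitors may keep seeing the broken primary IP until this time...
print(worst_case_stale_until(switch_to_backup, ttl))
# ...and others may keep landing on the backup hosting until this time.
print(worst_case_stale_until(switch_back, ttl))
```

In this example a two-hour outage leaves some visitors on the backup hosting for a day after the switch back, which is the trap described above.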

The sorry server was a smarter way of doing this and allowed you to send instant HTTP redirects and turn SMTP responses off/on instantly.

The long-term plan

The sorry server was only a starting point, but a flexible one. I also duplicated our forums, main web site and phone answering service onto an off-site server, allowing us to flick a switch to take those services off our network if necessary. So Bytemark will always have the bones of our support operation present.

Our New Phone System

In 2004, I downloaded Asterisk. This program is very capable, but the design leaves much to be desired. It features about four different scripting languages which all boil down to something that looks like 20-year-old BASIC. U-turns abound in its documentation: even the simple task of setting a variable says “Version differences: This command is not available in Asterisk 1.0.9. Use SetVar instead. As of v1.2 SetVar is deprecated and we are back to Set.”

Despite these criticisms, Asterisk has been running our company phones (including home workers, voicemail, smart caller ID and smooth redirections to our Manchester call centre) for the last five years, with barely any maintenance. But I’d never thought very hard about how to make it robust.

Planning for the lights to go out

In response to the outage mentioned earlier in this article, my priority was making sure we couldn’t lose touch with customers again. So I needed to rearrange our phone system to survive in the face of network trouble. Currently, every phone connects to the office phone server, and if that’s down, the line is dead. If the office ADSL is down, the line is dead, and even the home workers can’t talk to our customers. If the London network stops working, the line is dead.

Increasing Redundancy

To take these points of failure out, I can take advantage of our network: we have two pretty separate parts to our core. Our London racks are rich with connections but very expensive, so we don’t host much there.   Our newer Manchester space is physically larger, but without the same richness in terms of criss-crossing minor connections.  The failures that we’ve seen have only ever affected one or the other, so I’ve put one new phone server in each location.

How do these servers function as one unit?

We only have one advertised phone number, and one supplier for this phone number, Magrathea. Unfortunately, they will only try to route to one of my servers at a time, so I have to pick one to receive our incoming calls. Their racks are in London, so I’ll use the London server unless anything goes wrong. If it does, I can ask Magrathea to send the calls elsewhere. All of my servers, though, can send their outgoing calls through Magrathea at once if necessary.

At the other end, on our desks, each phone has a very handy function allowing them to connect to 4 SIP servers at once, each with its own button. So, I can tell every one of our handsets to connect to every one of our four servers, and all the individual servers to connect to each other in a big mesh.  It’s all over IP, so it’s free!

How will this work for handling calls?

Instead of having to worry about which server is currently “live”, I’ve told every server to try to dial out four times simultaneously.  The first attempt is to a “local” SIP connection, our desk phones connecting directly to that server.  The other three connections go to the other three peers and try to connect to the same phone via that peer.  So the command looks like this:

Dial(SIP/mbloch-desk&IAX2/manchester/301&IAX2/office/301&IAX2/offsite/301)

When any one of those connections picks up, Asterisk cancels the other dialling attempts and the call proceeds.  If any of the connections are down, whether to another server or the phone itself, Asterisk quietly gives up and carries on ringing the other connections.  All the while, the caller only hears a single ringtone.
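The pattern in that command is regular enough to generate. Here is a minimal sketch in Python of how the string could be assembled. The helper names `build_dial_targets` and `build_dial_call` are hypothetical and not part of Asterisk or of my configuration; the device, peer and extension names are the ones from the example above.

```python
def build_dial_targets(local_device, peers, extension):
    """Join one local SIP device and one IAX2 route per peer with '&'.

    Asterisk's Dial application rings all '&'-separated targets at once.
    """
    targets = [f"SIP/{local_device}"]
    targets += [f"IAX2/{peer}/{extension}" for peer in peers]
    return "&".join(targets)


def build_dial_call(local_device, peers, extension):
    """Wrap the targets in a Dial(...) application call."""
    return f"Dial({build_dial_targets(local_device, peers, extension)})"


# Rebuild the command shown above from its parts.
print(build_dial_call("mbloch-desk", ["manchester", "office", "offsite"], "301"))
```

Each server would use its own list of peers, leaving itself out, so that the first target is always the direct connection to the phone.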

So the upshot is that if one server fails, outbound calls still work by the user hitting “line 2” or “line 3”, and the only thing I have to do is signal Magrathea to ask them to send incoming calls through another server. The meshing means customers never hear a dead line again, hooray!

Testing and Deployment

In order to make the setup as robust and easy to test as possible, I’ve used the excellent Capistrano automation tool to package up the whole configuration and startup routines into one place. So when I make a change I can just type “cap deploy” and all my changes go out to the four servers. It’s very well suited to this; I hardly had to make any changes to the way it works.

But I’ve not implemented any automatic testing yet, because I’ve not got my head around the relevant tools. SIPp appears to be the only one I could find, and it doesn’t help me test the meshed IAX connections.

Because these are all new servers, I’ve just made most of it live and placed heaps of calls.  I think that’s what real phone engineers do, at least in part.   The worst that has happened is a cascade between servers (i.e. one inbound call caused four more internal calls, which caused four more internal calls, which caused etc.).  I thought it was very funny at the time because it made my phone look like a Christmas tree. But I’d forgotten to take out our call centre’s number as a fallback, so I’d accidentally placed 50 simultaneous silent calls to them.  More than once I think.  They told me so.  Sorry!
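For the curious, the cascade is easy to model. The sketch below counts the server-to-server calls triggered by one inbound call, assuming each call fans out to a fixed number of further calls per hop. The function name, the hop limit and the guard (forwarded calls are not relayed again) are assumptions for illustration, not a record of how I fixed it.

```python
def count_internal_calls(fan_out, max_hops, relay_forwarded_calls):
    """Count server-to-server calls triggered by one inbound call.

    fan_out: internal calls placed for each call a server receives.
    max_hops: how many rounds of forwarding to simulate.
    relay_forwarded_calls: if False, a call that arrived from a peer
        is not forwarded again, so the cascade stops after one hop.
    """
    total = 0
    calls_this_hop = 1  # the original inbound call
    for _ in range(max_hops):
        calls_this_hop *= fan_out
        total += calls_this_hop
        if not relay_forwarded_calls:
            break
    return total


# Unguarded mesh: every forwarded call is forwarded again.
print(count_internal_calls(fan_out=4, max_hops=3, relay_forwarded_calls=True))
# Guarded mesh: only the first server fans out.
print(count_internal_calls(fan_out=4, max_hops=3, relay_forwarded_calls=False))
```

The unguarded count grows geometrically with each hop, which is why a handful of servers can light up a phone, and a call centre, so quickly.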

Then, going forward, the same strategically-placed servers will also run self-contained copies of our main website, forum and support email facilities, and I’ll be testing the failover for those.