Notes on a robust business phone network

I’m close to flicking the switch on our new phone system – is anyone interested in the design? I’m not certain myself, which is to say I’m not a phone engineer of any kind. Last I looked, phone engineers lived in a topsy-turvy parallel universe with their own private members’ clubs, secret handshakes from government and cultish ideas about circuit switching. Me, I just downloaded asterisk five years ago and found a rebel band of software engineers trying to crowbar their way in.

And there is something of the crowbar about asterisk – a box of bits featuring about four different scripting languages which all boil down to something that looks like 20-year-old BASIC. U-turns abound in its documentation: even the simple task of setting a variable says "Version differences: This command is not available in Asterisk 1.0.9. Use SetVar instead. As of v1.2 SetVar is deprecated and we are back to Set." Somehow this shed of a program has been running our company phones – including home workers, voicemail, smart caller ID and smooth redirections to our Manchester call centre – for the last five years, with barely any maintenance. Whatever crimes of design it might have committed, asterisk is very capable, but I’d never thought very hard about how to make it robust.

Planning for the lights to go out

My first priority after last month was to make sure that we couldn’t lose touch with our customers again, and because the current asterisk system is a bit of a toy (albeit a long-lived and very reliable one) I needed to rearrange it to survive in the face of network trouble.  And the design work will be very similar for our key services – our email support, web site, forums and so on.

So this is what our network looks like for disaster recovery purposes:

Right now, we’ve got our awesome Fisher-Price FBI phones, GXP2000s in the office, one per desk, and a couple of home workers. Every phone has to connect to the office phone server, and if that’s down, the line is dead. If the office ADSL is down, the line is dead: even the home workers can’t talk to our customers. If the London network stops working, the line is dead.

To take these points of failure out, I can take advantage of our network: we have two pretty separate parts to our core. Our London racks are rich with connections but very expensive, so we don’t host much there.   Our newer Manchester space is physically larger, but without the same richness in terms of criss-crossing minor connections.  The failures that we’ve seen have only ever affected one or the other, so I’ve put one new phone server in each location.

I’m also commissioning another offsite in the Netherlands. This could prove to be a rotten idea because of the latency, but we’ll have to see – it’s intended as an absolute last resort in the unprecedented event of both sides shutting down.

Who’s talking to whom

I can install all these servers, but how do they function as one unit? 

We only have one advertised phone number, and one supplier for this phone number, the shadowy folk at Magrathea (whose services are the lynchpin of most of the UK VoIP industry, as far as I can tell). Unfortunately they will only try to route to one of my servers at a time, so I have to pick one to receive our incoming calls – their racks are in London, so I’ll use the London server unless anything goes wrong. When it does, I can update Magrathea and ask them to send the calls elsewhere. Outgoing calls are less fussy: all of my servers can send theirs through Magrathea at once if necessary.

At the other end, on our desks, the GXP2000s have a very handy function allowing them to connect to 4 SIP servers at once, each with its own button – aha:

So what I’ve been able to do is to tell every one of our handsets to connect to every one of our four servers, and all the individual servers to connect to each other in a big mesh.  It’s all over IP, so it’s free!  The wonder of VoIP.  This is what it looks like:

[Diagram: Asterisk mesh]

At our desks we can all feel like Wall Street big shots – "yah yah, I’ll just have to try routing that one via London, excuse me one second (beep) hang on let me send that one through Amsterdam (beep)", pressing buttons to send our outgoing calls through each different server. No, not really, only a fool would do that.

But it does mean that we can receive incoming calls from any server.  Instead of having to worry about which server is currently "live", I’ve told every server to try to dial out four times simultaneously.  The first attempt is to a "local" SIP connection, our desk phones connecting directly to that server.  The other three connections go to the other three peers, and try to connect to the same phone via that peer.  So the command looks like this:

Dial(SIP/mbloch-desk&IAX2/manchester/301&IAX2/office/301&IAX2/offsite/301)

When any one of those connections picks up, Asterisk cancels the other dialling attempts, and the call proceeds.  If any of the connections are down, whether to another server or the phone itself, Asterisk quietly gives up and carries on ringing the other connections.  All the while, the caller only hears a single ring tone.
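As a rough sketch of how that Dial line could be produced for each server rather than typed out four times, here is a small Python helper. It is only an illustration: the server name "london" is my assumption (the example above only names manchester, office and offsite as peers), and the function name is made up for this post.

```python
# Peer names as they would appear in each server's IAX2 configuration.
# "london" is an assumed name; the other three come from the Dial example.
SERVERS = ["london", "manchester", "office", "offsite"]


def dial_string(this_server, local_device, extension, servers=SERVERS):
    """Build the Dial() argument as seen from one server.

    The first target is the desk phone registered directly with this
    server over SIP; the rest ask every other server to ring the same
    extension over IAX2.
    """
    if this_server not in servers:
        raise ValueError(f"unknown server: {this_server}")
    targets = [f"SIP/{local_device}"]
    targets += [
        f"IAX2/{peer}/{extension}" for peer in servers if peer != this_server
    ]
    # "&" tells Dial to ring all of the targets at the same time.
    return "&".join(targets)


print(f"Dial({dial_string('london', 'mbloch-desk', '301')})")
```

Run from the assumed "london" server this prints the same line as the example above; passing a different server name simply swaps which peer is left out of the IAX2 list.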

So the upshot is that if one server conks out, outbound calls still work by the user hitting "line 2" or "line 3", and the only thing I have to do is signal Magrathea to ask them to send incoming calls through another server.  The meshing keeps any dirty secrets about our network status hidden from callers, and customers never hear a dead line again, hooray!

Testing and deployment

In order to make the setup as robust and easy to test as possible, I’ve used the excellent Capistrano automation tool to package up the whole configuration and startup routines into one place. So when I make a change I can just type "cap deploy" and all my changes go out to the four servers, and everything. It’s very well suited to this: I hardly had to make any changes to the way it works.

But I’ve not implemented any automatic testing yet because I’ve not got my head around the relevant tools. SIPp appears to be the only one that I could find, and that doesn’t help me test the meshed IAX connections.

Because these are all new servers, I’ve just made most of it live, and placed heaps of calls.  I think that’s what real phone engineers do, at least in part.   The worst that has happened is a cascade between servers (i.e. one inbound call caused four more internal calls, which caused four more internal calls, which caused etc.).  I thought it was very funny at the time because it made my phone look like a Christmas tree.  Ho ho ho.  But I’d forgotten to take out our call centre’s number as a fallback, so I’d accidentally placed 50 simultaneous silent calls to them.  More than once I think.  They told me so.  Sorry!
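To put numbers on that cascade, here is a toy Python model of it. It is a simplification and not how Asterisk itself behaves: it ignores answered calls cancelling the other attempts, and it needs an artificial hop limit because the unguarded mesh never stops on its own. The guard it compares against (a server rings its local phones for a call handed over by a peer, but does not pass it on again) is one possible fix that I am assuming for illustration, not necessarily the one used here.

```python
def inter_server_calls(n_servers, max_hops, relay_forwarded):
    """Count the server-to-server calls triggered by one inbound call.

    n_servers       -- size of the full mesh
    max_hops        -- artificial cut-off so the unguarded case terminates
    relay_forwarded -- if True, a server that receives a call from a peer
                       forwards it to all its peers again (the cascade);
                       if False, only the first server forwards
    """
    calls = 0
    arriving = 1  # calls arriving at servers in this hop; starts with the inbound call
    for hop in range(max_hops):
        if hop > 0 and not relay_forwarded:
            break  # guarded: calls received from peers are not passed on
        forwarded = arriving * (n_servers - 1)
        calls += forwarded
        arriving = forwarded
    return calls


print(inter_server_calls(4, 3, relay_forwarded=True))   # 3 + 9 + 27 = 39
print(inter_server_calls(4, 3, relay_forwarded=False))  # 3
```

With four servers the unguarded mesh triples the number of internal calls at every hop, which is how one inbound call turns into a Christmas tree within a second or two.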

Unsolved problems

I did say I was nearly ready… all that remains over the next couple of days is:

  • when a single call comes in, all four lights on our phones flash up, which is panic-inducing as it makes me think all our customers are calling at once.  That’s never a good sign.  How do I keep the robust signalling but remove all the blinking lights?
  • how to make sure the calls are routed through the nearest phone server?  I don’t want to be talking to our customers via the Netherlands if I don’t have to be.
  • is it worth trying to automate re-routing of inbound calls?  Currently I have assigned a magic phone number to each server which I can call to make it "claim" the main phone number from Magrathea.  Trying to automate it seems risky.
  • do I need to worry about local storage on each server?  I am adding a few features to be able to leave a "sorry, we know about problem X" message, so that incoming callers know we know about problems when they happen.  But I would prefer not to worry about updating each server individually with the same message.
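On the question of automating the inbound re-routing, one cautious shape for it is to act only after several health checks in a row have failed, so that a single dropped packet does not move the main number. The Python sketch below shows just that decision rule. Everything in it is hypothetical: the function name, the threshold of three and the idea of a periodic check are my assumptions, and it deliberately leaves out how the check is made or how Magrathea is told.

```python
def should_claim(recent_checks, threshold=3):
    """Decide whether a standby server should claim the main number.

    recent_checks -- list of booleans, oldest first, where True means the
                     currently live server answered that health check
    threshold     -- how many consecutive failures are required (assumed)
    """
    if len(recent_checks) < threshold:
        return False  # not enough evidence yet
    # Claim only if every one of the most recent checks failed.
    return not any(recent_checks[-threshold:])


print(should_claim([True, False, False, False]))  # True: three failures in a row
print(should_claim([False, True, False]))         # False: a success in between
```

This does nothing about two standby servers both deciding to claim the number at the same moment, which is a large part of why automating it feels risky.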

So once these are fixed to my satisfaction, we should be moved over next week.  Since real live Asterisk setups seem thin on the ground, I might publish this setup as a worked example (once it’s worked for a month or two, ha) if anyone is interested.

And while it’s bedding in, the same strategically-placed servers will also run self-contained copies of our main web site, forum and support email facilities, and I’ll be testing the failover for those.

Oh my god, I’ve just noticed, it’s a cloud.  I’ve made a cloud.  I’ve even drawn a bloody cloud.