custodian: a network monitoring program

This is the first in a series of posts introducing some of the free software that Bytemark have written, which ought to be of interest to system administrators. We’ve technically “released” these programs on our projects server, but in 2013 we’re trying to really polish them and get them into wider use.

Our latest is called custodian, a network and service monitor which allows us to test for a set of ideal network conditions, and report when any of those conditions aren’t met. It does this by firing a variety of probes (ping, SSH, LDAP, DNS) to remote destinations and reporting on their success or failure.

This kind of monitoring felt like it should be simple to implement, and very reliable. In practice, monitoring at scale is hard. Over the years we’ve experimented with several different approaches and almost every single tool we’ve tried has been unsatisfactory in some regard, usually down to problems of scale.

We want to be alerted to outages promptly, but we also want to monitor thousands of services on our network. That means we have to be able to run 2000-5000 tests as quickly as possible, so that we can detect a failure in our infrastructure, or on a host belonging to one of our managed clients.

We also wanted to be able to express our tests very quickly and concisely, which resulted in this pleasing, if slightly wacky, syntax:

must run http with content 'Symbiosis is an easy-to-use hosting environment'.
must run http with content 'Debian' otherwise 'Bytemark Mirror: HTTP failure'.
APPSERVERS must ping otherwise 'Bytemark application server'.
APPSERVERS must run ssh otherwise 'Bytemark application server'.
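
To show how little machinery a syntax like this needs, here is a minimal Python sketch of a parser for rules of the shape shown above. It is not custodian's own parser: the grammar is inferred only from these example lines, the real syntax may well accept more forms, and the names `RULE`, `Check` and `parse_rule` are invented for this illustration.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical grammar, inferred only from the example rules in this post:
#   <target> must (ping | run <protocol>)
#       [with content '<text>'] [otherwise '<alert text>'].
RULE = re.compile(
    r"""^(?P<target>\S+)\s+must\s+
        (?:run\s+(?P<protocol>\w+)|ping)
        (?:\s+with\s+content\s+'(?P<content>[^']*)')?
        (?:\s+otherwise\s+'(?P<alert>[^']*)')?
        \s*\.\s*$""",
    re.VERBOSE,
)


class Check(NamedTuple):
    target: str             # host, URL or macro name such as APPSERVERS
    protocol: str           # "ping", "ssh", "http", ...
    content: Optional[str]  # text the response must contain, if any
    alert: Optional[str]    # message to raise when the check fails


def parse_rule(line: str) -> Check:
    """Turn one rule line into a Check, or raise ValueError."""
    match = RULE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised rule: {line!r}")
    # A bare "ping" has no "run <protocol>" part, so default to "ping".
    protocol = match.group("protocol") or "ping"
    return Check(match.group("target"), protocol,
                 match.group("content"), match.group("alert"))
```

For example, `parse_rule("APPSERVERS must run ssh otherwise 'Bytemark application server'.")` gives a `Check` whose target is `APPSERVERS` and whose protocol is `ssh`. A rule with a `with content` clause parses the same way once it has a target in front of it, for instance the placeholder `http://example.com/`.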

Different software has proved unsuitable over the years. For example, nagios is a wonderful monitor, but it couldn’t cope with the volume of tests we were running because it relies on running a UNIX process per test.

Our previous monitor was a single process which did everything internally, and in a single thread, so short of adding more RAM it was destined to get behind sooner or later: one thread just isn’t enough when you’re monitoring a lot of servers. (The general case is fine: if all your servers are “up”, probes are typically fast to return. But as soon as a host is down you’ll start to see timeouts probing it, and that slows down all further tests that are backed up behind it.)
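
The effect of a few dead hosts on a single-threaded monitor is easy to see with some back-of-the-envelope arithmetic. The Python sketch below is only an illustration: the probe time, the timeout and the number of dead hosts are made-up figures, not measurements from our network, and `sweep_seconds` is a hypothetical helper.

```python
def sweep_seconds(total: int, down: int, fast: float, timeout: float,
                  workers: int = 1) -> float:
    """Estimate one full pass over all tests.

    Assumes healthy probes take `fast` seconds, probes to dead hosts take
    the full `timeout`, and the work divides evenly across `workers`.
    """
    if not 0 <= down <= total:
        raise ValueError("down must be between 0 and total")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial = (total - down) * fast + down * timeout
    return serial / workers


# Illustrative figures only: 5000 tests, 10 ms per healthy probe,
# 5 s timeout, 20 tests aimed at dead hosts.
all_up = sweep_seconds(5000, 0, fast=0.01, timeout=5.0)
one_thread = sweep_seconds(5000, 20, fast=0.01, timeout=5.0)
four_workers = sweep_seconds(5000, 20, fast=0.01, timeout=5.0, workers=4)
```

With these assumed figures a pass takes about 50 seconds when everything is up. Twenty dead hosts add 100 seconds of timeouts, pushing a single thread to roughly 150 seconds, while four workers bring the same pass back to about 37 seconds.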

custodian sorts out our monitoring in a more scalable fashion. The tests we wish to perform are parsed from a config file (exactly as in the snippet above). They’re added to a queue as they’re parsed, and the whole parse completes in a couple of seconds. For the queue we’re using the popular beanstalkd.

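Checks are then pulled from the queue by a number of workers, each of which runs one check at a time and reports the result. So the resulting flow of work is: the parser reads the config file and puts each test on the beanstalkd queue, the workers take tests off the queue and run them, and each result is passed on as a notification.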
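
The parser, queue and worker arrangement can be sketched in a few lines of Python. This is not custodian's code, just the same pattern in miniature: an in-process `queue.Queue` stands in for beanstalkd, threads stand in for the worker processes, and `run_checks` and `tcp_probe` are names made up for this illustration.

```python
import queue
import socket
import threading
from typing import Callable, Dict, Hashable, Iterable, Tuple


def tcp_probe(check: Tuple[str, int], timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) opens in time."""
    host, port = check
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, DNS failure, ...
        return False


def run_checks(checks: Iterable[Hashable],
               probe: Callable[[Hashable], bool],
               workers: int = 4) -> Dict[Hashable, bool]:
    """Queue every check, then let `workers` threads drain the queue."""
    jobs: "queue.Queue[Hashable]" = queue.Queue()
    for check in checks:  # the "parser" side: enqueue everything up front
        jobs.put(check)

    results: Dict[Hashable, bool] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                check = jobs.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                ok = bool(probe(check))
            except Exception:
                ok = False  # a crashing probe counts as a failed check
            with lock:
                results[check] = ok

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Calling `run_checks([("localhost", 22), ("localhost", 80)], tcp_probe)` returns a dictionary mapping each check to `True` or `False`. Because each worker only blocks on its own probe, a host that times out holds up one worker rather than the whole run. A real queue server such as beanstalkd adds what this sketch lacks: workers on other machines can pull from the same queue.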

custodian works because it allows us to easily add more nodes to actually perform the tests, without changing a line of code or complicating the deployment. As things stand we’re running the “parser” and four of the “worker” processes on a single machine and that is sufficient to run all our tests in under a minute, and notifications come in very quickly as a result.

We’re looking forward to expanding the scope of our monitoring due to this new-found efficiency.

custodian can notify by email, or into a redis database for test purposes, but its main target is our alert management system mauve, which we’ll write about next time. For now custodian should let you run your network checks really quickly, as well as making sure those checks are changeable at any time. So if that interests you, check out custodian’s project page.