Custodian is a piece of free software written by Bytemark for system administrators. It is a network and service monitor which allows us to test for a set of ideal network conditions, and report when any of those conditions aren’t met. It does this by firing a variety of probes (ping, SSH, LDAP, DNS) to remote destinations and reporting on their success or failure.
Why did we create it?
This kind of monitoring felt like it should be simple to implement, and very reliable. In practice, monitoring at scale is hard. Over the years we’ve experimented with several different approaches. But almost every single tool we’ve tried has been unsatisfactory in some regard, usually down to problems of scale.
We want to be alerted to outages promptly, but we also want to monitor thousands of services on our network. That means we have to be able to run 2000-5000 tests as quickly as possible so that we can detect a failure in our infrastructure, or in a host belonging to one of our managed clients.
We also wanted to be able to express our tests very quickly and concisely. Different software over the years has proved unsuitable. For example, Nagios is a wonderful monitor, but it couldn’t cope with the volume of tests we were running because of its reliance on running a UNIX process per test.
Our previous monitor was a single process which did everything internally and in a single thread. So, short of adding more RAM, it was destined to get behind sooner or later. One thread just isn’t enough when you’re monitoring a lot of servers. (The general case is fine: if all your servers are “up”, probes are typically fast to return. But as soon as a host is down you’ll start to see timeouts probing it, and that slows down all further tests that are backed up behind it.)
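To put rough numbers on that backlog, here is a small back-of-the-envelope sketch. The host counts, probe times and timeout are invented for illustration and are not measurements from our network. The model also assumes work divides perfectly between workers, which is optimistic.

```python
def sweep_seconds(total_hosts, down_hosts, fast_ms, timeout_ms, workers=1):
    """Estimate how long one full sweep of probes takes.

    Hosts that are up answer in fast_ms; hosts that are down cost a full
    timeout_ms. All figures are hypothetical inputs, not measured values.
    """
    up_hosts = total_hosts - down_hosts
    total_ms = up_hosts * fast_ms + down_hosts * timeout_ms
    # Assume the work is shared evenly between the workers.
    return total_ms / (workers * 1000)


# Every host up, one thread: 3000 probes at 50 ms each.
all_up = sweep_seconds(3000, 0, fast_ms=50, timeout_ms=10_000)

# Thirty hosts down, one thread: each of them costs a 10 second timeout.
some_down = sweep_seconds(3000, 30, fast_ms=50, timeout_ms=10_000)

# The same outage shared between four workers.
some_down_4 = sweep_seconds(3000, 30, fast_ms=50, timeout_ms=10_000, workers=4)

print(all_up, some_down, some_down_4)
```

With these made-up figures, having one percent of hosts down roughly triples the time a single thread needs for a sweep. That is the effect described above.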
How does custodian work?
custodian sorts out our monitoring in a more scalable fashion. The tests we wish to perform are parsed from a config file (as shown below). They’re added to a queue as they’re parsed, which completes in a couple of seconds. For the queue we’re using the popular beanstalkd.
http://symbiosis.bytemark.co.uk/ must run http with content 'Symbiosis is an easy-to-use hosting environment'.
http://mirror.bytemark.co.uk/ must run http with content 'Debian' otherwise 'Bytemark Mirror: HTTP failure'.
APPSERVERS must ping otherwise 'Bytemark application server'.
APPSERVERS must run ssh otherwise 'Bytemark application server'.
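To give a feel for how little structure each rule needs, here is a minimal Python sketch that breaks rules of the shape shown above into their parts. It is not custodian's own parser. The regular expression, the `parse_rule` name and the dictionary keys are all assumptions made for this example, and it only handles the four forms in the sample. A name such as APPSERVERS is treated as an opaque target.

```python
import re

# Hypothetical grammar covering only the sample rules above:
#   TARGET must ping [otherwise 'ALERT'].
#   TARGET must run PROTOCOL [with content 'TEXT'] [otherwise 'ALERT'].
RULE = re.compile(
    r"^(?P<target>\S+)\s+must\s+"
    r"(?:ping|run\s+(?P<protocol>\w+))"
    r"(?:\s+with\s+content\s+'(?P<content>[^']*)')?"
    r"(?:\s+otherwise\s+'(?P<alert>[^']*)')?"
    r"\.?\s*$"
)


def parse_rule(line):
    """Split one rule into target, protocol, expected content and alert text."""
    match = RULE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised rule: {line!r}")
    return {
        "target": match.group("target"),
        # A rule without "run PROTOCOL" is a ping test.
        "protocol": match.group("protocol") or "ping",
        "content": match.group("content"),
        "alert": match.group("alert"),
    }


rule = parse_rule(
    "http://mirror.bytemark.co.uk/ must run http with content 'Debian' "
    "otherwise 'Bytemark Mirror: HTTP failure'."
)
print(rule["protocol"], rule["content"], rule["alert"])
```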
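Then checks are pulled from the queue by a number of workers, each of which runs one check at a time and reports on the result. So the resulting flow of work is: the parser reads the config file and adds tests to the queue, the workers pull tests from the queue and run them, and the results are reported as notifications.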
custodian works because it allows us to easily add more nodes to actually perform the tests, without changing a line of code or complicating the deployment. As things stand we’re running the “parser” and four of the “worker” processes on a single machine and that is sufficient to run all our tests in under a minute, and notifications come in very quickly as a result.
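The queue-and-workers pattern can be sketched in a few lines of Python. This is an illustration of the idea only, not custodian's code. An in-process `queue.Queue` stands in for beanstalkd, threads stand in for the separate worker processes, and `run_checks` and `fake_probe` are names invented for the example.

```python
import queue
import threading


def run_checks(checks, probe, worker_count=4):
    """Run probe(check) for every check, using worker_count workers."""
    jobs = queue.Queue()
    for check in checks:
        jobs.put(check)  # the "parser" side: enqueue every test

    results = {}
    lock = threading.Lock()

    def worker():
        # Each worker handles one check at a time until the queue is empty.
        while True:
            try:
                check = jobs.get_nowait()
            except queue.Empty:
                return
            try:
                ok = bool(probe(check))
            except Exception:
                ok = False  # a probe that blows up counts as a failure
            with lock:
                results[check] = ok

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_probe(check):
    # Stand-in for a real ping/SSH/HTTP probe: "down-*" hosts fail.
    return not check.startswith("down-")


print(run_checks(["web1", "down-db1", "mail1"], fake_probe, worker_count=2))
```

Because the workers only share the queue, adding capacity is a matter of raising the worker count. A slow or timed-out probe ties up one worker rather than the whole run.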
Here are some of the reasons we found custodian more useful than alternative software:
- Diversity – custodian can send you notifications by email, into a redis database for test purposes, or into mauve, our alert management system.
- Speed – custodian lets you run your network checks really quickly.
- Flexibility – checks carried out by custodian are changeable at any time.
How to use custodian: An example case
So, we naturally think custodian is great (we did write it after all). But what is a real-life situation where it is useful? For improving the security of our customers’ hosts!
As a hosting company, we like to make sure that our IP-space is not used to attack, compromise, or abuse the network.
One of our duties is responding to abuse complaints relating to users who have been unlucky enough to have had their machines compromised, with the result that those machines start scanning for security issues or sending spam emails.
Although we appreciate hearing of abusive hosts within our network, we wanted to be able to spot these ourselves. To do this, we launched an internal scanner which will scan our network space looking for poorly configured software.
These scans are performed by custodian. The configuration issues it looks out for are as follows:
- Open SMTP-relays, allowing spam mail to be sent through them
- Open recursive DNS servers
- Open HTTP proxy servers
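As an illustration of what one of these checks involves, here is a rough Python sketch of the open recursive DNS case. It is not the scanner's actual implementation, and the function names are made up for the example. It builds a query with the "recursion desired" bit set and then inspects the header of the reply. Sending the query over UDP port 53 is left out. It should be sent for a name the server is not authoritative for, from an address outside its intended clients.

```python
import struct


def build_query(name, query_id=0x1234):
    """Build a DNS query for an A record with 'recursion desired' set."""
    # Header: id, flags (RD = 0x0100), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def looks_open_recursive(response):
    """Return True if the reply header suggests the server recursed for us."""
    if len(response) < 12:
        return False  # too short to contain a DNS header
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    is_response = bool(flags & 0x8000)          # QR bit
    recursion_available = bool(flags & 0x0080)  # RA bit
    rcode = flags & 0x000F                      # 0 means "no error"
    return is_response and recursion_available and rcode == 0 and ancount > 0


query = build_query("example.com")
print(len(query), query[:12].hex())
```

A server that refuses the query, or answers without offering recursion, would not be flagged by this test.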
We’re planning to scan our network space at least once a month, which should be often enough to detect real problems, but not so often that we’re showing up in log-files, or suffering too much administrative overhead.