Thanks to all the folks who’ve signed up for our beta at http://bigv.io/
We’re polishing BigV to a shiny finish before the beta software goes out in June. It’s not the finished product, but we think we’ve got enough to show off how the future of hosting should work (cue trumpets). We have 512GB of RAM to play with, 64 CPU cores, 12TB of regular discs, and 3TB of SAS discs. We may be pushed for space to accommodate all the beta testers, because we want those people to have the run of the hardware.
I say "the future of hosting", but unlike lots of other hosts, Bytemark are still going to be selling servers, just like the old days. But BigV will give you flexible billing, so you can fire up and pay for a server for only a few hours – if that’s all you need. It will bring flexibility, so you can change your servers’ RAM or disc space instantly too. And it will bring resilience: so we will have the capability to shift customers’ servers around our cluster of hardware, if we think any of it is going to fail, or needs maintenance, or an upgrade. If we’re feeling particularly clever this might become automatic.
Too clever by half
But as Amazon’s monstrous outage shows (an outage that would bury any other host’s reputation for reliability) it is possible to overthink failover protocols. We recognised this as the biggest risk with BigV and I thought you might be interested to hear about our architecture in more detail. The promise that "it’s a magic cloud, and you don’t need to worry" won’t persuade anyone for much longer. You’re going to want to know the risks you’re taking by subscribing to a big virtual hosting platform.
I will shock you to your core by telling you that Bytemark’s current virtual machine product is a set of hairy Ruby, shell & Perl scripts. It was originally written in about eight weeks while I was being under-stimulated in a temp job in 2002. The scripts got passed around in "maintenance mode" for many years and have survived various attempts to "rewrite them properly", including one to use Xen (we dodged a bullet there).
But they do a lot, and we all know how they work, and how they fail. More importantly their virtual machines’ uptimes are mostly in the hundreds of days, spoiled only by the occasional hardware upgrade.
The same, but better
This is still what we want – long uptimes, permanent discs, and easy upgrades. And we still think that’s what our customers will want.
The most important things we wanted to add were:
- reliable live migration – so we can upgrade our hardware without the laborious work of emailing customers, and spoiling their uptimes;
- VM snapshots – so a customer can "checkpoint" their whole system before a major upgrade, and back out if it goes wrong;
- access to all of KVM’s great features – graphical consoles, installations from CD, direct network access, and anything else they’d be able to do if they had the server in front of them;
- a handy tool for provisioning, server upgrades and maintenance – a uniform interface to the software, rather than (or at least as well as) the 1980s text console you get at the moment;
- really flexible storage, so that servers could use terabytes, not just a few tens of gigabytes;
- a sane software development process and test rig, so we could add features to our live system without errors.
With BigV we’ve turned our simple system into a distributed one, full of features and with (we hope) the minimum possible complication. To do this, BigV has three types of servers instead of one:
The Brains hold the database of all virtual machines in a BigV cluster, and run the gateway for customers to issue requests for servers.
The Heads are packed full of CPU cores and memory, and run the KVM processes, aka virtual machines. They don’t have any storage.
The Tails are high-spec servers, but have RAID cards, a normal amount of RAM, and lots of directly-attached discs which can be hot-swapped.
The heads and tails are always connected to the brains, and one of the brains takes on the role of master brain. That’s the one that keeps a complete list of every virtual machine and disc in the cluster, and that’s what you (the customer) talk to when you ask to provision a new VM.
The heads and tails are also connected to a 10-gigabit storage network, so that the KVM processes can talk to their discs really quickly.
The brain can decide to move either virtual machines or discs between any pair of heads and tails, without having to reboot affected systems. So that gives us our hardware nirvana – no live customer system need ever be tied to a piece of hardware again.
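To make that concrete, here’s a toy sketch of the kind of placement decision the brain makes when a VM needs a head. None of these names are BigV’s real classes or API – it’s just the shape of the problem:

```ruby
# Toy model of the brain choosing a head for a VM: take the head with
# the most free memory that can actually fit it. All names invented
# for illustration -- not BigV's real code.
Head = Struct.new(:name, :free_ram_mb)

def place_vm(heads, vm_ram_mb)
  candidates = heads.select { |h| h.free_ram_mb >= vm_ram_mb }
  candidates.max_by(&:free_ram_mb) # nil if nothing has room
end

heads = [
  Head.new("head1", 4096),
  Head.new("head2", 16384),
  Head.new("head3", 1024),
]

puts place_vm(heads, 8192).name # => head2
```

The real decision presumably weighs more than free RAM – CPU load, network locality, pending maintenance – but “fit it where there’s room” is the core of it.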
How it’ll screw up
I’m still mapping out the ways in which this system can break, and may find a few more during testing. The main hazard is network segmentation: we rely on a lot of different local networks between the heads, tails and brains, and if those get misconfigured, the worst cases are bringing down every VM at once or freezing disc I/O.
If a head is disconnected, the brain can simply spread its VMs around to other heads. The rule at the moment is that this happens after an unexpected disconnection period of two minutes. If the head gets back in touch, nothing happens. If it doesn’t, the head is under instructions to kill all of its VMs, and the brain will assume this has happened.
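That two-minute rule is simple enough to sketch. The names and numbers here are illustrative (the real logic lives in the brain and is more involved):

```ruby
# Sketch of the disconnection rule: a head that has been out of touch
# for under two minutes is given the benefit of the doubt; beyond that
# it must kill its own VMs, and the brain restarts them elsewhere.
GRACE_PERIOD = 120 # seconds

def head_state(disconnected_at, now, reconnected)
  return :ok if reconnected
  if now - disconnected_at < GRACE_PERIOD
    :waiting        # brain does nothing yet
  else
    :presumed_dead  # head kills its VMs; brain respawns them on other heads
  end
end

head_state(0, 60, true)   # => :ok   (got back in touch, nothing happens)
head_state(0, 60, false)  # => :waiting
head_state(0, 180, false) # => :presumed_dead
```

The important property is that both sides apply the same rule: the head kills its VMs at the same deadline the brain starts restarting them, so the two never knowingly run a VM twice.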
With tails there’s less automation: if your data is on a "broken" tail, it stays disconnected until that tail is fixed. However there are humans in this system too, and well-tested RAID setups, and redundant power supplies, and mirrored memory, and SAS switches. We know how disc-based systems break, we monitor them, and we’ll fix them. We’d rather run the risk of a couple of hundred VMs going down in rare circumstances, for a few minutes, than risk people’s data with an automatic recovery mechanism.
In a situation where random cables are pulled out of a BigV cluster, and then put back again, we expect affected VMs (which could be all of them, depending on which cables are pulled!) to do one of two things: freeze or reboot. Nothing worse. And it should be a stable system – once it’s put back together, everything will reboot the way it was. There are a few safeguards to prevent two copies of a customer’s VM from running, but I’m still trying to imagine the kind of failure that would make this possible.
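As an illustration of the general idea behind that safeguard (not necessarily what BigV will ship), a generation token is one common technique: the brain bumps a per-VM counter each time it starts a copy, and any copy holding a stale token must shut itself down:

```ruby
# Generic split-brain guard: a monotonically-increasing generation
# token per VM. Invented names -- a sketch of the technique, not
# BigV's actual mechanism.
class Brain
  def initialize
    @generation = Hash.new(0)
  end

  # Issue a new token each time a VM is (re)started somewhere.
  def start_vm(vm)
    @generation[vm] += 1
  end

  # A head checks its token is still current before letting a VM run.
  def current?(vm, token)
    @generation[vm] == token
  end
end

brain = Brain.new
token_a = brain.start_vm("vm42") # original copy
token_b = brain.start_vm("vm42") # restarted elsewhere after a failure
brain.current?("vm42", token_a)  # => false (old copy must shut down)
brain.current?("vm42", token_b)  # => true
```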
That might still sound complicated, but it’s as simple as I could make it while fulfilling our objectives. Assuming it works well in testing, BigV will open a lot of doors for us commercially, letting us offer servers that nobody else can. Plus we get all the benefits of an in-house technology.
The beta is still on for June, so if you’ve not already expressed an interest, head to http://bigv.io/ and do it!