Some BigV growing pains

If you’re using BigV, you might have seen this message recently:

INFO: task jbd:/dev/vda3:xxx blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

For this: we’re sorry! I know many of you have seen these ‘I/O waits’ as we rearrange our storage pools, fixing problems on the fly. That amounts to poor service, and we are aiming to do much better. To our growing band of users, thank you so much for sticking with us and reporting when you’ve had an issue. We are bringing BigV up to the same high standards as our existing 10-year-old VM service, and it has taken a little longer than I wanted.

BigV virtual machines should behave exactly like supremely-flexible dedicated systems. If you haven’t seen how high we’re aiming, check out my paper [PDF] from 2012 describing the architecture (which will explain my mentions of “heads” and “tails”). We think your cloud should give you high uptime and permanent storage – not a million reasons to blame yourself for “not using it right” or “not using enough of it”.

It didn’t help that the Saturday before last (27 October), our network got hit with the worst denial-of-service attack we’ve ever seen. That dissipated after less than an hour, but BigV suffered for longer due to some stingy network provisioning. The attack cut off our storage network for many customers, and a few reboots were needed to get them moving again. We understood the issue, and that has now been fixed.

Any network is vulnerable to being cut off by a sufficient mass of abusive traffic, but at least we are certain that BigV’s operation can’t be disrupted in the same way again.

The best tools – we have them

When Peter and I started Bytemark in 2002, we were selling 64MB virtual machines on hosts with a Pentium 4, 2GB RAM, and two discs. There was no live migration – there wasn’t even not-live migration! If a host went wrong we just had to fix it. It was three years before I wrote a “move everything to another host” script (which made us quite good at fixing broken hardware). The virtualisation was slow, and the overall performance really hard to maintain. One misbehaving customer could drag the whole system down, and we didn’t even have iotop to tell us who it was.

So much magic has happened since then.

Given the development of KVM and our own flexnbd “SAN-lite” project, our tools are now the stuff of stupendous luxury. But I’m rather sheepish that we’ve not used them very well so far. We can, and do, migrate virtual machines between both heads and tails. To date, we’ve moved everyone’s VMs between two data centres – twice. Those migrations were (almost) invisible, we didn’t announce them, and they went as planned.

The problems from the last four weeks have been:

1) Kernel compatibility with the high-end hardware that we’re buying.

Even with up-to-date versions of Linux, 10Gbps networking using IPv6 doesn’t seem solid without some surprising tweaks. We’ve seen nasty kernel crashes when enough traffic comes in. Perhaps our incredulity suppressed the obvious and pragmatic solution: deploying more hosts. We’re not even short of them – we just liked the 384GB and 768GB hosts and believed that they should be fixable in a short time.

So as of four weeks ago, we’ve appointed Chris as a kernel builder and stress tester. He is keeping one of our larger hosts out of service (the one that forced a couple of unnecessary reboots on us). We’re also going to deploy lots of smaller hosts (possibly slumming it on 128GB RAM, but needs must). That will spread the risk of a single broken host taking down a lot of customers at once.

The result should be higher uptime all-round.

2) Over-contended storage, our old enemy.

When we deployed revision 373 of our storage server, we finally unleashed the full power of our 10Gbps network storage, which had previously been held back by some of my inefficient code.

That meant VMs could hit the full capacity of our disc pools – but at the cost of drowning them in I/O requests when one misbehaved. We’ve re-learned some old lessons here, and enabled the Linux CFQ (“Completely Fair Queuing”) I/O scheduler to stop one person’s I/O storm from causing the dreaded “blocked for more than 120s” message.
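If you want to do the same on your own Linux systems, the scheduler can be switched per-disc at runtime. Here’s a minimal sketch, assuming an older non-multiqueue kernel and a disc called sda (your device names will differ):

# See which I/O scheduler a disc is using
cat /sys/block/sda/queue/scheduler
noop deadline [cfq]        # example output – the active scheduler is in brackets

# Switch that disc to CFQ at runtime (as root)
echo cfq > /sys/block/sda/queue/scheduler

# Add elevator=cfq to the kernel command line to make it the default at boot

On the tails, that one change goes a long way towards stopping a single noisy neighbour from starving everyone else’s requests.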

The tails themselves were fine and coped as best they could, but there is a cut-off point where the kernel gives up and needs a reboot before it’ll try again, and we should have acted before you (BigV users) saw that.

So in the short term, we have the capability to deploy more hosts and divide up the I/O pools into smaller chunks. We’ll be monitoring closely and using our powers of live migration to avoid trouble before it happens.

Where we’re heading with storage

This leads us to a grander plan for storage – guaranteed I/O rates for all storage grades. We are monitoring our performance across all grades and we’re aiming to publish and stick to IOPS and throughput figures for each grade. We’ll be developing flexnbd to meet these goals after Christmas, as well as deploying a second BigV cluster in our York data centre for split-site hosting.
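If you’d like to measure your own disc’s performance while we work on that, a quick and rough way is a random-read run with the standard fio benchmarking tool (the test file name, size and runtime below are just examples, not an official BigV test):

# 4k random reads for 60 seconds against a 1GB test file, bypassing the page cache
fio --name=randread-test --filename=/var/tmp/fio.test --size=1G \
    --rw=randread --bs=4k --iodepth=32 --ioengine=libaio \
    --direct=1 --runtime=60 --time_based --group_reporting

The IOPS and bandwidth figures in fio’s summary are the same sort of numbers we intend to publish per grade.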

In the meantime, thank you to the hundreds who continue to use and recommend BigV. We are aiming far higher than the complex and fragile world of “cloud computing” (or “hosting” as I still like to call it) and I love each and every one of you trusting your application to our network.

If BigV isn’t working for you, for any reason, don’t accept it! Contact our support team: we’ll be happy to help isolate your application on our network, we’ll be generous with discounts to keep your hosting going, and we’ll continue to be open about current issues on our outages forum.

As ever, I’m happy to answer any questions privately (matthew@bytemark.co.uk), and will update this post if I’ve missed anything.