How Bytemark are moving between data centres
As you may know, Bytemark bought its own building and built a data centre in York back in 2013. Previously, we had been renting data centre space in Manchester since 2008.
At first, we didn’t think we would go to the hassle of moving people across. Then we got our renewal quote from our suppliers in Manchester. It was huge!
As a result, in July 2014 we made the decision to move two-thirds of our servers in Manchester to York. The total number of servers moved or retired by this December will be nearly 1000. These are servers that customers expect to be online 24/7, many with difficult network setups.
We have done server moves before, and we learned a lot of valuable lessons from those events, but we had not attempted anything on this scale. We needed to keep disruption to a minimum, schedule the moves as precisely as we could, and keep all the processes in our own hands.
7 Steps for a successful data centre move
So here’s the story of how we successfully moved our data centre on the night of the 23rd September, and what we have learned from the process — from our initial planning to the finishing touches.
Step 1: Set a schedule
We added all server data into a shared spreadsheet. This turned out to be the lynchpin of the operation, and a great tool for real-time co-ordination.
We could have built a Rails application, database and so on, but that takes time! You can’t beat the flexibility and speed of a spreadsheet for one-off, multi-team, multi-site co-ordination. Plus, you can get results quickly.
Our goal was to empty 2 full cages; about 750 servers were identified as being candidates for moving. We tried to stick to the “easy” moves first and approached the customers with more complex configurations later on.
We divided the moves into 5 nights and each night had 4 or 5 separate vans which would leave Manchester at half-hour intervals.
Every server went onto the sheet: old and new rack locations, IP address, dates, times and so on. From that point on, any changes to the plan were reflected clearly on the spreadsheet, and any member of the team could update it safely.
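To give a flavour of it, each row of the sheet carried something like the following (the column names and values here are illustrative, not a copy of the real sheet):

```
server         old rack       new rack       ip            night     van  status
example-vm-1   MAN: A3/U12    YO26: B4/U17   192.0.2.10    23 Sept   B    notified
```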
Step 2: Tell the customers
Once we had a schedule, we emailed the customers, several weeks in advance, to let them know that we were going to move their servers. We could tell them when they were going to get switched off, and (based on a 90-minute journey time, plus loading and unloading) give a reasonable estimate of when they’d be back on again.
Our email script read the move windows from the spreadsheet and opened tickets through our ticketing system (that’s another win for the spreadsheet “solution”).
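Roughly speaking, the script boiled down to something like this (the file name, column names and the exact rt invocation below are illustrative rather than lifted from the real script):

```ruby
#!/usr/bin/env ruby
# Sketch only: 'moves.csv', its columns and the rt arguments are assumptions.
require 'csv'
require 'erb'

SUBJECT = ERB.new('<%= server %> is being moved on <%= date %> at <%= start %>')

CSV.foreach('moves.csv', headers: true) do |row|
  server = row['server']
  date   = row['move_date']
  start  = row['window_start']

  subject = SUBJECT.result(binding)

  # Open a tracked ticket via RT's command-line client; queue, requestor and
  # body are omitted here and would depend on the real RT setup.
  system('rt', 'create', '-t', 'ticket', 'set', "subject=#{subject}")
end
```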
We had also offered customers an open invitation to move to York, which helped to identify those who were most willing to move; we took this into account alongside our schedule.
Step 3: Tell them again, differently, and keep track of replies
The same script emailed our customers again at weekly intervals with a slightly differently-phrased message, to try to ensure our first email hadn’t been ignored. And the final email was different again.
Subject: <%= server %> is being moved on <%= date %> at <%= start %>
Subject: REMINDER: <%= server %> is being moved, evening of <%= date %>
Subject: FINAL WARNING: <%= server %> is being moved evening of <%= date %>
We changed the wording in the body to try to get people’s attention any way we could, because we didn’t have the staff to ring round every customer and needed to make sure we got heard!
Our script used RT’s excellent command-line interface to open new support tickets for customers, which let us track replies. It was really useful to track each customer’s move individually through RT, rather than just sending untracked email blasts. When customers replied, we had a full email chain recorded, and we could set extra fields per ticket which allowed the data centre team to search and categorise correspondence. RT’s custom fields are handy like that:
This semi-integration between RT and a shared spreadsheet didn’t always look pretty, but it allowed everyone to move quickly and coordinate their work in batches.
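In sketch form, the escalation amounted to picking a sterner template each week; the reminders_sent counter here is an assumption about how the sheet tracked progress, and the templates mirror the subjects above:

```ruby
# Sketch only: 'reminders_sent' is an assumed column, not the real sheet's.
require 'erb'

TEMPLATES = [
  '<%= server %> is being moved on <%= date %> at <%= start %>',
  'REMINDER: <%= server %> is being moved, evening of <%= date %>',
  'FINAL WARNING: <%= server %> is being moved evening of <%= date %>'
]

def subject_for(server:, date:, start:, reminders_sent:)
  template = TEMPLATES[[reminders_sent, TEMPLATES.length - 1].min]
  ERB.new(template).result(binding)
end

puts subject_for(server: 'example-vm', date: '23rd September',
                 start: '9.30pm', reminders_sent: 1)
# => REMINDER: example-vm is being moved, evening of 23rd September
```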
Step 4: Put stickers on everything
Before each moving day, we labelled up every server that was going to move. We included its name, exact destination and the move window (so A = the 9.30 pm van, B = 10.00 pm etc.). That let us spread the work of selecting and moving the servers among a lot of people. Just look for the big letter, shut the server down at the keyboard, and pull it out.
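A label generator for that scheme might have looked something like this; the window-to-time mapping follows the plan above, while the field names and layout are invented for illustration:

```ruby
# Sketch only: server names, rack references and label layout are made up.
VAN_TIMES = { 'A' => '9.30pm', 'B' => '10.00pm', 'C' => '10.30pm',
              'D' => '11.00pm', 'E' => '11.30pm' }

def label_for(server:, destination:, window:)
  <<~LABEL
    #{window}  (#{VAN_TIMES.fetch(window, 'TBC')} van)
    #{server}
    -> #{destination}
  LABEL
end

puts label_for(server: 'example-server-42',
               destination: 'YO26 rack B4, U17',
               window: 'B')
```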
Step 5: Alter our databases & network
As we started the moves, we had a lot of internal systems to update.
Our rack organiser keeps track of all the servers, racks and cabling between them. It only lets us update one server and cable at a time, so earlier in the evening, Peter had a lot of clicking to do, updating every server that was going to move.
Likewise, our network needed reconfiguring so that our routers would know these servers’ IP addresses were now in York rather than Manchester. Again, there was no automation involved, and Tom needed to perform each configuration step by hand. We’d organised the vans by network so he could move each network as its servers were being powered off.
We debated whether to build some special-purpose automation to make this process less boring, but decided against it, because:
- Building it seemed unlikely to take much less time than it would save
- We’d end up with user interfaces that would not be tested again for years
- Well-executed, well-rehearsed, ‘boring’ processes are less likely to go wrong
There is certainly more automation we could build into our network and rack management, but it’s not the kind of thing we’d throw together for one occasion. So that night, there was a lot of clicking, and the job got done.
Step 6: Get the team, and the vans, ready
The vans lined up in Manchester at 9pm and our team powered servers off and packed them, 20 or 30 at a time. The vans left at half-hour intervals and each was greeted warmly by the York team, pictured here at midnight:
The physical side of the move was a production line, starting with unloading the van:
Our DS27 cases have been great long-serving chassis, but half of them needed their fans turning around to fit the front-to-back airflow in YO26.
[Our data centre is built around contained “pods”, where the cold air is piped up in the middle and gets blown through the servers into the main space of the DC. It’s more efficient because we use less cold air, but servers won’t stay cool if they’re not sucking in the chilled air.]
So before we could re-rack these servers, the next stage of the production line was to sort them and open up those that needed their fans reversing:
As each server was racked, Sam updated the shared spreadsheet to say that he’d powered it up and it looked like it was booting:
One by one, the servers were racked and checked in on the spreadsheet:
Step 7: Do they boot?
That was the cue for our sysadmin team upstairs to log into the box, check that the serial line was working, and make sure there was nothing stopping it from booting. So we looked for long-running filesystem checks, dead disks and botched kernel upgrades. Once each box was answering pings and showing a login prompt, we updated the spreadsheet to say the box was done! Then we moved on to the next.
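The upstairs checks boiled down to something like the sketch below, with pings and an SSH banner standing in for the serial-console login prompt (host names invented):

```ruby
# Sketch only: hosts are invented, and the real check also involved the
# serial console and a human eyeballing the boot messages.
require 'socket'
require 'timeout'

def pings?(host)
  # Shell out to ping so we don't need raw sockets (or root) for ICMP.
  system('ping', '-c', '1', '-W', '2', host,
         out: File::NULL, err: File::NULL)
end

def ssh_banner?(host)
  Timeout.timeout(5) do
    TCPSocket.open(host, 22) { |s| s.gets.to_s.start_with?('SSH-') }
  end
rescue StandardError
  false
end

%w[example-box-1 example-box-2].each do |host|
  ok = pings?(host) && ssh_banner?(host)
  puts "#{host}: #{ok ? 'done' : 'needs attention'}"
end
```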
At the same time our “on-call” sysadmin was watching for alerts; Nat came into the office for the occasion (for this role, staff usually work from home, waiting for texts).
The alerting system lets them know about urgent support requests raised by customers, monitoring failures on managed clients’ machines, and various other automated checks. Many of these alerts were expected, and Nat’s job was to deal with the unexpected: answering the phones, as well as handling the usual load of urgent support requests unrelated to the move.
Between the failures we expected and the failures we didn’t, we worked down the list until every server was back up and running, and marked the customer tickets as Resolved.
And that’s it — every box was moved, every one checked, and everyone could go home at around 5.30am. Well, it was nearly that smooth…
What went wrong: The “Never Events”
What we were trying to avoid was taking customers offline unexpectedly. Every customer was notified and agreed to be taken offline for a particular time, or we made some other arrangement with them.
But in the two big moves so far, we’ve made a couple of errors. That is, boxes that should have moved didn’t, or boxes did move that shouldn’t have. In both cases, we could make quick network fix-ups to get them back online on the same IP address, but it took a puzzled customer to notify us.
A British hospital would call these “Never Events”, things that should be optimised away at any cost. So we’ve gone back to the details of our communication between teams to ensure the same thing can’t happen, at least not for the same reasons.
Beating fatigue
Once the servers were all racked up, everyone thought: hooray, we’re nearly done! When we were trying to get them all booting, there was a human tendency to go through the long list of servers and fix the easy problems first. That shoved the non-obvious problems to the end of the night.
But that’s a mistake: by doing the easy servers first, we didn’t spot systemic problems, e.g. a network misconfiguration in York affecting several servers. And fatigue made trying to solve those problems much worse.
So we needed, at the very least, to resist our tendency to tick off the easy cases first, and to dig into the deep problems sooner.
We were also considering a third wave of system administrators to start work at 3am and look at these problems fresh, allowing the earlier move team to go home. That team might spot systemic problems more quickly, but at the risk of spreading our staff thinner. We’re still deciding what to do for the next move.
Wasting time
All our customers should do a test reboot before being moved, right? We emailed and we emailed and said pleeeeeease reboot, just once. But we ended up…
…a little disappointed at the number of people who didn’t perform that test. We don’t hold it against our customers of course!
But it’s proof that we should be trying to annoy people into doing it, because flushing these problems out beforehand prevents issues with customers’ machines and reduces downtime on the night. I think our reminders could be improved, and I can risk being more annoying where we know it hasn’t been done yet.
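One way to be annoying only where it counts would be to flag servers whose uptime shows no reboot since the first notice went out; here’s a sketch, assuming we can query uptime on each box (which won’t be true for every customer):

```ruby
# Sketch only: the notice date, host names and ssh access are assumptions.
require 'date'

FIRST_NOTICE = Date.new(2014, 9, 1) # illustrative date

def uptime_days(host)
  # /proc/uptime reports seconds since boot on Linux; here we read it over ssh.
  `ssh #{host} cat /proc/uptime`.split.first.to_f / 86_400
end

%w[example-box-1 example-box-2].each do |host|
  days_since_notice = (Date.today - FIRST_NOTICE).to_f
  if uptime_days(host) > days_since_notice
    puts "#{host}: no reboot since the first notice; send a sterner reminder"
  end
end
```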
Finally, we did spend two hours sweating over a broken server that had been cancelled by the customer years earlier. No customer suffered, but palms hit faces and fists were shaken at the sky. That was just a reminder to check over our cancellation records.
What’s left
At this point, we’d done 2 out of 5 big moves. I’m pleased that we’ve managed to coordinate this without any panic from staff or customers.
For the remaining three moves, we will add further discipline, communication, and double-checking to ensure these “never events” are gone.
And that’s it! Many, if not all, of the gory details of our server moves. Hopefully they’re the last big ones we’ll need to do, now that we own our own place.