SBS 2011 bare metal restore failure
The single best feature of Microsoft Small Business Server 2008 / 2011 is without a doubt the ability to do a full bare metal restore. In other words, if your server dies, you can restore it to exactly how it was at your last backup, and in record time too! This is something that we test on a regular basis and on every server that we install.
This weekend we were running through an SBS2003 to SBS2011 swing migration. Traditionally, when we moved clients away from their SBS2003 servers, we would go in at a weekend, remove the old server, install the new one, transfer the data and then reconfigure every single computer on the network. This was a very time-consuming and mentally draining exercise, especially when you have 30+ PCs to reconfigure. Not something I’d like to do every month!
Microsoft have released an official migration path to SBS2011, but this has its pitfalls too: the need to carry out some reconfiguration at every PC, and the fact that you have 7 days to complete the migration. If you overrun this time frame due to technical issues, you then need to do a bare metal restore to a point before the migration started, with all the hassle that entails.
The method that we have just started to use is the Swing Migration by Jeff Middleton. Jeff is a bit of a legend in the SBS community and I had the pleasure of meeting him recently at the Edinburgh Microsoft/HP SBS community roadshow. The swing migration allows you to perform the bulk of the SBS2003 to SBS2011 migration offline, with the client still working away on their old SBS2003 server. After this, there is minimal client reconfiguration needed and transition day is much simpler. The only downtime comes when we actually move the client’s data from the old server to the new one. Swing Migration is something that we’ll definitely be using more of.
It was during a swing migration that we had a problem that really threw us. We were preparing the client’s new Dell PowerEdge T410 server here in our office, getting ready to install it the next day. While installing Microsoft updates on the new SBS2011 server, we installed a SharePoint patch (http://support.microsoft.com/kb/2553413), ran PSConfig (as it seems we have to after all SharePoint updates) and then rebooted. After the reboot we tested SharePoint and got a “Service Unavailable, HTTP error 503” (how come we always get errors with SharePoint???!!!).
We spent over an hour troubleshooting this issue but didn’t really get anywhere, so we decided to restore the server to a point a couple of hours earlier using the normally excellent Windows Server Backup utility that is part of the operating system. One hour later the server booted back up, but instead of a login screen we got a black screen with a white mouse cursor. We waited and waited and waited, but nothing happened.
A bit of research and we found the cause of the problem. As part of the Swing Migration, you are advised to install the Windows NT Restore Utility (http://support.microsoft.com/kb/974674), which allows you to transfer the client’s data from the old server to the new server. It seems that once this utility is installed, subsequent backups are corrupt, preventing a full bare metal restore. This is disastrous!!! It’s a pretty rare issue, but one that I feel we’ll see cropping up on blog sites soon enough. I’m just glad we found it in a pre-production environment and not in two years’ time during a disaster!
I’ll be contacting Microsoft to let them know about this issue and will also contact Jeff Middleton so that the Swing Migration procedure can warn against it. Hopefully this will save someone’s business somewhere down the line.
For us, it means a failed Swing Migration, a few lost days of work and a delay for the client in getting their new server, but we learned a very good lesson, so in the end it was kind of worth all the hassle. So the moral of the story? Test your backups regularly: not just by restoring a single file, but by restoring the whole server. I also like to use two backup methods to guard against a single bizarre issue like this. I’ve been in enough disaster recovery situations to know that the more backups and backup types, the better!