15 Mar 2010

2008 Server Freeze, Hyper-V or Volume Shadow Copy Bug?

We have been scratching our heads over a very strange problem for the last 4 weeks which causes two new servers to lock up for up to 2 hours after logging on after a reboot. They’re running Windows 2008 R2 with Hyper-V and Windows Server Backup roles installed.

After trying plenty of ideas to eliminate the problem it was pointed out to us by a Microsoft support guy that our System hive file was 343MB in size. It’s only supposed to be 15 to 20MB. I exported it as an ASCII file from regedit and opened it in Notepad. I counted 24,000 entries for VSS Snapshot devices! When Windows boots it tries to process 24,000 devices which causes it to choke killing the server for two hours – although the VMs limp on underneath and the host responds to pings but both the remote and local console is completely frozen.

Example registry entry:

[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Class\{533C5B84-EC70-11D2-9505-00C04F79DE2F}\0349]
"InfPath"="volsnap.inf"
"InfSection"="volume_snapshot_install"
"InfSectionExt"=".NTAMD64"
"ProviderName"="Microsoft"
"DriverDateData"=hex:00,80,8c,a3,c5,94,c6,01
"DriverDate"="6-21-2006"
"DriverVersion"="6.1.7600.16385"
"MatchingDeviceId"="storage\\volumesnapshot"
"DriverDesc"="Generic volume shadow copy"

Trying to delete the snapshots using vssadmin from the command prompt threw this error: “Error: Snapshots were found, but they were outside of your allowed context.  Try removing them with the backup application which created them.

So the question is what is causing 1000’s of VSS (volume shadow copy) snapshots to be created? A clue was found in the system event log when Windows Server Backups runs: “Failed to delete the shadow copy (VSS snapshot) set with id '1A1938A0-1590-4BF4-8173-20DF5FD69E36' in the running virtual machine 'MGT01': Unspecified error (0x80004005). (Virtual machine ID A3F941F1-ED7F-48E9-9CD7-CB7C28A6604A)

We’re using Windows Server Backup (WSB) to take incremental backups every 30 minutes for a bare metal restore of the host and its Virtual Machines. That’s 48 backups a day of 14 VHDs for 42 days that the servers have been running for. Do the maths and that comes to 28,000 VSS snapshots. Taking into account that some backups failed to run and we stopped backups for a few hours here and there, this tallies with the 24,000 devices I counted in the registry. Bingo!

So the bottom line is that the VSS writer creates a snapshot for each VHD at backup time but for some reason isn’t deleting the entries from the registry, although it is deleting the actual snapshots otherwise we’d have run out of disk space by now. Everything points to a bug in either the VSS writer or perhaps WSB or Hyper-V. They’re so tightly integrated during the backup process it’s hard to say which of the 3 is the culprit.

Since this problem is reoccurring on two new servers from Dell we are sure this isn’t a one-off freak incident. There is only 1 other similar incident reported on the web and that was a year ago on a HP server using BackupExec with the Hyper-V aware option. I’m waiting for Microsoft to get back to me, although I’ve been warned that even if they admit it’s a bug it could take a long time to produce a fix. We’d love to know why 1000’s of people who use Hyper-V and take frequent backups aren’t experiencing the same problem. There is no other software installed on the host apart from standard Dell drivers. Weird!