15 Mar 2010

2008 Server Freeze, Hyper-V or Volume Shadow Copy Bug?

We have been scratching our heads for the last 4 weeks over a very strange problem that causes two new servers to lock up for up to 2 hours when we log on after a reboot. They’re running Windows Server 2008 R2 with the Hyper-V role and the Windows Server Backup feature installed.

After we had tried plenty of ideas to eliminate the problem, a Microsoft support engineer pointed out that our System hive file was 343MB in size. It’s only supposed to be 15 to 20MB. I exported it as an ASCII file from regedit and opened it in Notepad. I counted 24,000 entries for VSS snapshot devices! When Windows boots it tries to process all 24,000 devices, which causes it to choke and freezes the server for two hours. The VMs limp on underneath and the host responds to pings, but both the remote and local consoles are completely frozen.

Example registry entry:

[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Class\{533C5B84-EC70-11D2-9505-00C04F79DE2F}\0349]
"InfPath"="volsnap.inf"
"InfSection"="volume_snapshot_install"
"InfSectionExt"=".NTAMD64"
"ProviderName"="Microsoft"
"DriverDateData"=hex:00,80,8c,a3,c5,94,c6,01
"DriverDate"="6-21-2006"
"DriverVersion"="6.1.7600.16385"
"MatchingDeviceId"="storage\\volumesnapshot"
"DriverDesc"="Generic volume shadow copy"
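For anyone who would rather not count by hand in Notepad, here is a minimal Python sketch of the same count. It assumes the export has the same shape as the entry above and treats every key whose "InfPath" is "volsnap.inf" as one snapshot device. The function name and the sample text are made up for illustration; the 0350 key and the disk.inf key are invented sample data, not taken from our servers.

```python
import re

# A section header in an exported .reg file, e.g. [HKEY_LOCAL_MACHINE\...\0349]
SECTION_RE = re.compile(r"^\[(.+)\]\s*$")

# Illustrative sample only: the second and third keys are hypothetical.
SAMPLE = r'''
[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Class\{533C5B84-EC70-11D2-9505-00C04F79DE2F}\0349]
"InfPath"="volsnap.inf"
"DriverDesc"="Generic volume shadow copy"

[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Class\{533C5B84-EC70-11D2-9505-00C04F79DE2F}\0350]
"InfPath"="volsnap.inf"

[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Class\{4D36E967-E325-11CE-BFC1-08002BE10318}\0000]
"InfPath"="disk.inf"
'''


def count_volsnap_entries(reg_text: str) -> int:
    """Count registry keys whose InfPath value is volsnap.inf."""
    count = 0
    in_section = False
    counted = False  # make sure each key is counted at most once
    for raw_line in reg_text.splitlines():
        line = raw_line.strip()
        if SECTION_RE.match(line):
            # A new key starts here, so reset the per-key flag.
            in_section = True
            counted = False
        elif in_section and not counted and line.lower() == '"infpath"="volsnap.inf"':
            count += 1
            counted = True
    return count


print(count_volsnap_entries(SAMPLE))
```

To run it against a real export, read the file into a string first. A standard regedit .reg export is UTF-16, so something like open(path, encoding="utf-16").read() should work, while the older Win9x/NT4 export format is plain ANSI text.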

Trying to delete the snapshots using vssadmin from the command prompt threw this error: “Error: Snapshots were found, but they were outside of your allowed context. Try removing them with the backup application which created them.”

So the question is: what is causing thousands of VSS (Volume Shadow Copy) snapshots to be created? A clue was found in the system event log when Windows Server Backup runs: “Failed to delete the shadow copy (VSS snapshot) set with id '1A1938A0-1590-4BF4-8173-20DF5FD69E36' in the running virtual machine 'MGT01': Unspecified error (0x80004005). (Virtual machine ID A3F941F1-ED7F-48E9-9CD7-CB7C28A6604A)”

We’re using Windows Server Backup (WSB) to take incremental backups every 30 minutes for a bare metal restore of the host and its virtual machines. That’s 48 backups a day of 14 VHDs for the 42 days that the servers have been running. Do the maths (48 × 14 × 42 = 28,224) and that comes to roughly 28,000 VSS snapshots. Taking into account that some backups failed to run and we stopped backups for a few hours here and there, this tallies with the 24,000 devices I counted in the registry. Bingo!
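The same sum as a small Python sketch, in case you want to plug in your own backup schedule. It assumes, as we do above, that every backup run registers one snapshot device per VHD; the function name is just for illustration.

```python
def expected_snapshots(interval_minutes: int, vhds: int, days: int) -> int:
    """Snapshot devices expected if every scheduled backup runs and
    each run registers one snapshot per VHD (an assumption)."""
    backups_per_day = (24 * 60) // interval_minutes
    return backups_per_day * vhds * days


# Our figures: a backup every 30 minutes, 14 VHDs, 42 days of uptime.
expected = expected_snapshots(interval_minutes=30, vhds=14, days=42)
counted_in_registry = 24_000  # approximate count from the exported hive

print(expected)                        # 28224
print(expected - counted_in_registry)  # 4224 fewer than the theoretical maximum
```

The gap between the two figures is the part we put down to failed and paused backups.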

So the bottom line is that the VSS writer creates a snapshot for each VHD at backup time but for some reason isn’t deleting the entries from the registry. It is deleting the actual snapshots, otherwise we’d have run out of disk space by now. Everything points to a bug in the VSS writer, WSB or Hyper-V. They’re so tightly integrated during the backup process that it’s hard to say which of the 3 is the culprit.

Since this problem is recurring on two new servers from Dell, we are sure this isn’t a one-off freak incident. There is only 1 other similar incident reported on the web, and that was a year ago on an HP server using BackupExec with the Hyper-V aware option. I’m waiting for Microsoft to get back to me, although I’ve been warned that even if they admit it’s a bug it could take a long time to produce a fix. We’d love to know why the thousands of people who use Hyper-V and take frequent backups aren’t experiencing the same problem. There is no other software installed on the host apart from standard Dell drivers. Weird!

15 comments:

  1. this is more common than you think

  2. Microsoft tell me there is only 1 other reported case of this happening. If it's common then why aren't more people reporting it to Microsoft? We can't be the only people on the planet doing frequent backups of VMs through the host, unless other people give up and migrate to VMware and an alternative backup plan?

    Replies
    1. Mainly because it's a pain in the A$$ to call Microsoft. You get blamed for the issue, asked about licensing, then accused of stealing said legit software. All to find that their resolution is to re-install the server 80% of the time.

      Not my first tango at the rodeo. Ima bit bitter lol :D.

  3. Thanks Gary.. I have been bashing my head against a wall for the last 6 hours trying to work out what is going on with our servers. We have the same issue.

    I will let you know if I find a resolution.

  4. Cameron, we have a support case open with Microsoft. It's now been passed on to their development team who are looking at how to fix the bug. I'll post an update when I hear back but if you find a workaround please let me know. Thanks.

  5. Have you tried exporting those registry entries to a test machine and rebooting to see if the same symptom occurs?

  6. Hello Anon. :-) I doubt another machine would boot if it uses another machine's registry. We have two identical servers exhibiting this problem so it's not a one-off incident. Microsoft have admitted it's an issue with Windows so it can really only be fixed with a patch for VSS or Hyper-V.

  7. We are experiencing exactly the same symptom as you. I was also informed that it's happening on 3 non-identical servers.

    On 1 of the servers:

    HKEY_LOCAL_MACHINE\SYSTEM\ControlSet002\Control\Class\{4D36E967-E325-11CE-BFC1-08002BE10318} has over 5000 devices listed.

    I exported this key onto a test machine and it didn't reproduce the problem. Can you try doing the same? It would really help in figuring this out :)

  8. It's not just that key which has thousands of entries on our systems...

    HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Class\

    HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\DeviceClasses\

    HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Enum\STORAGE\Volume\

    I suggest exporting from all of these keys. There may even be a 4th and 5th key with thousands of entries. I can't quite see what your experiment is trying to prove, though.

    We noticed that the more registry entries we accumulated, the higher the chance of the machine freezing after a reboot. There is also a random element, because sometimes our machines wouldn't freeze for a couple of days (despite a dozen reboots) and then suddenly they would start freezing every single time.

    One server that we haven't run devnodeclean on is now failing every single backup job. And Storage Manager goes screwy by sometimes showing a red cross over each hard disk's icon. I suspect there are 40,000 VSS devices/snapshots registered on that server and I'm scared to reboot it in case it never unfreezes!

  9. Has there been any further update on this issue? I am experiencing the same issue with Server 2003 and Veritas Backup Exec v10.

    Event ID 25 (shadow copies of the volume were deleted because the shadow copy storage could not grow) is the last entry in System Events before the server locks. It does not come back until I turn it off and on.

  10. No fix yet, but Microsoft are looking at the best way to tackle it that has the least impact on other parts of Windows. (I take that to mean there are a couple of ways to potentially fix the problem but there are pros and cons with each one, especially on the amount of testing to ensure a fix doesn't cause problems with other parts of Windows.)

    I will blog again as soon as a fix has been made available.

  11. Gary, have you had this issue resolved with Microsoft? I appear to have the same issue on one 2008 Standard SP2 server (a Dell PowerEdge 2900 running Hyper-V). If you have a case number with Microsoft could you share it so the tech I am working with can refer to it? Great info in your blog. Without this I would never have waited three hours for the server to become responsive.

  12. Hi Tom. We're testing a private fix at the moment. I don't think I can talk about it publicly at this stage, but they have pinpointed the problem and come up with a very sensible fix. I have no idea when it will be released (we can't go into production using the test fix) but I really hope it's in the next week or two.

    I tried to find your email address but your blog is empty. Drop me a line at gary -at- directpath dot co dot uk.

  13. Steve Stanfield, 25 May 2010 at 21:36

    We are having the same freeze-up issue and now VSS is failing. Has there been any MS fix for this yet?

  14. Steve, please see my latest blog entry for details of the Microsoft hotfix for this issue:
    http://garysgambit.blogspot.com/2010/05/windows-2008-hyper-v-vss-backup-bug.html
