25 May 2010

Windows 2008 Hyper-V / VSS / Backup Bug Part III

Good news! Microsoft has issued us with a public release of the hotfix, which will be known as KB982210. The fix only works on Windows Server 2008 R2, not on any earlier release of the OS.

So how does the fix work? First let me explain the problem more clearly than in previous blog entries. Whenever a device is attached to Windows, the Plug & Play Manager creates an entry for that device in the registry. If it’s a USB device, for example, and you unplug it, its entry remains in the registry so that when it’s plugged back in the computer recognises it along with any settings previously set up for it. The same is true for snapshots created by VSS, the Volume Shadow Copy Service. A snapshot is treated like a device, so once a new snapshot is created, so too is a registry entry for the device.

Now here’s the problem. The registry entries are never removed. While many users will never have a problem with that, there are a number of power users who generate thousands of snapshots over a short period of time. In our case, we use Windows Server Backup (WSB) with a backup scheduled every 30 minutes, which includes backing up 14 VHDs used by several VMs on Hyper-V. VSS creates 14 snapshots (one for each VHD) every time a backup is run. That’s 14 snapshots every 30 minutes, which is 672 a day and over 20,000 per month. You can see how quickly they mount up, and none of the device entries in the registry are being removed.
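
The arithmetic is easy to reproduce for your own schedule. Here is a minimal Python sketch; the function name is hypothetical, and it assumes one snapshot per VHD per backup run on a schedule that runs around the clock, as ours does:

```python
def snapshots_created(vhd_count, interval_minutes, days=1):
    """Estimate how many VSS snapshot device entries accumulate.

    Assumes one snapshot per VHD per backup run and a schedule that
    runs 24 hours a day (both assumptions, based on the setup above).
    """
    runs_per_day = (24 * 60) // interval_minutes
    return vhd_count * runs_per_day * days


per_day = snapshots_created(14, 30)
per_month = snapshots_created(14, 30, days=30)
print(per_day, per_month)  # 672 20160
```

Plug in your own VHD count and backup interval to see how fast your registry is likely to be filling up.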

Severe problems manifest when the host server is rebooted: Windows processes the registry, analysing tens of thousands of devices, and the server looks as if it has hung. Ours froze for 2 or 3 hours, and it would probably have been longer if I had let the server carry on taking more backups.

The registry key you need to check in Windows 2008 R2 is:
HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Enum\STORAGE\Volume

There should be 10 to 50 entries in there for a normal healthy machine, depending on how many devices you have attached. On our server we had 28,000 entries!
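
If you would rather not count the entries by hand in regedit, something like the following Python sketch could do it. It uses the standard library's winreg module (Windows only); the function names and the upper limit of 50 are assumptions based on the rule of thumb above, and you may need an elevated prompt to read the key:

```python
import sys

VOLUME_KEY = r"SYSTEM\ControlSet001\Enum\STORAGE\Volume"


def count_volume_entries(subkey=VOLUME_KEY):
    """Return the number of subkeys under the given HKLM key (Windows only)."""
    import winreg  # standard library, only available on Windows

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, subkey) as key:
        subkey_count, _value_count, _modified = winreg.QueryInfoKey(key)
    return subkey_count


def looks_healthy(entry_count, upper_limit=50):
    """Apply the rough 10-to-50 rule of thumb from this post."""
    return entry_count <= upper_limit


# Only touch the registry when actually running on Windows.
if __name__ == "__main__" and sys.platform == "win32":
    total = count_volume_entries()
    print(total, "entries;", "OK" if looks_healthy(total) else "bloated")
```

The script only reads the key, so it is safe to run before and after the cleanup to compare the counts.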

Now, how does the fix work?

The new hotfix from Microsoft makes a change to the Plug & Play Manager that adds a timestamp (or a tombstone date as they prefer to call it) for each new snapshot device that’s created. This means Windows is now aware of exactly when a snapshot was created and can make a decision to fully remove the device’s entry from the registry after a certain period of time. I’m not sure how long it waits but from our experience it can be counted in minutes rather than days or weeks.
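
To illustrate the idea only: this is not Microsoft's code, and both the data structure and the 30-minute retention period below are made up for the example (the real period is unknown to me). A timestamp-based cleanup boils down to comparing each device's tombstone with the current time:

```python
from datetime import datetime, timedelta


def expired_snapshot_devices(tombstones, now, retention):
    """Return the device names whose tombstone is older than `retention`.

    `tombstones` maps a device name to the datetime it was stamped.
    """
    return sorted(name for name, stamped in tombstones.items()
                  if now - stamped > retention)


now = datetime(2010, 5, 25, 12, 0)
tombstones = {
    "HarddiskVolumeSnapshot1": now - timedelta(minutes=90),
    "HarddiskVolumeSnapshot2": now - timedelta(minutes=5),
}
# The 30-minute retention is a hypothetical value, not Microsoft's.
print(expired_snapshot_devices(tombstones, now, timedelta(minutes=30)))
```

Only the snapshot stamped 90 minutes ago is reported as removable; the recent one is left alone.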

So well done to Microsoft for creating a smart solution to a critical problem. It took a massive amount of effort to get our case brought to the attention of the right person. Before that I had spent weeks working with a Microsoft support team in India by phone and remote desktop, trying to explain what the problem was (and no, the problem will not be resolved by re-installing Windows, thank you!). It wasn’t until a Premier Support case was opened, and a Microsoft account manager from the UK got involved and contacted the right technical person, that we started to make rapid progress. Microsoft had been aware of the technical issue for a while, but our case seems to have given them the incentive to fully investigate it. We are grateful that we were finally listened to and that some very expensive new servers can finally be put into service.

Cleaning up the registry

The only problem that remains is for other people who are already experiencing this. You need to clean up your registry before applying the hotfix, otherwise the freezing symptom will persist. To do this you need a tool from Microsoft called devnodeclean. Sadly, based on my Google and Bing searches, this is not available to download anywhere on the Internet. Microsoft should be able to email you a copy if you open a support case with them and refer them to KB982210 (and this blog entry for good measure). Run devnodeclean without any switches at first to see what it makes of your registry, then use the /r switch to force it to remove the unwanted devices from the registry.

Next you may need to compact your registry if it has become huge; anything over 50MB, I would say. Ours peaked at over 450MB. The “system” hive (found in C:\Windows\System32\config\) can be compacted using chkreg with the switches /l /c /r /v. Again, chkreg is only available by request from Microsoft and was in fact developed to repair the registry in Windows 2000, but amazingly it still works for 2008 R2. Please note that chkreg cannot compact a live “system” file. You need to back up your system settings first and run chkreg on a restored copy of the “system” file (restore it to a new folder somewhere else). Then boot from the Windows setup DVD and enter the recovery console. Rename “system” to “system.old” and copy the compacted system file into the config directory. Then reboot into Windows.
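
A quick way to see whether the hive is over that 50MB rule of thumb is to check the file size. This is a minimal Python sketch; the function names are hypothetical, the threshold is just my rule of thumb from above, and reading the config folder may need an elevated prompt:

```python
import os

SYSTEM_HIVE = r"C:\Windows\System32\config\system"
COMPACT_THRESHOLD_MB = 50  # rule of thumb from this post, not an official limit


def hive_size_mb(path):
    """Return the size of a hive file in megabytes."""
    return os.path.getsize(path) / (1024 * 1024)


def needs_compacting(size_mb, threshold_mb=COMPACT_THRESHOLD_MB):
    """True when the hive is larger than the rule-of-thumb threshold."""
    return size_mb > threshold_mb


# Only report when the hive file is actually present and readable.
if os.path.exists(SYSTEM_HIVE):
    size = hive_size_mb(SYSTEM_HIVE)
    print(f"{size:.0f} MB, compact: {needs_compacting(size)}")
```

It only reads the file size, so it is safe to run against the live hive even though the compaction itself has to be done offline.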

15 comments:

  1. Good add Gary. I have seen the same thing and blogged here about our experiences.

    http://bit.ly/cS4ePK

    Rob McShinsky
    http://www.virtuallyaware.com
    http://www.twitter.com/virtuallyaware

  2. Thanks for the comment and link, Rob. Your blog entry says you were battling with Microsoft for a year on this? I thought I had a hard struggle for 6 weeks!

  3. This comment has been removed by the author.

  4. Hello Gary,

    We are having similar issues and the last install of the hotfix actually caused a reboot loop which I'm associating to a large registry that can't be cleaned up by the hotfix. One of our other servers takes about 5 hours to login (44 VMs @ 1 hour backup intervals for 4-5 months) so we desperately need access to this devnodeclean utility. Can you outline where to open a Microsoft support case? All of my searches are not leading to anywhere. Thank you. If you do have detailed information my email is arvand .at. arvixe.com and I'd extremely appreciate it.

  5. Haven't Microsoft made devnodeclean available to download from their website yet? They told me they would. I don't understand why they are so secretive about the executable version. They provide the source code but you have to compile it yourself! I don't get it. I'll see what I can do for you.

  6. You can find a free compiled tool that will remove the phantom devices on my site.

    http://bytesolutions.com/Support/Knowledgebase/KB_Viewer/ArticleId/35/DevNodeClean-remove-phantom-storage-device-nodes.aspx

  7. I am running Windows 2008R2 on VMware ESXI and am using StorageCraft's ShadowProtect to perform VSS backups of my servers hourly and they will randomly lock up on me. After performing a hard reset and looking at the backup log, the last command that I see is ShadowProtect issuing a command to Windows to perform a VSS snapshot. Any chance that this issue is affecting 2008R2 on VMware as well?

  8. Hi Darrell. The issue is that Windows 2008 R2 has a major bug in it and running it as a guest OS doesn't make it exempt. Sorry!

    The hotfix has been included with Service Pack 1 (SP1) and I urge everyone to install it as soon as possible, especially if backups are run several times a day.

    However, I must say that I did not experience a random locking up issue. It would lock up when trying to log in or a few minutes afterwards. The server could run happily for days, but if you try to log in it would process the registry and freeze.

  9. Hi Gary. Great posts! I have recently encountered a similar issue on a Windows Server 2008 R2 machine but it does NOT have Hyper-V installed. The SYSTEM registry hive has ballooned to 220MB and login takes hours.

    In your original post on 15 May 2010, you mention that there were about 24K entries in the key: [HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Class\{533C5B84-EC70-11D2-9505-00C04F79DE2F}\0349]. In the latest post on 25 May 2010, you mention checking HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Enum\STORAGE\Volume for excessive entries. On this server, we have loads of entries under the first key (Class) but not very many under the second (0349). Does the patch/service pack still address this issue? Also, will the devnodeclean utility eliminate the entries under the Class key?

    We're using Windows Server Backup to back up two partitions on this server. Another issue that cropped up (and I believe is related) is that there are more than 450GB of snapshots on the D partition of our server, which have almost completely filled the partition. Unfortunately, vssadmin and diskshadow have been unable to remove them. Enabling/disabling Shadow Copies on the drive has failed to remove them as well.

    Because of this issue, or possibly due to troubleshooting steps, we almost had to reinstall from scratch due to a BSOD Process1_Initialization_Failed (0x0000006B) at startup. Countless hours attempting to start in safe mode and from the installation DVD failed with the same error. Deleting the bootcat.cache file (per MS KB) using a Dell OMSA LiveCD failed to resolve it as well. Only after inserting and booting to the Windows Server 2003 R2 CD in a desperate attempt to access the recovery console and run a disk check (which we were unable to do) were we then able to restart the server and log in. I can only imagine that the 2003 disc somehow repaired a startup issue.

    It's been an unbelievable experience and one I don't wish on anybody. Now I still have to figure out the VSS space issues. Any tidbits of wisdom you or anyone else can supply are greatly appreciated!

    Thank you!

    Dan

  10. Hi Dan. To be completely honest I have no idea if the hotfix will resolve your issue. The one Microsoft devised was based purely on the bug that I describe in my blog. How many backups and on how many drives does your system run a day? What I'm getting at is how many VSS snapshots does your system create a day when it's not running Hyper-V? I don't understand how it could be creating so many and I'm not sure that it is.

    If you run the hotfix it's unlikely to break your server, even if it doesn't address the issue you're experiencing. The service pack for R2 includes this fix anyway, so make sure you have the service pack installed.

    Have you taken a look at your shadow copy / previous versions settings? It might be worth setting it to 0 and deleting it in case it's gone crazy/huge. Sorry I can't help further.

  11. Gary, thanks for the quick follow-up. We managed to get into Windows and everything appears to be in good working order now however the login took a couple of hours (as expected). The SYSTEM hive is now 430MB in size. I'm nervous to run the devnodeclean utility for fear of it causing issues. I think I'll wait until I have a full backup of the server first.

    Windows Server Backup was set to backup once per day to an external drive. I recently installed Crashplan Pro however and suspect that maybe the two are not playing nice together. CP utilizes VSS to backup open files but I have it running on other servers and have not encountered this issue (yet).

    I attempted to disable VSS prior to the restart of the server in order to get rid of the snapshots that were taking all the space. It is currently disabled so there is no space used by VSS at this point.

  12. I wanted to follow-up on my previous posts by letting you know that I was able to use DeviceRemover (pro-it-education.de) to remove the 30,000+ detached/hidden Volume Shadow Copy devices from the Windows 2008 R2 server I was working on. The initial process took 18 hours to complete. I had to run DeviceRemover several more times in order to get rid of as many entries as possible. The end result was that I was left with 10 entries!

    Following that process, I backed up the registry files using ERUNT and ran the NTREGOPT utility (http://www.larshederer.homepage.t-online.de/erunt/) to compact the registry files. This resulted in a SYSTEM hive file size of 219MB down from 432MB. Still large but apparently not large enough to affect the login process because I was able to login without issue or delay following the cleanup. I suspect the NTREGOPT utility was simply unable to fully compact the SYSTEM hive file. I still haven't installed SP1 for Server 2008 R2 yet but will be tackling that this weekend.

    I hope this information helps others who might encounter the same issue.

    Dan

  13. I'll second Dan's reply above. I work for an MSP that manages about 300 servers...we run Device Remover in Hidden/Detached mode regularly on all of our host servers, and it works wonders.

    John Anderson
    MCITP:EA

  14. Hi Gary,

    Thank you for your great information! I have been using devnodeclean from bytesolutions for a couple of weeks on our Server 2008. Rebooting is much faster now (from about an hour down to about 20 minutes)!

    I still find several thousand strange entries at
    HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\DeviceClasses\{53f5630d-b6bf-11d0-94f2-00a0c91efb8b}
    such as
    \##?#STORAGE#VOLUMESNAPSHOT#HARDDISKVOLUMESNAPSHOT1000#{53f5630d-b6bf-11d0-94f2-00a0c91efb8b}

    Seems that they are orphaned, but not removed by devnodeclean. Do you have any idea?

  15. If it takes your server 20 minutes to boot after following the processes I described then obviously there is a problem somewhere. Have you installed SP1? If not, you should. Having thousands of DeviceClasses entries isn't right, and I don't know if it's related to the same bug.

    You could take a gamble and delete half of them to see if it speeds up Windows boot time, but you'd risk messing up your server if they were genuinely required. If you know how to back up your registry and restore it from the Windows recovery CMD prompt, and can afford some downtime, then it's worth giving it a try. (At your own risk, of course.)
