Catching up a bit on other people’s posting and wanted to share this one from Guillaume Fourrat.
Today I’m going to highlight one of the changes brought by SQL Server 2012 SP2, which is the way we handle VSS Backup requests on AlwaysOn Secondary Databases.
Until now, any request for a FULL database backup (VSS_BT_FULL) thru VSS against a DB that is an AlwaysOn secondary was failing by design. Our VSS Writer SQLWriter would return FAILED_AT_PREPARE_SNAPSHOT (0x800423f4 – VSS_E_WRITERERROR_NONRETRYABLE).
A copy-only VSS backup (VSS_BT_COPY) would work.
The rationale being the following: a FULL backup is actually updating the target DB (reset of differential bitmap mainly), which is not possible when the DB is read only. Furthermore, because of the failover possibilities introduced by AlwaysOn, the favored option was to use Native SQL Servers backup that could rely on SQL Server variable backup location (http://msdn.microsoft.com/en-us/library/hh245119.aspx) if needed, and be ‘alwayson –aware’.
So that could be the end of the story: against an AlwaysOn Secondary DB, either use Copy_only VSS backups or use native backups.
But of course that wouldn’t make for a very interesting blog post…
Consider the following scenario:
Large Windows HyperV Servers, hosting many HyperV virtual Servers, some of them SQL Servers in Always On architecture.
In short: a Private Cloud.
In this context, the IT usually takes care of the infrastructure at host level, and lets users deal with whatever happens within the VMs. One of the key tasks of IT is to manage backups (eg. for disaster recovery at datacenter level, or to provide restore of single VMs).
And the mainstream way to do that is to take VSS backups of the Host Disk Volumes. Microsoft System Center DPM will do exactly that.
But VSS backups are all about taking backups that are consistent: in ‘standalone’ SQL Server context you may already know all the logic SQLWriter implements to make sure that IO against the Databases that are backed up are frozen during the snapshot operation. So, back to our HyperV context, collecting a point-in-time image of a VHD without bothering with what happens within the VM would be defeating that very purpose right?
So what happens is the following: the VSS backup is propagated to Guest VMs thru HyperV integration services. That propagation hardcodes the backup type to VSS_BT_FULL, and therefore all guest VMs are initiating a VSS backup/snapshot in their own context. The purpose is to make sure that all applications are quiesced within all running VMs at the time we take the snapshot at the host level. This will enable us to generate a consistent backup of running VMs.
But let’s now put this in the context where one of the VMs is running an AlwaysOn secondary DB: you guessed it, it’s not going to work:
The important thing to know here is that the error returned by SQLWriter in VM3 will actually bubble up all the way to the initial VSS backup command at Host level, and will make it fail as a whole.
So we ended up in a situation where the IT infrastructure folks would see their Host backups failing from time to time for an unknown reason, depending on whether one or more of the VM present on the Host Drive being backup up had a secondary AlwaysOn DB! It could be that the AlwaysOn AG spans different HyperV hosts and therefore that the presence of a Secondary DB on a given Host is not something static over time.
Because of the complexity of the whole call chain, and because infrastructure IT operators may not have any visibility (or understanding) of the VM content, you can imagine what kind of troubleshooting challenges this would offer… And even when the situation is understood, well, what do we do? If host level backup must be somehow manually synchronized to the applicative state of Guest VMs, the Private Cloud scenario becomes quite more complicated all of a sudden.
This is the reason why SQL Server 2012 SP2 ships a code change for SQLWriter that will implement the following:
As you can see, SQLWriter now detects this specific situation and changes the backup type to VSS_BT_COPYONLY. This will only happen for VSS_BT_FULL backups against AlwaysOn secondary Dbs. VSS_BT_FULL backups against primary DB will happen without change.
In this case, the VSS backup will now successfully complete in VM3 and the host-level backup success state will no longer be tied to guest VM’s AlwaysOn activity. Private Cloud scenario unlocked!
Important note: the fact that VSS backup of AlwaysOn secondaries now works does not make it the preferred solution to backup SQL Server AlwaysOn architectures. The main purpose of the SP2 change is to avoid a situation where a single SQL Server in a VM fails a complete host-level backup operation that encompassing dozens of VMs.
The resulting backup for the VM hosting SQL should be considered a Disaster Recovery one, where AlwaysOn will be removed at once at restore time, not as a way to rebuild a subset of the nodes for an AlwaysOn Availability group. And for regular databases within the VM, that backup is as good any regular VSS one.
Finally, SQL Server 2012 SP2 only contains a partial fix for this issue. Servers running case-sensitive sort orders will require SQL Server 2012 SP2 Cumulative Update 2.
SQL Server Escalation Engineer
I really like the post for the reasons listed below:
- It is not something about a stupid query that can not be written a different way as it is embedded in the code or generated via an ORM Tool ( Entity Framework, Hibernate )
- That is, it can be changed and optimized
- Furthermore, we are not handicapped by what the developer thinks is fanciful and should be upheld by all means
- It affects a whole lot of parties
- 3rd Party tools vendors that utilize VSS to backup SQL Server databases
- Corporations that do not know or care what VSS is or does, they do not have a DBA and could care less that sql server datafiles can not just be backed up like other files
- The Application logs an error
- The application actually logs the error and it is the hope that support staff will capture & review the errors periodically
- MSFT can and has provided a fix
- MSFT has provided a fix and that means they are aware and tracking it
- The patch will be brought into the main line code and all future versions will benefit going forward
- If it breaks in subsequent version, we have every right to open a ticket thru Corporate Support or as a Connect Item
- And, above all else this is an actual technical problem
- It is not a process problem
- It is not a people problem
- It is not one handicapped by long institutional or inter-department slight or grievance