Over the past few weeks I have been involved in a VMware consolidation project for a client of mine whose datacenter has grown to over 100 physical servers. As expected, the client is ecstatic with the results of ESXi 5 and the rate at which they have been decommissioning physical servers. For disaster recovery purposes I recommended Site Recovery Manager 5 (SRM 5) for virtual machine failover. Several new features appealed to them very much, one being the new “host based replication” (vSphere Replication), which allows you to replicate a VM that is located on local storage. Since the client has some ESXi servers that are not hooked up to a SAN, this was very appealing to them. Sure, there are third-party applications from Vizioncore and Veeam that have offered this for a while, but having the ability to stick with VMware as much as possible was a big incentive. SRM is also more attractive to SMBs thanks to its refined interface, tighter integration, and pricing that is easier to swallow.
The ability to “test” a recovery plan was the tipping point in closing the SRM 5 deal from a business case standpoint. A “Recovery Plan” is essentially your pre-configured playbook for failing over a single virtual machine or a group of them. In a Recovery Plan you can dictate many things, including changing the IP address of a VM upon failover, VM dependencies, pre/post power-on steps, etc. Inside a Recovery Plan sits the “Protection Group”. A PG is a placeholder for a group of virtual machines and for the form of replication used to ship them over to the recovery site (vSphere Replication or Array Based Replication).
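To make the relationship between the two objects concrete, here is a minimal sketch of a Recovery Plan that contains a Protection Group. This is a toy data model for illustration only, not SRM's API or storage format. The class names, field names, VM names (`app01`, `db01`) and the IP address are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Replication(Enum):
    VSPHERE = "vSphere Replication"
    ARRAY = "Array Based Replication"


@dataclass
class ProtectionGroup:
    """A group of VMs plus the replication type that ships them to DR."""
    name: str
    replication: Replication
    vms: List[str]


@dataclass
class RecoveryPlan:
    """The playbook: protection groups plus per-VM failover settings."""
    name: str
    protection_groups: List[ProtectionGroup]
    # VM name -> IP address to apply at the recovery site (hypothetical field)
    ip_on_failover: Dict[str, str] = field(default_factory=dict)
    # VM name -> VMs that must be powered on before it (hypothetical field)
    depends_on: Dict[str, List[str]] = field(default_factory=dict)

    def vms(self) -> List[str]:
        return [vm for pg in self.protection_groups for vm in pg.vms]

    def power_on_order(self) -> List[str]:
        """Return the VMs ordered so that dependencies start first."""
        known = set(self.vms())
        ordered: List[str] = []

        def visit(vm: str, path: tuple) -> None:
            if vm in ordered:
                return
            if vm not in known:
                raise ValueError(f"{vm} is not in any protection group")
            if vm in path:
                raise ValueError(f"circular dependency at {vm}")
            for dep in self.depends_on.get(vm, []):
                visit(dep, path + (vm,))
            ordered.append(vm)

        for vm in self.vms():
            visit(vm, ())
        return ordered


plan = RecoveryPlan(
    name="poc-plan",
    protection_groups=[
        ProtectionGroup("pg-apps", Replication.VSPHERE, ["app01", "db01"])
    ],
    ip_on_failover={"app01": "10.2.20.45"},
    depends_on={"app01": ["db01"]},
)
print(plan.power_on_order())  # ['db01', 'app01']
```

The point of the sketch is the containment: the plan holds the failover behavior (IP changes, dependencies), while the protection group only says which VMs are covered and how they get replicated.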
So, here’s the scenario. Your CTO comes to you and states that management is requiring the IT department to immediately perform a disaster recovery test on several important servers. The major challenge here is that the business group in charge of these machines states that a maintenance window cannot be established for several weeks due to the overload of work from year-end processing. Of course, IT is once again stuck in the middle of the politics. You inform your CTO that since your firm has adopted virtualization and SRM, you now have the ability to perform a “test” failover of said virtual machines. In the recovery plan we have configured SRM to create a private vSwitch, which gives us a “bubble network” that prevents the replicated machines from disturbing the production servers! With all of these features we don’t have to wait for a maintenance window and will be performing the test on a Monday afternoon. Here we go.
- Production Site -> New York
- Disaster Recovery -> New Jersey
- Standalone vCenter server in each site.
- In this case the VMs are replicated to New Jersey via the new vSphere Replication with SRM5.1.
1. Below you can see both vCenter server instances. On the right we have the production site and on the left is the DR site. I had to remove some references to keep the client's identity private. As you can see on the right, the machine in question, “TTNAPPS”, is on and in use by users. If you follow the arrow to the left you will see the reference for the replicated machine in the NJ vCenter server.
2. Under the SRM “Recovery Group” setting for TTNAPPS you can see the IP Customization screen. I have preconfigured the machine to change IPs to the DR subnet upon failover. (Note: To cut down recovery time compared with previous versions, SRM 5 does not use sysprep or customization specs. Instead, networking information is injected through a VIX API call pushed through VMware Tools in the VM. This results in a much quicker process than the conventional sysprep method.)
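When planning the IP change for more than a handful of machines, it can help to work out the DR address from the production address ahead of time. The sketch below is only a planning helper, not how SRM applies the change, and it assumes both sites use subnets of the same size and that each VM keeps its host portion. The function name and the example subnets are hypothetical.

```python
import ipaddress


def remap_to_dr(ip: str, prod_net: str, dr_net: str) -> str:
    """Keep the host offset of `ip` and move it into the DR subnet."""
    prod = ipaddress.ip_network(prod_net)
    dr = ipaddress.ip_network(dr_net)
    addr = ipaddress.ip_address(ip)

    if addr not in prod:
        raise ValueError(f"{ip} is not inside {prod_net}")
    if prod.prefixlen != dr.prefixlen:
        raise ValueError("production and DR subnets must be the same size")

    # Offset of the host inside the production subnet, e.g. 45 for x.x.x.45/24
    offset = int(addr) - int(prod.network_address)
    return str(dr.network_address + offset)


# Hypothetical subnets: New York production and New Jersey DR
print(remap_to_dr("10.1.20.45", "10.1.20.0/24", "10.2.20.0/24"))  # 10.2.20.45
```

The resulting address is what you would then enter by hand on the IP Customization screen for each VM.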
3. Here we are in our proof-of-concept Recovery Plan that has one VM in it: TTNAPPS. We can go ahead and hit “Test” to move forward with the test failover. The protected and recovery sites look good. We checked the box to make sure the passive TTNAPPS VM in the DR site is up to date by replicating the delta since the last sync. Hit “Next”.
4. On the next screen we can go ahead and hit “Start” to commence the test.
5. Less than 2 minutes later you can see that “TTNAPPS” is now online at the DR site and still up and running in the production site.
The IP address changed automatically, as we planned.
A new vSwitch was created automatically in the DR site to form our isolated network test bubble, and TTNAPPS was automatically added to it! If you need to, you can add an uplink to connect the bubble to the rest of your network over a VLAN that is dedicated to your bubble network.
6. Once your DR test is completed you can go back to your recovery plan and hit “Cleanup” to reverse the entire process. This will put everything back to the way it was.
All metadata is now cleaned up and the DR server is back in a passive state.
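The test and cleanup cycle walked through above can be summed up as a small state machine. This is a simplified toy model of what we observed in the walkthrough, not SRM's actual implementation or API. The class names, state names and the vSwitch name `test-bubble` are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class DrSite:
    """Toy view of the recovery site: what is powered on, which vSwitches exist."""
    powered_on: Set[str] = field(default_factory=set)
    vswitches: Set[str] = field(default_factory=set)


@dataclass
class PlanRun:
    vms: List[str]
    site: DrSite
    state: str = "ready"
    bubble: str = "test-bubble"  # hypothetical name for the isolated vSwitch

    def test(self) -> None:
        if self.state != "ready":
            raise RuntimeError(f"cannot test from state {self.state!r}")
        # Isolated switch first, then power on the replicated copies on it.
        self.site.vswitches.add(self.bubble)
        self.site.powered_on.update(self.vms)
        self.state = "test complete"

    def cleanup(self) -> None:
        if self.state != "test complete":
            raise RuntimeError(f"cannot clean up from state {self.state!r}")
        # Reverse the test: power off the copies and remove the bubble switch.
        self.site.powered_on.difference_update(self.vms)
        self.site.vswitches.discard(self.bubble)
        self.state = "ready"


site = DrSite()
run = PlanRun(vms=["TTNAPPS"], site=site)
run.test()
print(run.state, site)
run.cleanup()
print(run.state, site)
```

Note that nothing in the model touches the production site, which is the whole appeal of a test failover: the production TTNAPPS stays online throughout.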
As you can see, SRM 5 is an incredibly powerful weapon to have in your arsenal. In this article we have only scratched the surface of the capabilities of SRM 5. The ability to test your disaster recovery plan while leaving your production machines intact is something every firm can benefit from. Remember, we did this without the use of a SAN. The SMB market can now take advantage of features that were previously only available to firms with big IT budgets.