Even though Exchange 2010 has been out for a while, I have seen very little documentation on the Internet that walks you through a “hole in the ground” data center failure scenario. I think most people have not had to live through one. Since I mainly work with clients in the financial industry, these recovery tests have to be performed every quarter for due diligence reasons. I also run one as part of regression testing before we start placing users on a 2010 infrastructure. The TechNet articles are great, but they can be a little wordy for the person who just wants a step-by-step “how to” guide. So, I will outline in a few easy steps how to get through this type of scenario.
ORGANIZATION DETAILS – 3cV Inc. has two sites, Site A (New Jersey) and Site B (DR, Disaster Recovery), with bi-directional communication over a WAN link. New Jersey is the production site and DR is our disaster recovery location, where all data gets replicated to. One Exchange 2010 server holding all roles exists in each location, and both sites have their respective topologies configured in Active Directory Sites & Services. My DAG is configured in DAC (Datacenter Activation Coordination) mode. (If your DAG is not DAC enabled, it should be!)
Site A:
3cV-NJ-EX01 – Exchange 2010 – Database – 3cV-NJ-MBXDB01
3cV-NJ-FSDC01 – Domain Controller
Site B:
3cV-DR-EX01 – Exchange 2010 – Houses passive copy of 3cV-NJ-MBXDB01
3cV-DR-FSDC01 – Domain Controller
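For readers who want to confirm the DAC setting before relying on the steps that follow, a minimal sketch from the Exchange Management Shell might look like this. It assumes the DAG name 3cV-DAG01 used later in this post; adjust it for your environment.

```powershell
# Show whether DAC mode is enabled on the DAG (the value is Off or DagOnly).
Get-DatabaseAvailabilityGroup -Identity 3cV-DAG01 |
    Format-List Name, DatacenterActivationMode

# Enable DAC mode if the value above is Off.
Set-DatabaseAvailabilityGroup -Identity 3cV-DAG01 -DatacenterActivationMode DagOnly
```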
SCENARIO – A blizzard over New Jersey has knocked out power to the data center and all servers are completely offline. Users are working remotely and Outlook will not start because the Exchange server is unreachable. Since there is no timeframe for power to be restored, 3cV Inc. management has made the decision to activate the disaster recovery site. Management wants e-mail restored before anything else. The picture below shows that we are remotely connected to the DR server (3cV-DR-EX01), that the NJ servers are completely unreachable, and that the Exchange database 3cV-NJ-MBXDB01 is offline.
ACTIVATING THE DR DATACENTER
1. Stop the cluster service on all mailbox servers in the DR site. In this case only one server exists in the DR site so I will go ahead and stop the cluster service on 3cV-DR-EX01.
2. Next, we have to change the configuration so the DAG knows to ignore the NJ site and put the respective servers in a stopped state.
Stop-DatabaseAvailabilityGroup -Identity 3cV-DAG01 -ActiveDirectorySite "3cV-NJ" -ConfigurationOnly
3. Now, we have to activate the DAG members in the DR site.
Restore-DatabaseAvailabilityGroup -Identity 3cV-DAG01 -ActiveDirectorySite "3cV-DR" -AlternateWitnessServer 3cV-DR-SQL01 -AlternateWitnessDirectory c:\FSW\3cV-DAG01.3cVguy.local
4. Databases should now mount up on the DR server.
5. You should now change your DNS records to reflect the DR site being active. (CASArray, Autodiscover, and OWA) Once this is completed all users should now be able to connect to the DR site without having to touch any configuration in Outlook. (Time will vary depending on your internal/external DNS TTL settings)
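The post does not give a command for step 1 or a shell check for step 4, so here is a hedged sketch of both, run on 3cV-DR-EX01. It assumes the server and database names from this post; ClusSvc is the standard service name of the Windows Cluster service. Treat it as one possible way to do this rather than the only one.

```powershell
# Step 1 (run before the Stop-/Restore-DatabaseAvailabilityGroup commands):
# stop the Cluster service on the local DR mailbox server.
Stop-Service -Name ClusSvc

# Step 4 (run after Restore-DatabaseAvailabilityGroup has completed):
# confirm that the database is mounted and see which server hosts it.
Get-MailboxDatabase -Identity 3cV-NJ-MBXDB01 -Status |
    Format-List Name, Mounted, MountedOnServer
```

If Mounted still shows False after a few minutes, the Exchange console view described in this post remains a good place to look for the copy status.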
RESTORING THE PRIMARY DATACENTER
The power has been restored in the primary datacenter and all servers are back online. We have given Active Directory a chance to catch up on replication. Because DAC mode was configured on the DAG, no split-brain condition occurred when the NJ Exchange server powered on. The picture below shows that all NJ servers are now reachable from the DR datacenter.
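Before starting the steps below, it can help to confirm from the shell that Active Directory has caught up and that the DAG still lists the NJ server as stopped. This is a minimal sketch, assuming the DAG name from this post and that repadmin is available on the machine you run it from (it ships with the AD DS tools).

```powershell
# Summarize Active Directory replication health across the domain controllers.
repadmin /replsummary

# List which DAG members are currently marked as started and as stopped.
Get-DatabaseAvailabilityGroup -Identity 3cV-DAG01 |
    Format-List Name, StartedMailboxServers, StoppedMailboxServers
```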
1. We have to tell the DAG that the production site is back online and the stopped servers should now be started.
Start-DatabaseAvailabilityGroup -Identity 3cV-DAG01 -ActiveDirectorySite "3cV-NJ"
2. Going back to the Exchange console you should refresh the Database Management view. After a few seconds you will see that the NJ database is now the Healthy replica and all of the changes that occurred at the DR site will now replicate back to the NJ database. (If the replica says “Failed” instead of healthy give it a few minutes and try refreshing again.) Make sure the “Copy Queue Length” & “Replay Queue Length” are both at “0”.
3. Now when you are ready you can move the database back to the production site, in my case New Jersey (3cV-NJ-EX01) by using the following command.
Move-ActiveMailboxDatabase 3cV-NJ-MBXDB01 -ActivateOnServer 3cV-NJ-EX01
The picture shows that the database is now active in NJ.
4. Now you should change the DNS records that we changed before to point back to the production site. Once again, these records may vary depending on your configuration and whether you have split-brain DNS configured. Your users should be able to connect to the primary datacenter once the records propagate.
5. I like to move the Cluster Core Resource (PAM) role back to the primary site through the command prompt.
cluster.exe 3cV-DAG01 group "cluster group" /moveto:3cV-NJ-EX01
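The console checks in the steps above can also be done from the Exchange Management Shell. The sketch below assumes the database and DAG names used in this post and is meant as a convenience, not a replacement for the console view.

```powershell
# Before step 3: the NJ copy should be Healthy and both queue lengths should be 0.
Get-MailboxDatabaseCopyStatus -Identity 3cV-NJ-MBXDB01 |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength -AutoSize

# After step 5: show which server currently holds the PAM role.
Get-DatabaseAvailabilityGroup -Identity 3cV-DAG01 -Status |
    Format-List Name, PrimaryActiveManager
```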
There are many other failure scenarios, such as dealing with a partially terminated production datacenter, that call for slightly different steps. Make sure you understand, document, and test all of the failure scenarios that can occur in your particular environment.