After listening to Greg Shields’ session at VMworld on the top mistakes made when deploying HA and DRS, I was inspired to create my own list focusing on Exchange 2010. The list and its ordering are based on my own experiences, so for someone else number 6 could easily be number 1, especially since I threw in a few items that only apply to virtualized environments. Don’t be shy; leave a comment if you feel I should add any other mistakes you have seen in Exchange 2010 deployments. Without further ado, here is the list. Hope this helps someone!
- Not accounting for DAG failure scenarios. A DAG design is not as simple as running through the wizard, replicating the databases, and walking away. When setting up a DAG you must take into account more than a single server going offline: run through scenarios that include WAN link failures and other hiccups that can occur between sites. If your passive site hosts active users, consider a separate DAG, because link hiccups can cause those production databases to dismount to avoid split-brain syndrome. Also, understand the minimum number of servers (votes) your DAG needs to maintain quorum (a quick way to check membership and the witness is sketched after this list). You do not want databases dismounting unexpectedly or moving to servers they shouldn’t be on. Make sure you understand how the DAG works and document all of the failure scenarios and failure domains.
- Assuming site failover is seamless. Performing what Microsoft calls a ‘datacenter switchover’, activating databases in a DR location because your production site has gone offline, is not something you want to learn on the fly. You should be performing quarterly DR tests, or at least run through the steps a few times a year. Datacenter switchovers require manual intervention, and it is best to be ready and have multiple people trained, since the steps also require PowerShell knowledge; having a GUI-centric admin on standby during this type of scenario can be a scary thing. A well-prepared playbook (the core commands are sketched after this list) can have your databases up and running in the second site in minutes rather than hours.
- Replacing backups with DAGs. Contrary to what Microsoft says, for 99% of firms out there I do not believe that having multiple copies of a database in different locations replaces a traditional backup strategy, even when taking advantage of lagged copies (an example of creating one follows the list). Backups performed by Data Protection Manager, Symantec Backup Exec, or Windows Server Backup are a point-in-time snapshot of the database. I have been in scenarios where a client needed to restore a message from a year prior; most lagged copies won’t reach back that far. And if all of your up-to-date copies are offline, the last thing you want is to bring Exchange back from a lagged copy with days of information missing. The point is to have a proper backup strategy that backs up your databases every single night so you can sleep at night.
- Overcommitting memory and swapping. When it comes to virtualizing Exchange 2010, make sure you back the vRAM you assign to the VM with enough underlying physical RAM on the ESX host. Also, set a full memory reservation on the Exchange VMs rather than leaving them exposed to ballooning and host swapping (an example follows the list). The last thing you want is a mailbox server running off of a swap file; count on many help desk calls and unhappy users if you do this. Exchange 2010 will eat up as much RAM as you give it, so make sure this is properly dialed in.
- Not utilizing the MS storage calculator. Every Exchange architect out there needs to know about the “Mailbox Server Role Requirements Calculator”. This invaluable tool will help guide you on how many servers your environment requires, and it will also guide you on HA, storage, network, and backup requirements. Using it allows you to properly size the project up front so you are not in the hot seat with management explaining why Exchange is not performing as expected, which then translates into additional money to fix the issues. No one wants to be put in that very uncomfortable position.
- Open relay. Having an open relay allows anyone to connect to your Exchange server and send mail through it. This puts you on the fast track to being blacklisted, which takes a lot of painstaking work to undo. Exchange 2010 makes relaying a little harder to enable than previous versions did; it has to be granted with a PowerShell command, and you should always understand why you are implementing it. I like to use a separate receive connector for relaying, with a well-defined description, so another administrator doesn’t come along and add a subnet that was meant for a different type of connector (a sample connector is sketched after this list).
- Placing Exchange DBs and logs on thin volumes. In my experience, thinly provisioned volumes are not a good match for a highly active application such as Exchange. Thin provisioning lets the array present a “phantom volume” to Windows: your storage admin carves out a 50GB LUN, you mount it, and Windows shows the full 50GB with no sign of what is actually happening on the storage array side. Because the LUN is thin, the array only allocates blocks as they are used, so as Exchange keeps writing to the volume the array has to zero out and reserve each new block before it can complete the write. That can be a big performance hit for Exchange. Make sure you tell your storage admin not to enable thin provisioning on the LUNs assigned to Exchange.
- Not creating a CAS array from the get-go. CAS arrays are designed to complement load-balanced CAS servers, which are now the MAPI endpoint for Outlook connections. Think of the array as a virtual address that Outlook uses for mailbox connectivity: if one CAS server fails and your load balancer redirects incoming Outlook connections to the surviving CAS, the clients are still pointing at the same virtual DNS name, which is still alive. Configuring a CAS array also makes failover easier; you can set a low TTL on the DNS record and update it on the fly, which in a datacenter switchover provides a much quicker recovery. Even if you don’t initially have a load-balanced CAS solution, take the few minutes to set this up (the two commands are sketched after this list), because a messaging environment never shrinks, it only grows. You never know what the future may hold. I have had firms tell me they expect no growth for years only to double in size, and you will be the hero if you planned for it. It is much easier and less intrusive to set this up from the start than to worry about updating Outlook profiles down the road.
- Letting databases get too big. Exchange 2010 supports databases up to 2TB, but sizes like that can wreak havoc on backup and restore times; tape restores typically take about twice as long as the backup did. Keeping databases at around 100GB or so allows for quicker backups and restores (a quick size check follows the list). Also, in the event of a database outage, having fewer users per database means fewer people are down.
- Not paying attention to a few critical performance counters. Monitoring a handful of important performance counters lets you catch problems before the users notice them. If users complain about Outlook balloon notifications about connectivity to Exchange, or about general slowness, head over to Perfmon; it can tell you exactly what is going on and point to where the problem lies (a sample query is sketched after this list). I did a previous blog post about the counters you should be paying attention to. http://3cvguy.blog.com/2011/06/26/exchange-2010-critical-performance-counters/
- Not assigning static IPs to the DAG. By default the DAG will pick up an IP address through DHCP, but it is best practice to assign a static address (and create a DHCP reservation or exclusion if required). You don’t want to run into a situation where your DHCP scope fills up and your DAG cannot obtain an address. With SP1 you can set this through the GUI, or with one line of PowerShell (shown after this list).
- Not separating replication traffic. In larger Exchange environments you should assign separate physical NICs for MAPI and replication traffic. If everything shares one NIC, seeding a few database copies can flood it, disturbing Outlook MAPI traffic and even interrupting heartbeat traffic between the servers (a DAG network example follows the list).
- Not dedicating VMFS volumes exclusively to Exchange. When it comes to VMware and Exchange, it is always advisable to reserve VMFS volumes for Exchange only; this becomes even more important in medium to large deployments. Exchange is a very hungry application and you don’t want it fighting with another service for I/O.
- Not formatting VMDKs as eagerzeroedthick. This coincides with #7 on my list (the thin provisioning item): it is best to zero out the disk when it is created so the zeroing doesn’t happen during production writes. In the VMware world this is the “eagerzeroedthick” format. It takes a little longer to create but provides better performance in the long run. (Think of it as unchecking “Quick Format” in Windows.) An example of creating such a disk follows the list.
- Not loosening the cluster heartbeat thresholds when using vMotion. If your Exchange 2010 servers will participate in any vMotion migrations, you should adjust the cluster failover threshold settings. When you vMotion a server there is a brief moment where connectivity is lost, and in some environments that can trigger a DAG failover if you don’t raise the thresholds a little. The relevant cluster properties are SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, and CrossSubnetThreshold (example commands after this list).
- Not utilizing Jetstress and Loadgen before production. Jetstress is a Microsoft tool designed to simulate JET database I/O against your storage subsystem; it helps you understand the capacity and reliability of your DAS or SAN and whether it will hold up under your Exchange load in production. Loadgen simulates client workload (Outlook, ActiveSync, OWA, POP, SMTP) against your servers, and you can configure it to simulate online or cached mode with different versions of Outlook. As I stated earlier, you want to architect the Exchange solution right the first time around; these tools are absolutely critical in any Exchange deployment and should never be skipped.
- Ignoring DAG latency. Microsoft states that round-trip link latency between DAG members should not exceed 500ms; anything beyond that can result in copies falling behind (a quick queue check follows the list). Always make sure your network is able to support your environment.
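Here are the sketches referenced above. All server, DAG, database, subnet, and DNS names are placeholders; adapt them to your own environment and test before relying on them.

For the DAG failure-scenario item, this minimal check from the Exchange Management Shell shows the members, witness, and current Primary Active Manager of a DAG so you can reason about how many votes you can afford to lose (DAG1 is a placeholder name):

```
# Show DAG membership, witness server, and which node currently owns the Primary Active Manager role
Get-DatabaseAvailabilityGroup -Identity DAG1 -Status |
    Format-List Name, Servers, WitnessServer, WitnessShareInUse, PrimaryActiveManager
```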
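For the datacenter switchover item, here is a rough outline of the core Exchange 2010 commands based on Microsoft’s documented switchover steps. Treat it as a sketch to build your own playbook from, not the playbook itself; SiteA is the failed primary site, SiteB is the DR site, and your environment will likely also need DNS and witness changes:

```
# 1. Mark the failed primary site's DAG members as stopped (run from the surviving site)
Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite SiteA -ConfigurationOnly

# 2. On each surviving DAG member in the DR site, stop the Cluster service
Stop-Service ClusSvc

# 3. Shrink the cluster to the DR site members and restore quorum there
Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite SiteB
```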
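For the backup item, a lagged copy is just a database copy with a replay lag. A minimal example (DB01, MBX3, and the seven-day lag are placeholders) looks like this; note that it still only reaches back as far as the lag window, which is exactly why it doesn’t replace nightly backups:

```
# Add a database copy on MBX3 that replays logs seven days behind the active copy
Add-MailboxDatabaseCopy -Identity DB01 -MailboxServer MBX3 `
    -ReplayLagTime 7.00:00:00 -ActivationPreference 3
```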
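For the memory item, assuming VMware PowerCLI is installed and connected to vCenter, something along these lines reserves the VM’s full memory so the host never balloons or swaps it. EXCH-MBX01 and the 24GB figure are placeholders; set the reservation to match the vRAM you assigned, and check the exact parameter names against your PowerCLI version:

```
# Reserve 24 GB (24576 MB) of host memory for the Exchange mailbox VM
Get-VM -Name "EXCH-MBX01" |
    Get-VMResourceConfiguration |
    Set-VMResourceConfiguration -MemReservationMB 24576
```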
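For the open relay item, the usual Exchange 2010 pattern is a dedicated receive connector scoped to only the application servers that need to relay, with anonymous relay rights granted explicitly. The server name, IP addresses, and connector name below are examples; keep the name and description obvious so nobody reuses the connector for something else:

```
# Dedicated relay connector scoped to two application servers only
New-ReceiveConnector -Name "App Relay - authorized app servers ONLY" -Usage Custom `
    -Server HUB01 -Bindings 0.0.0.0:25 -RemoteIPRanges 10.1.1.50,10.1.1.51

# Allow anonymous connections, then grant them the right to relay to any recipient
Set-ReceiveConnector "HUB01\App Relay - authorized app servers ONLY" -PermissionGroups AnonymousUsers
Get-ReceiveConnector "HUB01\App Relay - authorized app servers ONLY" |
    Add-ADPermission -User "NT AUTHORITY\ANONYMOUS LOGON" `
    -ExtendedRights "ms-Exch-SMTP-Accept-Any-Recipient"
```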
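For the CAS array item, it really is only two commands. The FQDN and site name are placeholders; create the matching DNS A record yourself and point it at your CAS server or load balancer VIP:

```
# Create the CAS array object for the AD site and point every database at it
New-ClientAccessArray -Name "outlook.contoso.com" -Fqdn "outlook.contoso.com" -Site "Default-First-Site-Name"
Get-MailboxDatabase | Set-MailboxDatabase -RpcClientAccessServer "outlook.contoso.com"
```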
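For the database size item, a quick way to keep an eye on sizes (and last backup times) from the shell:

```
# List databases with their current size and last full backup time
Get-MailboxDatabase -Status | Select-Object Name, DatabaseSize, LastFullBackup
```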
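For the performance counter item, my earlier post has the full list, but as a flavor of what a quick check looks like, something like this samples a few of the usual suspects on a mailbox server (the counter names assume the Exchange 2010 information store counters are present on the box):

```
# Sample store RPC latency/requests, disk read latency, and free memory every 5 seconds, 12 times
$counters = "\MSExchangeIS\RPC Averaged Latency",
            "\MSExchangeIS\RPC Requests",
            "\LogicalDisk(*)\Avg. Disk sec/Read",
            "\Memory\Available MBytes"
Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12
```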
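For the static DAG IP item, in SP1 this lives on the DAG property page, and it is a one-liner in the shell. DAG1 and the addresses are placeholders; supply one IP per subnet the DAG stretches across:

```
# Assign static cluster IP addresses to the DAG (one per MAPI subnet)
Set-DatabaseAvailabilityGroup -Identity DAG1 -DatabaseAvailabilityGroupIPAddresses 10.0.1.50,10.0.2.50
```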
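For the replication traffic item, once the second NICs are in place on their own subnet you can tell the DAG which network carries replication and which carries MAPI. The network names and subnet below are examples; Exchange usually auto-creates DAG networks on its own, so in practice you may only need the Set- command against the existing names:

```
# Define a dedicated replication network and keep replication off the MAPI network
New-DatabaseAvailabilityGroupNetwork -DatabaseAvailabilityGroup DAG1 -Name "ReplNet" `
    -Subnets 192.168.50.0/24 -ReplicationEnabled:$true
Set-DatabaseAvailabilityGroupNetwork -Identity DAG1\MapiNet -ReplicationEnabled:$false
```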
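For the eagerzeroedthick item, you can pick the format in the vSphere client when adding a disk, or do it with PowerCLI along these lines (the VM name and size are placeholders):

```
# Add a 500 GB eagerzeroedthick data disk to the Exchange VM
New-HardDisk -VM (Get-VM -Name "EXCH-MBX01") -CapacityGB 500 -StorageFormat EagerZeroedThick
```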
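For the vMotion heartbeat item, these are the commonly recommended values when DAG members get vMotioned; run them from an elevated prompt on any DAG member (DAG1 is the cluster/DAG name) and verify the numbers against current VMware and Microsoft guidance for your versions:

```
# Relax the cluster heartbeat tolerances so a brief vMotion stun doesn't trigger a failover
cluster /cluster:DAG1 /prop SameSubnetDelay=2000
cluster /cluster:DAG1 /prop SameSubnetThreshold=10
cluster /cluster:DAG1 /prop CrossSubnetDelay=4000
cluster /cluster:DAG1 /prop CrossSubnetThreshold=10
```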
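And for the DAG latency item, an easy way to see whether copies are keeping up is to watch the copy and replay queues:

```
# Copies with growing CopyQueueLength or ReplayQueueLength are falling behind
Get-MailboxDatabaseCopyStatus * | Format-Table Name, Status, CopyQueueLength, ReplayQueueLength -AutoSize
```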
So there you have it, the 17 biggest mistakes that I feel are made in Exchange 2010 deployments. Once again, please leave a comment with your experiences as I am sure there are many more that are just as important.