Over the past several months we have been working closely with a client here in Manhattan who has outgrown their on-premises IT infrastructure. They are a very successful hedge fund that, despite the current economic situation, has grown very rapidly over the last few years. They are 100% virtualized, with about 130 virtual servers and 100 virtual desktops. Keep in mind that since this is a fast-paced, high-growth hedge fund, the overall workload of these 100 users can easily demand more horsepower than a firm with 1,000+ users. Scaling the on-premises IT infrastructure has been quite the challenge. When you grow to a certain point it becomes very costly and risky to keep your server room in the middle of Manhattan. Many buildings will only give you a certain amount of power for your server room, with no exceptions, and the overall power infrastructure can be very unstable. At the other end of the spectrum, it also doesn't make much sense to keep servers in office space where you are paying $2,000+ a square foot.
With the above factors taken into account, we decided to move the client to a private cloud/colocation infrastructure. There are many moving parts that go into the success of such an involved migration. The design includes upgrades to low-latency, high-bandwidth networking, storage, servers, applications, and disaster recovery, to name a few. However, this post will focus directly on the storage aspect. My client has been running on a lower-end EMC CLARiiON CX-series SAN. Options such as FAST and FAST Cache were added over the years to help keep up with the I/O requirements of the firm. Unfortunately, the SAN could no longer keep up with the heavy, unpredictable virtualized workload, and since the users were working off virtual desktops the pain was felt on a daily basis. Originally the firm was going to stick with EMC and upgrade to either a VNX 5500 or 5700. No doubt, the VNX 5000 series are very capable arrays that can hold up to 500 spindles and handle a large amount of I/O. I had our client start looking at the 3Par arrays for many reasons. First, the 3Pars are designed from the ground up to handle the kind of mixed, unpredictable workload generated by my client's virtualized infrastructure. The 3Par systems are famously known for the way blocks of data are placed on the disks. The virtual volumes on the array are "wide-striped" across all "like" disk spindles, which provides a MASSIVE amount of horsepower. All of this is done at the hardware level via the purpose-built Gen4 ASIC. What this means is that 3Par specifically designed a processor with instruction sets dedicated to many of the storage operations that other vendors would normally offload to a non-purpose-built, off-the-shelf chip. There are a lot more benefits in this area that I will explain in a separate series of posts in the near future.
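To make the wide-striping idea a bit more concrete, here is a minimal Python sketch. It is purely illustrative and not 3Par code; the chunklet size, spindle count, and function names are my own assumptions. The point it demonstrates is simply that every volume is broken into small chunklets and spread round-robin across every "like" spindle, so a busy volume draws on the IOPS of the whole tier instead of one small RAID group.

```python
# Illustrative sketch of wide striping -- not 3Par's actual implementation.
# Assumptions: 256 MB chunklets and a single tier of 32 "like" spindles.

CHUNKLET_MB = 256
SPINDLES = [f"disk{i:02d}" for i in range(32)]

def wide_stripe(volume_name: str, size_gb: int) -> dict[str, list[str]]:
    """Map every chunklet of a virtual volume onto the spindle pool round-robin."""
    chunklets = (size_gb * 1024) // CHUNKLET_MB
    layout: dict[str, list[str]] = {d: [] for d in SPINDLES}
    for c in range(chunklets):
        disk = SPINDLES[c % len(SPINDLES)]
        layout[disk].append(f"{volume_name}:chunklet{c}")
    return layout

layout = wide_stripe("vv_sql01", size_gb=100)
# Every spindle ends up holding a slice of the volume, so a random I/O burst
# against vv_sql01 is serviced by all 32 disks rather than a handful.
print({disk: len(chunks) for disk, chunks in list(layout.items())[:4]})
```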
Our client ended up going with our recommendation of an HP/3Par P10000, two of them actually :) I did a write-up of the new P10000 specifications in another post. We specifically went with the V400 model with four, yes FOUR, controllers! The client demands 100% uptime with no tolerance for slowdowns. One thing that I generally dislike about arrays is the risk that comes along with upgrading the operating environment or firmware. The majority of mid-range SANs, including the VNX series, can only house two storage processors, or controllers. This means that when you upgrade one of those controllers, or if one of them fails, the array will disable cache functionality on the surviving controller and all read/write operations will go directly to disk. This is a built-in safety measure, since a failure of the last surviving controller with data still sitting in cache would result in major data loss. Running on one controller is bad enough, but when you also take away caching, the entire infrastructure takes a serious performance hit. As stated, the 3Par we chose has four controllers and a feature called "Persistent Cache". Not only do we have four meshed active controllers eating away at the I/O operations, but the Persistent Cache feature also protects us if a controller should fail or during an upgrade operation: it rapidly re-mirrors cache to the other nodes in the cluster when a controller goes down. If you have ever been through an EMC upgrade you know that it can take a ton of time, and I know of several cases where an upgrade process rendered a controller useless. This, of course, can happen on any array. This was a huge selling point for the client, and it allows the administrators to be more at ease during an upgrade.
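If it helps to visualize why the fourth controller matters, here is a small conceptual sketch in Python. The class and method names are mine, not 3Par's, and it only models one thing: where mirrored write cache can live. With two controllers, losing one leaves the survivor with nowhere to mirror its cache, so the array has to fall back to write-through. With four nodes and a Persistent Cache-style re-mirror, a single failure still leaves every survivor with a mirror partner.

```python
# Conceptual model of mirrored write cache in a 2-controller array versus a
# 4-node cluster with Persistent Cache-style re-mirroring. Names and behavior
# are simplified assumptions for illustration only.

class Node:
    def __init__(self, name: str):
        self.name = name
        self.healthy = True
        self.mirror_partner: "Node | None" = None

class Cluster:
    def __init__(self, nodes: list[Node]):
        self.nodes = nodes
        self._remirror()

    def _remirror(self) -> None:
        # Each healthy node mirrors its write cache to the next healthy node.
        healthy = [n for n in self.nodes if n.healthy]
        for n in healthy:
            n.mirror_partner = None
        if len(healthy) >= 2:
            for i, n in enumerate(healthy):
                n.mirror_partner = healthy[(i + 1) % len(healthy)]

    def fail_node(self, name: str) -> None:
        for n in self.nodes:
            if n.name == name:
                n.healthy = False
        self._remirror()

    def write_through_required(self) -> bool:
        # If any surviving node has no mirror partner, its cache is unprotected
        # and writes must go straight to disk.
        healthy = [n for n in self.nodes if n.healthy]
        return any(n.mirror_partner is None for n in healthy)

two_ctrl = Cluster([Node("SPA"), Node("SPB")])
two_ctrl.fail_node("SPB")
print(two_ctrl.write_through_required())   # True: cache disabled, writes hit disk

four_node = Cluster([Node("n0"), Node("n1"), Node("n2"), Node("n3")])
four_node.fail_node("n2")
print(four_node.write_through_required())  # False: cache re-mirrored to survivors
```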
In Part 2 of this blog post I will go over some more specifications and the physical assembly process of the entire array.
-Justin Vashisht (3cVguy)