Almost every enterprise I have spoken to about its experience virtualizing anything more than simple or tactical applications has come across one or more applications that did not perform well once virtualized. In most cases these were applications that used low amounts of CPU and reasonable, predictable amounts of memory, so it stood to reason that resource conflicts in these areas were not the cause of the performance issues. Most of these enterprises had dramatically over-configured the network support for the virtualized hosts with multiple teamed NICs and redundant HBAs, so there was every reason to believe that the network was not the issue either.
That leaves us with the most likely culprit: the storage networking and physical storage layers of the virtual infrastructure. A wide variety of problems can occur in these layers, all of which can create serious performance problems for applications and users. Trying to avoid these problems without adequate tools typically causes enterprises to over-provision, which adds large amounts of cost and often still does not address the issues. Here are some of the top issues that frequently occur in the SAN and storage layers:
- Lack of visibility into I/O flow and I/O performance between virtualized servers, the supporting SAN, and the storage arrays. While there are quite a few tools that monitor the performance of the virtual infrastructure by collecting VMware vCenter Server data, and perhaps some data from the guests via WMI, very few of these tools provide a commensurate level of visibility into the SAN and storage layers.
- Understanding the mapping of guests to SAN ports, to LUNs, to physical arrays and spindles. The storage infrastructure is often a vast shared resource, with the virtual infrastructure as just one of its users. This is a particular problem in the virtualized world, as the server layer (in the form of guests) is by design highly dynamic and subject to frequent manual or automated administrative actions (VMware VMotion, DRS, and the creation of new guests). Therefore this mapping needs to be discovered, and then re-discovered on a fairly frequent basis to capture changes since the last discovery.
- Collection of data in a scalable and non-intrusive manner. The SAN and the underlying storage layer are subject to massive data flows in response to application workloads. It is a massive technical challenge to collect data from these infrastructure components in a manner that is comprehensive, frequent, and scalable up to very large systems, without having the monitoring process itself introduce performance issues into the equation.
- Understanding configuration-based I/O operation conflicts. How many enterprises do you know that have accidentally mapped the LUN supporting the indexes for an Oracle database to the same physical spindle as the LUN that supports an Exchange data store? How many enterprises do you know that actually have no idea how the high-demand applications in their environment map to the underlying SAN ports, LUNs, and spindles?
- Did those VMware DRS-based VMotions solve the problem or create the problem? VMware DRS performs VMotions of guests based upon CPU and memory resource contention in the pools of server resources where DRS is turned on. However, VMotions (especially several at the same time) often cause severe I/O contention. For these reasons, some people even recommend not using DRS for business-critical transactional applications. Even if you take this step, that still does not prevent DRS from performing a VMotion of a guest in a way that creates a SAN bottleneck, since DRS is unaware of which servers use which SAN ports.
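Two of the issues above, the guest-to-spindle mapping and configuration-based conflicts, can be made concrete with a small sketch. The data model and all names below are hypothetical simplifications, not any vendor's actual implementation: a chain of mappings from guests to SAN ports, ports to LUNs, and LUNs to spindles, plus a check for spindles shared by high-demand LUNs.

```python
# Hypothetical, simplified model of the guest -> SAN port -> LUN -> spindle
# chain. In practice this would be discovered from the vCenter and array
# management APIs, and re-discovered frequently, because VMotion/DRS can
# change guest placement between passes.
guest_to_port = {"oracle_vm": "fabricA/port3", "exchange_vm": "fabricA/port7"}
port_to_luns = {"fabricA/port3": ["oracle_idx"], "fabricA/port7": ["exchange_ds"]}
lun_to_spindles = {"oracle_idx": {12, 13}, "exchange_ds": {12}}

def spindles_for_guest(guest):
    """Walk the chain to find the physical spindles a guest depends on."""
    spindles = set()
    for lun in port_to_luns.get(guest_to_port.get(guest), []):
        spindles |= lun_to_spindles.get(lun, set())
    return spindles

def find_spindle_conflicts(high_demand_luns):
    """Return spindles backing more than one high-demand LUN -- the
    'Oracle index on the same spindle as the Exchange store' mistake."""
    by_spindle = {}
    for lun in high_demand_luns:
        for spindle in lun_to_spindles.get(lun, set()):
            by_spindle.setdefault(spindle, set()).add(lun)
    return {s: luns for s, luns in by_spindle.items() if len(luns) > 1}

print(spindles_for_guest("oracle_vm"))                        # {12, 13}
print(find_spindle_conflicts({"oracle_idx", "exchange_ds"}))  # spindle 12 is shared
```

The point of the rediscovery comment is the one made above: because guests move, any such map is only as good as its last refresh.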
There are two vendors that specialize in addressing these issues. Both collect unique data about the SAN and/or storage layer of the virtual infrastructure and combine it with the commodity VMware vCenter Server data that can be collected from the VMware APIs. Let’s take a look at each of them.
There are many solutions that focus upon helping enterprises manage the capacity of resources on a per-resource-pool basis. However, most of these solutions focus upon CPU and memory capacity, and are unaware of the SAN and storage implications of their capacity recommendations.
Akorri is able to map how guests are configured with respect to LUNs and arrays and, more importantly, to understand the load and the infrastructure response time that arrays are providing to their upstream guests and applications. Akorri is therefore the only product in the virtualization performance management space that can tell you that a spindle is providing slow responses to I/O requests, and which guests and applications on which host are impacted by that slow response time.
Akorri builds upon this unique view of the configuration and performance of virtualized systems with sophisticated pre-packaged analytics. The product calculates an Infrastructure Response Time (IRT) metric, a composite of how the entire infrastructure is responding to I/O requests from each guest and each major application in the guests. This is complemented by a Performance Index (PI) that determines, from a CPU, memory, and I/O perspective, where an application sits on the trade-off curve between performance and capacity utilization. IRT is a major step toward allowing the IT operations group to actually ensure the performance of business-critical applications, and PI allows IT operations to make intelligent trade-offs between investment in additional hardware and required performance (and to eliminate a great deal of over-provisioning, which in turn often pays for the purchase of the Akorri solution).
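Akorri does not publish the formulas behind IRT and PI, so the sketch below is purely illustrative: it treats a composite infrastructure response time as the summed latency across each stage of the I/O path, and the trade-off score as utilization achieved per unit of response-time headroom consumed. Both functions and all numbers are assumptions for the sake of the example, not the actual metrics.

```python
def infrastructure_response_time(stage_latencies_ms):
    """Illustrative composite: total time an I/O spends across each stage
    of the path (guest queue, HBA, fabric, array cache, spindle).
    The real IRT metric is proprietary; this simply sums per-stage latency."""
    return sum(stage_latencies_ms.values())

def performance_index(utilization, response_ms, target_ms):
    """Illustrative trade-off score: utilization weighted by the fraction
    of response-time headroom still available. Higher is better."""
    if response_ms >= target_ms:
        return 0.0  # already violating the target; no headroom left
    return utilization * (1 - response_ms / target_ms)

# Hypothetical per-stage latencies for one guest's I/O path, in milliseconds.
path = {"guest_queue": 0.4, "hba": 0.1, "fabric": 0.3, "array_cache": 0.2, "spindle": 6.0}
irt = infrastructure_response_time(path)  # 7.0 ms, dominated by the spindle
```

Even this toy version shows why such a composite is useful: it makes the dominant stage (here, the spindle) immediately visible, which is exactly the kind of conclusion the product draws for real workloads.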
Virtual Instruments (VI) is a truly unique vendor, with a unique product and a unique heritage. Virtual Instruments is focused upon monitoring the SAN (the actual Storage Area NETWORK). VI is a spinout from Finisar, the company that makes the optical transceivers that go into Brocade and Cisco SAN switches. This gives the company deep technical resources and credibility regarding how the SAN actually works, and how to monitor it in a scalable and non-intrusive manner.
The VI product, VirtualWisdom, works by putting a tap into the fiber-optic cables that run from the SAN switches to the storage arrays (remember the Finisar legacy that gives the company the technical chops to pull this off). This allows VirtualWisdom to see every last bit of data that runs through these links, which is the first aspect of the product that is so unique. Very few monitoring products in the world do not sample at some level. Most products are simply not built to handle the volume of data that comes with comprehensive, deterministic data collection, nor could they do so without creating performance problems for the environment being monitored. However, since VirtualWisdom is attached to a read-only tap on the fiber-optic network of the SAN, it is able to collect every bit of data that flows over that link. This means that VirtualWisdom is one of the very few products that cannot miss an issue because it happened to occur between five-minute, or even (in the case of some products) one-hour, sampling intervals.
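The difference between interval sampling and full capture is easy to illustrate with a toy simulation (the numbers are hypothetical and unrelated to any real VirtualWisdom data): a ten-second latency spike that a five-minute sampler never observes, but that a continuous trace records in full.

```python
# Toy illustration: per-second latency for 10 minutes, with a brief
# 10-second spike starting at t = 120 s.
trace = [2.0] * 600                 # 2 ms baseline latency, one reading per second
for t in range(120, 130):
    trace[t] = 80.0                 # short-lived 80 ms spike

# A 5-minute (300 s) sampler only sees the instants it polls: t = 0 and t = 300.
sampled = [trace[t] for t in range(0, 600, 300)]
full_capture_max = max(trace)

print(max(sampled))        # 2.0  -> the sampler never saw the spike
print(full_capture_max)    # 80.0 -> continuous capture records it
```

This is the core of the argument above: transient I/O problems routinely fit entirely inside a sampling interval, so only deterministic, continuous collection can guarantee they are seen.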
By analyzing the frames that comprise the SAN data, and combining this data with the VMware vCenter Server data, VirtualWisdom is able to tie load and response-time problems from guests in the virtual infrastructure to the target LUNs. The measurement point used by the product allows it to scale up to the largest and most sophisticated environments (the company boasts some of the largest and most sophisticated data centers in the world as customers). There is also a very nice hard-dollar ROI story for VirtualWisdom. It turns out that because most enterprises do not know how utilized their SAN ports are, or which ones are heavily utilized and which are not, most enterprises dramatically over-provision SAN ports. It is often the case that VirtualWisdom pays for itself for an enterprise that is growing its storage network, since the company can recommend a dramatic reduction in expensive SAN ports going forward (and provide the data to back up that recommendation).
There is one other factor that will come into play this year as VMware vSphere rolls out and as monitoring vendors start to take advantage of some of the additional unique APIs in vSphere. The vStorage and VMsafe APIs give monitoring vendors opportunities to collect data about the I/O layer of the virtual infrastructure in a highly leveraged and scalable manner. Of course, this will require that the storage vendors deliver vStorage plug-ins for their arrays before the vStorage and VMsafe APIs can be used to collect this data, so this may take a year or so to play out. So while this new interface holds great long-term promise for enterprises seeking to optimize virtual infrastructure performance and capacity, companies like Akorri and Virtual Instruments have the solution now.