In “Building a Management Stack for Your Software Defined Data Center”, we presented a reference architecture for how one could build a new management stack for the Software Defined Data Center. In “SDDC Operations Management”, we discussed why the SDDC will require a new and different class of Operations Management solutions and laid out a process for selecting vendors to fill the Operations Management role.
The SDDC Management Stack Reference Architecture
The entire reference architecture diagram for the management stack of the SDDC is shown below. The balance of this post is about the Infrastructure Performance Management layer of the reference architecture. Future posts will address the layers of the architecture that are not covered in the linked articles above or in this post.
Why is SDDC Infrastructure Performance Management Necessary?
In this post we discuss a new layer of management tools that will be required to properly manage your Software Defined Data Center (SDDC). That new layer is SDDC Infrastructure Performance Management (IPM). IPM for your SDDC is necessary for the following reasons:
- In the world of physical data centers where entire physical servers were dedicated to each workload, it was possible to 1) massively over-provision those servers so as to minimize the chance that resource constraints would cause application performance problems, and 2) infer the performance of the applications from whether or not the applications were using normal amounts of resources.
- In the physical world, physical networks were provisioned (physical cables to individual servers) and massively over-provisioned switch capacity was thrown at the problem of ensuring that lack of network capacity did not impact application performance. Network performance was most often looked at from the perspective of how saturated the switches and routers were.
- In the physical world, very high-end SANs and very high-end and very expensive storage were often thrown at the problem of application performance in the hope that neither the SAN nor the storage would cause application performance issues. Storage performance was most often looked at in terms of the capacity of the array for IOPS.
In the SDDC, all of the above paradigms for measuring and understanding the performance of the infrastructure no longer work. They no longer work because:
- In existing data center virtualization environments, CPU and memory are abstracted from their physical manifestations and presented as virtual resources to workloads by the hypervisor. Virtual CPUs and virtual memory are not the same thing as their physical counterparts, and they can be over-committed with respect to those counterparts. The degree of abstraction and sharing of these resources in today’s data center virtualization environments alone means that you cannot infer anything about application performance by looking at either physical or virtual CPU and memory resource allocation.
- In the SDDC, the network will be virtualized to the same extent as CPU and memory are today. This means that it will be possible to configure virtual network connections between guests across various local and wide area networks entirely in the network virtualization software. It means that the work of carrying the bits between the guests will be done entirely in software when the guests are in one host, and that the work will be shared between switches implemented in software and hardware switches when the connection spans physical hosts. This will make it impossible to infer how the network is impacting application performance by looking at how the resources on the switches are being utilized.
- In the SDDC, storage will be virtualized as well. This may turn out to mean several different things. At the minimum it will mean that more of the configuration of storage will be done in the SDDC software and less will be done in the arrays themselves. This alone will mean that it will be necessary to hold the SDDC software to account at least in part for the impact of storage performance upon application performance. It is also possible that various pools of storage (for example disks installed in servers) will be aggregated into pools of virtual storage managed by the SDDC. Since it will be SDDC software doing the pooling of the physical resources, it will again be necessary to hold the SDDC software to account for impacts to application performance caused by how storage is pooled.
In summary, the existing method of inferring the performance of the infrastructure by looking at how infrastructure resources are being utilized will not work in an SDDC. An entirely new approach is needed.
SDDC Infrastructure Performance Management Defined
In order to understand the performance of the hardware and software infrastructure that comprises an SDDC, we need to make one simple yet profound leap. We need to define infrastructure performance as end-to-end infrastructure latency and throughput. In other words, when it comes to understanding the performance of the SDDC we need to focus upon two numbers. One is throughput, which is how much work the SDDC is being asked to do. Throughput can be measured in a variety of ways, including calls per second, bytes per second, or IOPS. The other is latency, which is how long it takes for the infrastructure to respond to a request for work from a workload. Think of that as an infrastructure transaction that might start with a web server making a call to a Java server, a Java server making a call to a database server, and a database server making a call to storage, and think of the round trip time that this infrastructure transaction takes. In addition to measuring infrastructure latency, an IPM solution must also meet the following criteria:
- End-to-end infrastructure latency is not available from standard management APIs like SNMP, WMI and SMI-S. Therefore, in order to be an IPM product, the product must collect its own infrastructure latency data.
- Infrastructure data should be collected in as near a real time basis as possible. For the purpose of this post, every one second would be defined as near real time.
- Consuming the storage latency data from the vSphere API does not make a product into an IPM solution because that data is only available once every five minutes.
- Latency data should be collected in as comprehensive a manner as possible. That means not missing any anomalies or spikes in the data. This is another area where just consuming the storage latency data from the vSphere API falls short as that data is an average of 15 data points (three times a minute for five minutes) which means that peaks in latency are obscured by the averaging process.
- Latency data should be collected in a deterministic manner. That means that the reported latency number should be an actual measurement of an actual latency, and not an average or an estimate of a latency measurement.
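To make the averaging problem concrete, here is a minimal sketch in Python. It assumes a five-minute window of 15 storage latency samples (three per minute, as described above) in which a single sample spikes. The sample values and the function names are invented for illustration and do not come from any vendor's product or API.

```python
from statistics import mean

# Hypothetical latency samples in milliseconds: three per minute for five
# minutes (15 samples). Fourteen are healthy; one is a short, severe spike.
# These values are made up purely for illustration.
samples_ms = [5.0] * 7 + [500.0] + [5.0] * 7


def five_minute_average(samples):
    """Collapse the window into the single averaged value a five-minute poll would report."""
    return mean(samples)


def worst_sample(samples):
    """Return the largest individual measurement in the window."""
    return max(samples)


average_ms = five_minute_average(samples_ms)
peak_ms = worst_sample(samples_ms)

print(f"five-minute average: {average_ms:.1f} ms")
print(f"worst single sample: {peak_ms:.1f} ms")
```

With these made-up numbers the averaged value is 38 ms while the worst individual measurement is 500 ms. A tool that sees only the average would report a mildly elevated latency and miss the spike entirely, which is why the criteria above call for frequent, individually measured latency values.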
Will VMware Provide an Infrastructure Performance Management Solution?
When VMware announced its network virtualization offering, NSX, Bruce Davie, one of the senior architects on the VMware networking team (and a former senior architect from Nicira), posted “Open Source, Open Interfaces, and Open Networking” on the CTO blog. That post contained the following comment:
“Statistics Collection & Telemetry
Another area of focus for an open networking ecosystem should be defining a framework for common storage and query of real time and historical performance data and statistics gathered from all devices and functional blocks participating in the network. This is an area that doesn’t exist today. Similar to Quantum, the framework should provide for vendor specific extensions and plug-ins. For example, a fabric vendor might be able to provide telemetry for fabric link utilization, failure events and the hosts affected, and supply a plug-in for a Tool vendor to query that data and subscribe to network events.”
This can be read as an intention on the part of VMware to use the virtualized network layer as a way to collect real time latency information on behalf of all of the workloads in the SDDC. This is a huge step forward for both the SDDC and for VMware as a performance management vendor.
Vendors Providing Infrastructure Performance Management Solutions Today
Since the SDDC does not exist yet, SDDC Infrastructure Performance Management cannot exist yet either. But there are several vendors who are already focusing upon latency and throughput as the key metrics of physical and virtual data center performance. Those vendors are listed in the table below. The one thing that they have in common is that they have all invested heavily in collecting real time latency from various aspects of the infrastructure and in building management products that can cope with the flood of data that results from real-time instrumentation.
The IPM Innovators
The vendors in the Infrastructure Performance Management business and brief descriptions of their products are in the table below. Subsequent posts will go into these products in a great deal more detail.
| Vendor | Product | Focus of the Product | Data Collection Methods | Deployment Model |
| --- | --- | --- | --- | --- |
| Dell/Quest Software | vFoglight Storage | Monitoring of Infrastructure Response Time from the application in the VM to the spindle of the array and back again. | Specific integration with the CLI of the storage array, SMI-S, and the vCenter API data. | Deployed as a virtual appliance in a VMware environment. |
| ExtraHop Networks | ExtraHop | Infrastructure Response Time is collected end-to-end (from guest to spindle on the storage array) for network attached storage. | Physical appliance on physical mirror ports of switches, virtual appliance on virtual mirror ports of vSwitches in the VMware environment. | Deployed as a physical appliance attached to a mirror port on a physical switch, and/or a virtual appliance attached to a virtual mirror port on the vSphere vSwitch. |
| Gigamon | GigaVUE-VM | Infrastructure Response Time is collected for each application identified via port and protocol from the guest through the entire IP network (LAN, WAN, and IP storage). | Physical and virtual taps into the IP infrastructure. | Deployed as one virtual appliance on the vSwitch in each VMware host, physical appliances on the physical mirror ports on the LAN switches, and one management appliance. |
| Riverbed | Cascade | Monitoring of the virtual VXLAN network. | Instrumentation of the new VXLAN interface in vSphere 5.1. | Deployed as a virtual appliance in a VMware environment. |
| Virtual Instruments | VirtualWisdom | Measures the response times of individual Fibre Channel frames and maps them to LUNs. | vCenter APIs, FC switch polling through SNMP, proprietary taps into the SAN fabric. | Deployed as a physical TAP on the Fibre Channel SAN, plus standards-based interfaces to switches and vCenter. |
| Xangati | VI Dashboard | Infrastructure Response Time is collected through IP SLA technology and Xangati’s Remote Object Viewer. | vCenter APIs, external directories like DNS and Active Directory, NetFlow data from physical and virtual switches/routers. | Deployed as a virtual appliance in each VMware host and a separate virtual appliance for the management dashboard. |
As the first SDDC is delivered later this year, infrastructure performance management solutions will be an essential part of SDDC performance management. The good news is that a robust set of vendors already exists who can readily enhance their offerings to address the incremental requirements of the SDDC.