Virtualization Performance Management – What If We Started Over?

Javier Soltero, the founder of Hyperic and now the CTO of the SaaS and Application Services Group at VMware, just put up a very interesting post called The GreenField Enterprise, which asks “How would we do it if we could start over?” While there are probably no enterprises that can actually start over from scratch, this is an important analytical exercise, since it can identify the right way to go about things without those decisions being encumbered by the legacy of what we have done before.

Javier’s post covers Infrastructure, Applications, Services, Security and Management. For the balance of this post we will focus upon the relationship between Infrastructure and Management of that Infrastructure. So first of all, what would our next generation infrastructure look like if we had a clean sheet of paper? We would like to propose that it would adhere to the following principles:

  • That our infrastructure would be organized into horizontal layers with truly standard and open interfaces between the layers. Those layers would probably very much mirror the layers that we have today – storage, storage networking, local area networking (or perhaps the previous two converged as is the current trend), compute (CPU and memory), a user access network, and finally end user devices. If the infrastructure in fact goes in this direction this would be a direct repudiation of everything that Oracle is doing.
  • That due to this more formal layering it would be much easier to replace one product from vendor A with another product from vendor B within each layer of the stack. Vendors will of course resist this since they want to make it hard for their products to be replaced – but the trend in this direction is already strong and likely irreversible.
  • That this infrastructure would be much more self-instrumenting than our current infrastructure is. The infrastructure itself would be largely responsible for collecting load, utilization, and response time or latency information on every request that is placed on each component of the infrastructure. The infrastructure would also be responsible for publishing this stream of information to management products that subscribe to the stream, eliminating the complexity and overhead of having tools poll the infrastructure for metrics. This is necessary because as workloads become more dynamic and the rate of change in workloads increases (due to virtualization), it will become increasingly difficult for bolt-on, after-the-fact instrumentation to collect the required level of performance data often enough to guarantee infrastructure and application performance.
  • That this infrastructure and its self-instrumentation is at its core multi-tenant. In other words, the infrastructure needs to be aware of who the “customer” is for each CPU execution cycle, each memory read or write, each I/O request, each packet sent over a network, and each read or write from or to the storage array. As a result of this multi-tenant aware self-instrumentation it will be possible for the infrastructure not only to report to each individual customer what the latency is for that customer’s workloads, but also to self-configure to guarantee that latency (or infrastructure response time) SLAs are met for specified business critical workloads. Note that really demanding and sensitive enterprise workloads are probably not going to go into public clouds until this issue and secure multi-tenancy are resolved.
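
To make the last two principles concrete, here is a minimal sketch of what a published, tenant-tagged metric stream could look like from a management tool's point of view. Everything in it is an assumption made for illustration: the names MetricStream, Sample, and TenantLatencyView, the field names, and the sample values are hypothetical and do not correspond to any existing product or API. The point is only that the infrastructure pushes samples to subscribers (no polling), and that each sample carries the identity of the customer it was measured for.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class Sample:
    """One latency measurement, tagged with the tenant it was taken for."""
    tenant: str        # hypothetical customer identifier
    layer: str         # e.g. "storage", "network", "compute"
    latency_ms: float


class MetricStream:
    """Hypothetical push-style feed: the infrastructure publishes, tools subscribe."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Sample], None]] = []

    def subscribe(self, callback: Callable[[Sample], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, sample: Sample) -> None:
        # Every subscriber sees every sample as it is produced.
        for callback in self._subscribers:
            callback(sample)


class TenantLatencyView:
    """A subscribing tool that keeps per-tenant latency and checks an SLA."""

    def __init__(self) -> None:
        self._by_tenant: DefaultDict[str, List[float]] = defaultdict(list)

    def on_sample(self, sample: Sample) -> None:
        self._by_tenant[sample.tenant].append(sample.latency_ms)

    def worst_latency_ms(self, tenant: str) -> float:
        # 0.0 when no samples have been seen for this tenant.
        return max(self._by_tenant.get(tenant, []), default=0.0)

    def sla_breached(self, tenant: str, sla_ms: float) -> bool:
        return self.worst_latency_ms(tenant) > sla_ms


# Usage with made-up values: two tenants sharing the same storage layer.
stream = MetricStream()
view = TenantLatencyView()
stream.subscribe(view.on_sample)

stream.publish(Sample(tenant="acme", layer="storage", latency_ms=4.0))
stream.publish(Sample(tenant="acme", layer="network", latency_ms=12.5))
stream.publish(Sample(tenant="globex", layer="storage", latency_ms=3.0))

print(view.worst_latency_ms("acme"), view.sla_breached("acme", sla_ms=10.0))
```

A real feed would also need timestamps, backpressure, and access control so that one tenant cannot read another tenant's samples; the sketch leaves all of that out.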

Implications for the Infrastructure

If indeed the infrastructure for computing evolves in the directions above then this will mean a wholesale reinvention of every hardware product currently in use in the data center. Some potential changes might be:

  • Storage might actually get commoditized. It might actually be possible to go down to Fry’s and get a disk and plug it into your EMC array and have it work. There would be a standard way to get mappings of workloads to LUNs and spindles along with I/O latency information out of all of the arrays. The current practices on the part of some leading storage vendors (EMC) to make this information as hard to get as possible would be abandoned.
  • The really expensive SAN infrastructure (Fibre Channel) that is predominantly in use today might get replaced with something that is as ubiquitous, inexpensive, and open as Ethernet (FCoE is a great bit of progress down this path, but we still have far to go on the openness and inexpensive parts). A really good start down this path would be to have the next generation of Ethernet support guaranteed delivery of packets so that one common physical LAN infrastructure could be used for everything from servers to disk arrays (no more HBAs anywhere). On the instrumentation front, taps would be built into the network at each layer, allowing for easy capture of transaction level data on the network.
  • The relationship between the compute resources and the network might totally change again. The recent trend has been to move more of the network (a virtual switch) into the compute servers and essentially out of the hardware switches themselves. This trend really goes against all of the larger and longer term trends in networking, which have been for the network to get smarter and for the network to become more self-managing. It might just turn out that the right thing to do is to move the virtual networking layer that is in vSphere back into the switches so that virtual switching can be provided by switch vendors to any vendor that has a virtualization platform running on the upstream servers.
  • If the networking moves back into the switches then servers become much simpler devices that are essentially just compute and memory resources for workloads. This would be the opposite of a Cisco UCS.
  • The infrastructure itself must become more aware of applications, their topology, their identity, and their performance. A simple and standard method to identify applications and their topologies would be provided by all of the layers of the infrastructure via standard interfaces to monitoring tools. This data would also be directly linked to the associated latency information for each application. In other words the infrastructure would be able to tell any tool that asked what applications were running on it, the topology of those applications at any moment in time, and the end-to-end infrastructure latency for each application.
  • The software running in servers should be as simple and as thin as possible. Unnecessary layers of software running in hosts and guests should be eliminated. Pursuant to the two points directly above, it really does not make sense to steal compute cycles from applications to do network switching in servers. Maybe the answer is a hardware switch blade for every N compute blades in a system, or maybe the answer is to move the switching back out into the switches themselves. There are also tremendous opportunities to remove layers of software from virtual machines. Anti-virus and backup agents have already been removed. Most companies remove legacy monitoring agents and replace that functionality by feeding vCenter data to management tools. In the case of applications written to frameworks like Ruby, Java and .NET, it is possible to remove most of the operating system from the guests.

Implications for Management (Monitoring) Software

Management (monitoring) software today focuses a tremendous amount of energy (and code) on actually collecting data in a reliable manner. Very few third-party management solutions are able to collect data in the kind of real-time, comprehensive and deterministic manner described above. If the infrastructure becomes self-instrumenting, can collect its own utilization and latency information in a real-time, comprehensive and deterministic manner, and publishes that via a standard interface, then the entire management tool industry might change along the following lines:

  • Tool vendors will spend dramatically less time, money, and coding effort on collecting data.
  • Tool vendors will be able to (and will have to) shift their focus to the interpretation of data for various purposes. If real infrastructure performance data becomes as readily available as WMI data is from Windows servers, then tool vendors will have to climb up the value chain and, instead of competing on what they monitor (collect data from) and whose devices they support, focus upon what value in terms of analysis they can add to the data. This will place a premium upon analytics of various types.
  • Tools would then segment into categories according to what vendors use the available data for. Capacity Management (making sure that momentary and projected near term capacity bottlenecks do not occur), Performance Management (monitoring the response time of the infrastructure and the applications in a real-time and continuous manner), Performance Assurance (guaranteeing the performance of key applications by organizing resources on behalf of these applications), and Chargeback (how many transactions at what level of required response time did you run) are all logical management applications which could be layered on this data.
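
As an illustration of how thin such a layered application becomes once collection is no longer its problem, here is a rough sketch of the Chargeback case: counting each customer's transactions by the response time they required and pricing them accordingly. The record format, the tier boundaries, and the prices (expressed in abstract "credits") are all invented for the example and are not taken from any real product.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical price list: credits charged per transaction, keyed by the
# required response-time tier in milliseconds. Tighter tiers cost more.
CREDITS_PER_TXN: Dict[int, int] = {10: 5, 100: 1}

# One record per transaction: (tenant, required response-time tier in ms).
Record = Tuple[str, int]


def chargeback(records: Iterable[Record]) -> Dict[str, int]:
    """Total credits owed per tenant for the transactions it ran."""
    totals: Dict[str, int] = defaultdict(int)
    for tenant, tier_ms in records:
        totals[tenant] += CREDITS_PER_TXN[tier_ms]
    return dict(totals)


# Usage with made-up records, as they might arrive from the infrastructure feed.
records = [
    ("acme", 10),
    ("acme", 10),
    ("acme", 100),
    ("globex", 100),
]
bill = chargeback(records)
print(bill)
```

All of the hard work here is in the data that the function is handed. With self-instrumenting, multi-tenant infrastructure that data would simply be available, and the tool is left to compete on what it does with it.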

All of which leads us directly to VMware and Integrien. The actions of VMware in this regard are going to be very interesting to watch. VMware has been better about building instrumentation into its infrastructure earlier in its life than any previous systems software vendor. VMware’s vSphere, vCloud, and vFabric products sit at very interesting places in the compute stack – places that give VMware software the ability to see things in a generic manner that no other systems software vendor can see across that range of environments. VMware already collects enormous amounts of data about its own platform from within that platform (and does not even expose most of it). The existence of these real-time data feeds in the VMware vSphere platform, combined with the Integrien acquisition, may in fact mean that VMware has already figured this out and is going in this direction. If so, this means that VMware has a very disruptive strategy in mind for its own set of management tools and is not going to simply reinvent what the third-party ecosystem has been offering for some time. VMware may simply choose to aggressively instrument its own layers of software infrastructure, point these real-time data feeds at Integrien, and let Integrien figure out what is good and what is bad at each layer of the stack.


If we are going to start over, why not really start over and reinvent the entire infrastructure and management software industries in the process? That way we end up with an infrastructure that was actually designed for the dynamic, agile, scalable, and multi-tenant use cases that we are trying to address with a green field approach, and an appropriate set of management tools as well. Is this going to happen? You can bet that there are already VC-funded startups in stealth mode working on various layers of this problem.