The Real-Time Big Data vSphere Management Problem

Virtualization Management

A very interesting thing happens as your vSphere environment scales up: the larger your environment gets, the more frequently you need data about its performance, capacity, and configuration state. This is simply because the more things there are in the environment, the more likely it is that something is wrong with one of them at any given moment.

The Standard vSphere Management Model

The standard way in which most companies manage vSphere is to configure vCenter to present data in either 5 minute or 30 minute rollups. What in fact happens in either case is that vCenter collects the data from each host every 20 seconds, but it then averages those 20 second polls of data and presents an updated average every 5 minutes or 30 minutes. The only thing that is stored, and made available to management solutions (like vCenter Operations, and the entire third-party ecosystem of performance and capacity management tools), is the rolled up and averaged 5 minute or 30 minute data.

This creates the following set of problems:

  • A whole lot (especially if you have over 1,000 hosts) can go wrong in 4 minutes and 59 seconds, or in 29 minutes and 59 seconds, and you will not see it until you get the updated 5 minute or 30 minute poll of data.
  • The magnitude (severity) of a problem can easily be obscured by the averaging process. Periodic things that are really slow can get averaged out in a flood of other normal data. That means that you can become blinded to things that might seem intermittent to your users, but that are in fact repeatedly bad, but just not happening often enough to stand out.
  • If you try to configure vCenter to collect data more frequently (for example, by setting data collection to Level 3 or Level 4 all of the time), your vCenter database becomes so large so quickly as to become unmanageable.
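To make the averaging problem concrete, here is a minimal sketch in plain Python (the latency numbers are invented for illustration, not real vSphere counters) of how a severe 20-second spike disappears into a 5-minute rollup:

```python
# 5 minutes of 20-second samples = 15 samples per rollup interval.
# Assume disk latency in ms: normal is ~5 ms, one poll catches a 300 ms spike.
samples_ms = [5.0] * 14 + [300.0]

rollup_avg = sum(samples_ms) / len(samples_ms)
peak = max(samples_ms)

print(f"peak sample:  {peak:.0f} ms")        # the problem a user actually felt
print(f"5-min rollup: {rollup_avg:.1f} ms")  # what vCenter stores and shows
```

The 300 ms spike a user actually felt shows up as a 24.7 ms average, which may not even breach an alert threshold.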

The Right Way to Collect Infrastructure Performance Management Data

So what is the right thing to do? The answer is to get as close as you can to all of the data you need, on the following basis:

  • Real Time – Real time is a term highly abused by marketing people. It does not mean that something which becomes available every 20 minutes is collected at the moment it is available and displayed instantly. Real time means that you get the data instantly. In general this means you have to get the data yourself, as commodity APIs do not make data available in real time. So real time means that you are seeing things at the one second or even sub-second level of granularity.
  • Comprehensive – Get all of the data. That means get the data for every infrastructure or application transaction that you care about. Sampling causes you to miss things that matter.
  • Deterministic – Get the actual data, not an estimate or an average of the data.
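The "comprehensive" point is worth a concrete sketch. Below is a toy Python example (the transaction stream and the 1-in-50 sampling rate are invented assumptions) showing how sampling can miss every one of the rare slow transactions that your users actually notice:

```python
import random

random.seed(7)

# Hypothetical stream of 1,000 transaction latencies (ms): mostly fast,
# with a handful of genuinely slow outliers that users would notice.
latencies = [random.uniform(1, 10) for _ in range(1000)]
for i in (123, 456, 789):
    latencies[i] = 500.0  # three real problems, buried in normal traffic

# "Comprehensive" sees every transaction; "sampled" keeps only 1 in 50.
slow_all = [x for x in latencies if x > 100]
sampled = latencies[::50]
slow_sampled = [x for x in sampled if x > 100]

print(len(slow_all), "slow transactions in reality")
print(len(slow_sampled), "slow transactions visible to the sampler")
```

Here the sampler sees zero of the three real problems: intermittent events do not line up with a sampling schedule.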

The Reality of What is Possible with vSphere

The reality is that it is not practical to collect data from vCenter any more frequently than every 5 minutes. Therefore if you want to get any closer to real time (every five minutes is NOT real time) you have to go with a solution that collects the data directly from every vSphere host every 20 seconds, or go with a solution that does not rely upon data from the hypervisor at all.
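To get a feel for why 20-second collection from every host turns into a big data problem, here is a rough back-of-envelope calculation. The host count, VM density, metric count, and bytes-per-sample are illustrative assumptions, not vSphere specifics:

```python
hosts = 1000           # a large vSphere environment
vms_per_host = 20      # assumed consolidation ratio
metrics_per_vm = 50    # CPU, memory, disk, network counters, etc. (assumed)
poll_interval_s = 20
bytes_per_sample = 50  # metric id + timestamp + value, loosely (assumed)

polls_per_day = 86400 // poll_interval_s
samples_per_day = hosts * vms_per_host * metrics_per_vm * polls_per_day
gb_per_day = samples_per_day * bytes_per_sample / 1e9

print(f"{samples_per_day:,} samples/day")
print(f"~{gb_per_day:.0f} GB/day of raw metric data")
```

Even with these conservative assumptions, that is over four billion samples and on the order of 200 GB of raw metric data per day, arriving continuously.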
As soon as you decide that you need the most frequent and best data that vSphere can give you for your large vSphere environment, you have created the following situation for yourself:
  • You have made managing your vSphere environment into a big data problem.
  • This means that a management product that relies upon a back end relational database server will not meet your needs.
  • This means that if you want all of that data to be useful to you in any kind of a reasonable period of time, your management solution had better include some very fancy analytics, like a Complex Event Processing engine or a real time continuous self-learning capability.
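As one illustration of the "continuous self-learning" idea, here is a minimal sketch of an exponentially weighted baseline that flags samples deviating far from the learned norm. The thresholds and the toy metric stream are assumptions for illustration, not any vendor's actual algorithm:

```python
def ewma_anomalies(stream, alpha=0.1, k=4.0):
    """Flag samples more than k std deviations above a learned baseline."""
    mean = var = None
    flagged = []
    for i, x in enumerate(stream):
        if mean is None:
            mean, var = x, 0.0  # seed the baseline from the first sample
            continue
        std = var ** 0.5
        if std > 0 and x > mean + k * std:
            flagged.append(i)
        # Update the baseline only after checking the sample against it.
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return flagged

# 20-second CPU ready samples (ms, invented): steady, then one burst.
stream = [20.0, 21.0, 19.5, 20.5, 20.0, 19.8, 20.2, 20.1, 200.0, 20.3]
print(ewma_anomalies(stream))  # flags index 8, the 200 ms spike
```

The point is that the baseline is learned continuously from the data itself, so the system can flag anomalies at arrival time without anyone configuring static thresholds per metric.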

Some Near Real Time vSphere Management Alternatives

If you cannot get a comprehensive end-to-end real-time picture of the performance of your large vSphere infrastructure today (you can’t, no such solution exists), then you have to get as close as you can get. Here are some great alternatives to consider:
  • Reflex Systems VMC. Reflex collects the 20 second data from every vSphere host, stores it in a big data capable database, analyzes it immediately upon receipt with a complex event processing engine, and then provides you with immediately updated visibility into your environment. Reflex has recently updated their offering, and the announcement is available here.
  • Xangati. Xangati specializes in using an understanding of the network and of storage latency to understand the performance of your vSphere system. This is particularly valuable in VDI situations where problems are almost always network or storage latency related.
  • ExtraHop Networks. ExtraHop provides a physical appliance that sits on a physical mirror port of a switch, and a virtual appliance that sits on the virtual mirror port of virtual switches. ExtraHop positions its product as an Application Performance Management solution (which it is), but ExtraHop also decodes database protocols and storage protocols. This means that ExtraHop can give you real-time (really real time) visibility into how database queries and storage latency are interacting to create problems for you.
  • Virtual Instruments. Virtual Instruments gives you a truly real-time, deterministic, and comprehensive picture of the performance (latency) of every transaction that flows across your Fibre Channel SAN. This means that VI is seeing the performance of all of your storage arrays down to the LUN level with this level of granularity as well.
When your vSphere environment gets big, managing it becomes a big data problem, requiring real time or near real time data collection, complex real time analytics, and the ability to store massive quantities of data arriving at a high data rate.


Posted in IT as a Service, SDDC & Hybrid Cloud