A very interesting thing happens as your vSphere environment scales up: the larger your environment gets, the more frequently you need data about its performance, capacity, and configuration state. The more things there are in the environment, the more likely it is that something is wrong with one of them at any given moment.

The Standard vSphere Management Model

The standard way most companies manage vSphere is to configure vCenter to present data in either 5 minute or 30 minute rollups. In either case, vCenter collects the data from each host every 20 seconds, then averages those 20 second polls and presents an updated average every 5 minutes or 30 minutes. The only thing that is stored and made available to management solutions (like vCenter Operations and the entire third-party ecosystem of performance and capacity management tools) is the rolled-up, averaged 5 minute or 30 minute data.

This creates the following set of problems:

  • A whole lot (especially if you have over 1000 hosts) can go wrong in 4 minutes and 59 seconds or 29 minutes and 59 seconds, and you will not see it until you get the updated 5 minute or 30 minute poll of data.
  • The magnitude (severity) of a problem can easily be obscured by the averaging process. Periodic things that are really slow can get averaged out in a flood of other normal data. That means you can be blind to problems that seem intermittent to your users but are in fact repeatedly bad, just not happening often enough to stand out.
  • If you try to configure vCenter to collect data more frequently (for example, by setting data collection to Level 3 or Level 4 all of the time), your vCenter database becomes so large so quickly as to become unmanageable.
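
To make the averaging problem concrete, here is a minimal Python sketch. The latency values are invented for illustration, and the rollup function is a simplified stand-in for the averaging described above, not vCenter's actual implementation.

```python
from statistics import mean

SAMPLE_SECONDS = 20    # per-host sample interval described in the article
ROLLUP_SECONDS = 300   # 5 minute rollup


def rollup(samples, sample_seconds=SAMPLE_SECONDS, rollup_seconds=ROLLUP_SECONDS):
    """Average consecutive raw samples into fixed-size rollup buckets."""
    per_bucket = rollup_seconds // sample_seconds
    return [
        mean(samples[i:i + per_bucket])
        for i in range(0, len(samples), per_bucket)
    ]


# Hypothetical storage latency in milliseconds: 15 samples cover one
# 5 minute window, and one of them is a 400 ms stall.
raw = [5.0] * 15
raw[7] = 400.0

rolled = rollup(raw)

print("worst 20 second sample (ms):", max(raw))
print("5 minute rollup (ms):", [round(v, 1) for v in rolled])
```

In this made-up window the worst 20 second sample is 400 ms, but the single 5 minute data point comes out at roughly 31 ms, which looks merely mediocre rather than broken.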

The Right Way to Collect Infrastructure Performance Management Data

So what is the right thing to do? The answer is to get as close as you can to collecting all of the data you need on the following basis:

  • Real Time – A term highly abused by marketing people. Real time does not mean that if something is available every 20 minutes, you collect it the moment it is available and display it instantly. Real time means that you get the data instantly. In general, this means you have to get the data yourself, as commodity APIs do not make data available in real time. So real time means that you are seeing things at one second or even sub-second granularity.
  • Comprehensive – Get all of the data. That means get the data for every infrastructure or application transaction that you care about. Sampling causes you to miss things that matter.
  • Deterministic – Get the actual data, not an estimate or an average of the data.

The Reality of What is Possible with vSphere

The reality is that it is not practical to collect data from vCenter any more frequently than every 5 minutes. Therefore, if you want to get any closer to real time (every five minutes is NOT real time), you have to go with a solution that collects the data directly from every vSphere host every 20 seconds, or go with a solution that does not rely upon data from the hypervisor at all.

As soon as you decide that you need the most frequent and best data that vSphere can give you for your large vSphere environment, you have created the following situation for yourself:

  • You have made managing your vSphere environment into a big data problem.
  • This means that a management product that relies upon a back-end relational database server will not meet your needs.
  • This means that if you want all of that data to be useful to you in any kind of reasonable period of time, your management solution had better include some very fancy analytics, like a Complex Event Processing engine or a real-time continuous self-learning capability.
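
A rough back-of-the-envelope calculation shows how quickly the volume grows. The host count echoes the 1000 host figure mentioned earlier; the number of metrics per host is a purely hypothetical value chosen for illustration, so substitute your own.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(hosts, metrics_per_host, interval_seconds):
    """Number of individual data points collected per day."""
    return hosts * metrics_per_host * (SECONDS_PER_DAY // interval_seconds)


HOSTS = 1000              # environment size used as an example in the article
METRICS_PER_HOST = 2000   # hypothetical: host, VM, and device counters combined

at_20s = samples_per_day(HOSTS, METRICS_PER_HOST, 20)
at_5min = samples_per_day(HOSTS, METRICS_PER_HOST, 300)

print(f"20 second collection: {at_20s:,} data points per day")
print(f"5 minute rollups:     {at_5min:,} data points per day")
print(f"ratio: {at_20s // at_5min}x")
```

Whatever the real metric count is, the ratio is fixed: collecting every 20 seconds instead of every 5 minutes means 15 times as many data points to ingest, store, and analyze.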

Some Near Real Time vSphere Management Alternatives

If you cannot get a comprehensive end-to-end real-time picture of the performance of your large vSphere infrastructure today (you can’t; no such solution exists), then you have to get as close as you can. Here are some great alternatives to consider:

  • Reflex Systems VMC. Reflex collects the 20 second data from every vSphere host, stores it in a big data capable database, analyzes it immediately upon receipt with a complex event processing engine, and then gives you immediately updated visibility into your environment. Reflex has recently updated their offering and the announcement is available here.
  • Xangati. Xangati specializes in using an understanding of the network and of storage latency to understand the performance of your vSphere system. This is particularly valuable in VDI situations, where problems are almost always related to network or storage latency.
  • ExtraHop Networks. ExtraHop provides a physical appliance that sits on a physical mirror port of a switch, and a virtual appliance that sits on the virtual mirror port on virtual switches. ExtraHop positions its product as an Applications Performance Management solution (which it is), but ExtraHop also decodes database protocols and storage protocols. This means that ExtraHop can give you real-time (really real time) visibility into how database queries and storage latency are interacting to create problems for you.
  • Virtual Instruments. Virtual Instruments gives you a truly real-time, deterministic, and comprehensive picture of the performance (latency) of every transaction that flows across your Fibre Channel SAN. This means that VI sees the performance of all of your storage arrays, down to the LUN level, at the same granularity.

Conclusion

When your vSphere environment gets big, managing it becomes a big data problem, requiring real time or near real time data collection, complex real time analytics, and the ability to store massive quantities of data arriving at a high data rate.

 

Bernd Harzog (336 Posts)

Bernd Harzog is the Analyst at The Virtualization Practice for Performance and Capacity Management and IT as a Service (Private Cloud).

Bernd is also the CEO and founder of APM Experts, a company that provides strategic marketing services to vendors in the virtualization performance management and application performance management markets.

Prior to these two companies, Bernd was the CEO of RTO Software, the VP Products at Netuitive, a General Manager at Xcellenet, and Research Director for Systems Software at Gartner Group. Bernd has an MBA in Marketing from the University of Chicago.


8 comments for “The Real-Time Big Data vSphere Management Problem”

  1. May 11, 2012 at 10:16 AM

    Not correct, quite easy to get 20 sec near real-time data from vSphere vCenter’s APIs or GUI ;-)
    Eg.
    Customize Performance Chart dialog box, select the Memory resource type and the Real-Time display interval.

  2. Bharzog
    May 12, 2012 at 11:01 AM

    Hi Lars,

    Good. Now configure (if you can) vCenter to collect every metric that it can from every server every 20 seconds and to roll it up as quickly as possible. In other words, set it to Level 4 data collection and see what happens. If you have any kind of a decent size production environment, please do not do it there, as this will crater your vCenter. If you are foolish enough to do this, please try it in a test/dev environment.

    Cheers,

    Bernd

  3. vicky
    May 12, 2012 at 1:21 PM

    I don’t know the impact of setting vCenter to level 4 data collection, but will try soon in my vlab. But what I know is that Xangati, Reflex Systems and ExtraHop Networks are sponsors of this site.
    Cheers!

  4. May 12, 2012 at 1:43 PM

    I said that you in fact can get 20 sec near real-time data from the vSphere APIs or in the GUI itself. I did not say or advise to set up vCenter to collect those 20 sec statistics in the database; that would be foolish in a production environment. It is quite easy to pull those values off vCenter and store them in an external db though, as we do in our environment.

    Best Regards

    Lars Wean

  5. May 13, 2012 at 12:28 PM

    Of course, I am talking about the not archived “real-time” (past-hour) stats, which are refreshed every 20 seconds, and are displayed for the past hour in the VI client. These stats are not stored in the database.
    One hour of real time statistics are often good enough when troubleshooting issues. You can also use esxtop in batch mode and import the data into excel or esxplot, http://labs.vmware.com/flings/esxplot

    Best Regards

    Lars Wean

  6. Bharzog
    May 14, 2012 at 9:26 AM

    Hi Vicky,

    Go to the Performance Management page on our site (http://www.virtualizationpractice.com/topics/performance-management/). Look at the logos on the right side. Notice that there are 26 vendors in the Performance Management business for virtualization and the cloud that are sponsors of this topic on our site. Also notice that the article mentioned Virtual Instruments, who is not a sponsor of the site. If you are suggesting that we are playing favorites, then I think the evidence refutes that suggestion.

    Cheers,

    Bernd

  7. May 25, 2012 at 1:41 AM

    I am a product manager for Precise Software Solutions.
    Precise uses an alternative approach for the Big Data problem outlined here.
    Our approach is to focus on the metrics that actually affect the application’s performance and the overall user experience.

    When troubleshooting an application performance issue, you care more about what has changed in the environment, than about the absolute values of all possibly collected metrics.
    With this approach, the performance solution correlates the collected data with the current and baseline application behavior, and loads the correlated results into the repository.
    This greatly reduces the amount of data that is going into the repository, while making sure that the user gets the important metrics when analyzing a performance issue.

    Regards,
    Assaf

  8. June 3, 2012 at 6:40 AM

    @Assaf Good strategy !
