Big Data Operations Management

Virtualization and cloud computing are not just innovations that require existing operations management solutions to support new environments. Instead, virtualized and cloud-based environments are so different from their predecessors that an entirely new management stack will have to be built in order to manage them effectively. This new stack will be so different that it will replace, rather than augment, the incumbent management stacks from legacy vendors. This ushers in the era of Big Data Operations Management.

When do you need Big Data Operations Management?

The short answer is that, depending upon who you are and what you are trying to do, Virtual Operations Management may or may not be a big data problem. If your environment is reasonably small, neither your applications nor their supporting infrastructure require high fidelity monitoring, and you are not changing the environment very frequently, then Virtual Operations Management can be addressed without using big data approaches.

However, if your environment is large (let’s say 1,000 physical hosts or more) and you are running business critical and performance critical applications in that environment, then you are going to want high fidelity monitoring of that environment. This is particularly true if short outages or brownouts are a severe business problem for those applications. If this is the case, then highly granular and extremely frequent data collection will be critical, and you will want to focus on the following factors:

  • Real time data collection. Every five minutes is not good enough, even if that five minute data point is an average of 15 samples collected every 20 seconds over that five minute period. You are going to want something that collects the data from each vSphere host itself every 20 seconds, and ideally even more frequently, although 20 seconds is a current limitation of vSphere. The key question to ask here is, “How long am I willing to wait between the time that something bad happens and the time that the data collection system in my management product notices it?” For online business critical applications the answer might be no more than one second.
  • Real time event processing. Once the management system has collected the data and has detected the presence of a problem, how long does it take for that system to raise an alert or take an action? This is what many products refer to as “real time”, but most of them ignore the delays in collecting the data in the first place, as mentioned above.
  • Comprehensive data collection. This means not missing events, peaks in response time or latency, or drops in throughput. It also means broadening the waterfront of what is collected to include data from a variety of sources.
  • Deterministic data collection. This means getting as close as possible to the actual value that matters, and not an estimate. Averaging occurs at all levels of the data collection stack. Operating systems inherently sample and provide either periodic samples or rolled up averages of multiple samples. Averaging at any level obscures valuable data and can seriously mislead one into thinking everything is OK when it is not.
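To make the averaging problem concrete, here is a minimal sketch in Python. The latency values and the 100 ms alert threshold are invented for illustration; only the 20-second sampling interval and the 15-samples-per-five-minutes rollup come from the discussion above. It shows how a single severe spike disappears once the window is collapsed into one averaged data point.

```python
from statistics import mean

SAMPLE_INTERVAL_S = 20   # from the text: one sample every 20 seconds
THRESHOLD_MS = 100       # hypothetical alert threshold (an assumption)

# Hypothetical latency samples in milliseconds for one 5-minute window
# (15 samples). The values are made up: a steady 10 ms with one 400 ms spike.
samples_ms = [10, 10, 10, 10, 10, 10, 10, 400, 10, 10, 10, 10, 10, 10, 10]


def rollup_average(samples):
    """Collapse a window of samples into the single averaged data point."""
    return mean(samples)


def samples_over(samples, threshold):
    """Return (offset_seconds, value) for every raw sample above the threshold."""
    return [
        (index * SAMPLE_INTERVAL_S, value)
        for index, value in enumerate(samples)
        if value > threshold
    ]


five_minute_average = rollup_average(samples_ms)
violations = samples_over(samples_ms, THRESHOLD_MS)

# The averaged point stays under the threshold, so a system that only sees
# the rollup raises no alert. The raw samples show the spike and when it hit.
print("5-minute average (ms):", five_minute_average)
print("average breaches threshold:", five_minute_average > THRESHOLD_MS)
print("raw samples breaching threshold:", violations)
```

In this made-up window the rollup reports a healthy-looking value while one raw sample is four times the threshold, which is the kind of brownout a five minute average can hide.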

In summary, if you have a large virtualized environment running business critical or performance critical applications, with any kind of automated operations (DRS turned on) or any kind of cloud (with workloads introduced via self-service provisioning), operating that environment might be sufficiently challenging to require the kind of data collection that would drive the selection of a big data approach. For most enterprises, the critical change here will be the forthcoming virtualization of business critical applications. Moving these applications from dedicated and over-provisioned hardware to shared and dynamic environments will be fiercely resisted by the owners of those applications unless assurances of highly competent operations management can be provided.
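A rough back-of-the-envelope calculation suggests why this becomes a big data problem. In the sketch below, the host count and the 20-second and five minute intervals come from the discussion above; the number of VMs per host and the number of metrics per monitored object are purely hypothetical assumptions chosen for illustration, so the totals should be read as orders of magnitude, not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60

HOSTS = 1_000             # from the text: "1,000 physical hosts or more"
VMS_PER_HOST = 20         # assumption, for illustration only
METRICS_PER_OBJECT = 50   # assumption, for illustration only


def samples_per_day(hosts, vms_per_host, metrics_per_object, interval_s):
    """Estimate how many metric samples are collected per day."""
    monitored_objects = hosts * (1 + vms_per_host)   # each host plus its VMs
    collections_per_day = SECONDS_PER_DAY // interval_s
    return monitored_objects * metrics_per_object * collections_per_day


every_20_seconds = samples_per_day(HOSTS, VMS_PER_HOST, METRICS_PER_OBJECT, 20)
every_5_minutes = samples_per_day(HOSTS, VMS_PER_HOST, METRICS_PER_OBJECT, 300)

print(f"20-second collection: {every_20_seconds:,} samples/day")
print(f"5-minute collection:  {every_5_minutes:,} samples/day")
print(f"ratio: {every_20_seconds // every_5_minutes}x")
```

Under these assumptions, moving from five minute rollups to 20-second collection multiplies the data volume by 15, into the billions of samples per day, before any log or event data is added.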

The Splunk Approach to Big Data Operations Management

Splunk has achieved the ability to apply big data principles to operations management by starting with a clean sheet of paper and solving a related problem: the collection and indexing of log data. It turns out that long before virtualization caused operations management to be a big data problem, frequently collecting logs from numerous sources was already a big data problem. Splunk therefore designed an architecture around three principles: 1) as many data collectors (forwarders) as are needed to capture the data may be deployed, 2) the process of finding the relationships between the data can be handled in a scaled out manner through the use of MapReduce, a technique for breaking work into chunks so that they can be spread among N commodity computers, and 3) the search and query functions are distributed across as many scaled out search heads as are necessary to do the work. Splunk therefore combines the ability to collect a nearly unlimited amount of data with the ability to cope with that data in a scale out manner at all three layers of its core architecture. The search language and visualization capabilities help put the performance and log data to use not just for operational monitoring, but also for capacity planning, usage tracking and analytics, chargeback, security and user intelligence.
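The MapReduce idea mentioned above can be sketched generically in a few lines of Python. This is not Splunk's implementation or API; the log line format, the host names, and all function names are hypothetical, and the chunks are processed in a single process here, whereas a real deployment would hand each chunk to a different machine.

```python
from collections import Counter
from functools import reduce

# Hypothetical log lines; the "host level message" format is an assumption.
LOG_LINES = [
    "esx01 ERROR disk latency high",
    "esx02 INFO vm powered on",
    "esx01 WARN memory ballooning",
    "esx03 ERROR nic link down",
    "esx01 ERROR disk latency high",
    "esx02 ERROR datastore full",
]


def split_into_chunks(items, n_chunks):
    """Divide the work into roughly equal chunks, one per worker."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def map_chunk(lines):
    """Map step: count ERROR lines per host within a single chunk."""
    counts = Counter()
    for line in lines:
        host, level, _message = line.split(" ", 2)
        if level == "ERROR":
            counts[host] += 1
    return counts


def merge_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


chunks = split_into_chunks(LOG_LINES, 3)
partial_counts = [map_chunk(chunk) for chunk in chunks]  # one call per worker
errors_per_host = reduce(merge_counts, partial_counts, Counter())

print(dict(errors_per_host))
```

Because each map call only sees its own chunk and the merge step is associative, adding more workers changes how the work is divided but not the final counts, which is what lets the approach spread across N commodity computers.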


The Splunk Big Data Operations Management Architecture

The Reflex Systems Approach to Big Data Operations Management

Reflex Systems’ enterprise software solutions provide the ability to manage, scale, and automate virtualized data centers utilizing big data principles. Reflex Systems collects vast amounts of streaming data from various sources in the virtualization layer and its supporting hardware on a near real-time basis. The solution is designed to scale up to the largest VMware environments through the use of distributed collection nodes and graph-based data mapping techniques. A variety of data, in the form of performance and capacity metrics, configuration and topology changes, resource allocation and utilization, events, and alarms, is collected, stored and correlated. Reflex leverages its own federated domain specific language, the virtualization query language (VQL), and complex event processing (CEP) technology to allow the query and analysis of data in real time. From there, the Reflex software can wrap context around the data and present it to users in a meaningful way through various user-targeted applications, for example resource right sizing, capacity planning, or identification of resource contention.

The Reflex Big Data Operations Management Architecture


The CloudPhysics Approach to Big Data Operations Management

CloudPhysics is unique in the Operations Management space on two fronts. First, CloudPhysics offers its solution as a secure hosted service, meaning that it hosts the big data back end in its cloud so you do not have to. Second, CloudPhysics is investing heavily in the use of big data analytics in its back end. The combination of a hosted service and big data analytics allows CloudPhysics to analyze data across customers (in a blinded manner, of course). This allows the company to provide a level of value to its customers that cannot be provided in the absence of the cross-customer database and the analytics applied to it.

The CloudPhysics Approach to Big Data Operations Management

The Xangati Approach to Big Data Operations Management

Xangati focuses upon collecting performance data from the physical and virtual networks and from the storage layer supporting the virtualized environment in as near-real time a manner as possible, and then analyzing this information in real time via an in-memory database and associated real-time analytics. The ability to cross-correlate disparate information allows Xangati to detect what it calls “storms” in a live and continuous manner. Xangati is therefore a near-real time infrastructure performance management solution, designed to let the user find issues in the environment quickly enough that, in many cases, they can be resolved before they have a material impact upon the performance of the workloads and applications in the environment.

The Xangati Big Data Operations Management Architecture

If you are running business critical or performance critical applications in a shared environment (virtualized or in a public cloud), then monitoring that environment might be a big data problem. If your environment is large enough to make monitoring a big data problem, then a product from a vendor like CloudPhysics, Reflex Systems, Splunk or Xangati might be a good choice.