VSA Resources: Smoking Gun or Red Herring?

In a previous article, I wrote that customers don’t care whether a hyperconverged solution uses a virtual storage appliance (VSA) or runs the storage cluster in-kernel. I stand by that assertion. One of the comments pointed out that I had missed an area of discussion: the resource requirements of the VSA itself. I still don’t think that customers care, but for completeness, I’ll examine those requirements.

The argument is that the VSA that most hyperconverged infrastructure (HCI) vendors use to provide shared storage is usually a fairly beefy VM. The resources allocated to the VSA are not available to run workload VMs, so by this logic a VSA-based HCI can run fewer VMs than an in-kernel HCI. The problem with this argument is that most of the VSA’s resources are doing storage cluster work. Moving the same storage cluster into the kernel requires almost the same resources. The big difference with in-kernel resource usage is that there isn’t a single thing you can easily point to as consuming those resources. VSA resource usage is all assigned to the VSA; in-kernel resource usage can’t be attributed to a single object. There is no smoking gun of resource usage.

The first thing to do is to quantify the size of these VSA VMs. In my experience, the VSA on each HCI node uses between 24 GB and 96 GB of RAM. The exact number depends on the amount of storage to be managed and the data-efficiency features in use. More persistent storage capacity means more RAM is required simply to keep track of the objects being stored. Usually, it is the amount of storage capacity in each node that drives the RAM requirement, rather than the total size of the cluster. Adding deduplication increases the amount of RAM used, because it adds metadata that must be kept in the fastest storage medium available. Having tiered storage (SSD plus HDD) also means more RAM used for metadata. The other big use of RAM in the VSA is for optimizing performance. Spare RAM in the VSA is usually used as a read cache, which speeds up reads of the most frequently accessed data blocks.

These RAM requirements aren’t specific to an HCI platform. Pretty much every modern storage array uses a lot of RAM for exactly the same purposes. It’s not unusual for the storage controllers in a mid-range array to have 256 GB of RAM each. And there is my key point: to do the work that the VSA is doing, you need a certain amount of resources. If you do the same work in a storage array, you need similar resources, and if you do the same work in-kernel, you need almost the same resources again.

For the vast majority of HCI customers, the VSA will use around 10% to 20% of the resources of each HCI node. This is a resource demand that needs to be accounted for in capacity planning. Your HCI nodes will deliver fewer compute resources to workload VMs than if the same physical servers accessed a Fibre Channel SAN. Whether the work is done in-kernel or by a VSA may make a difference of a couple of gigabytes of RAM and a little CPU time on each node. This isn’t a significant amount of resources to most customers.
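The capacity-planning arithmetic in the paragraph above can be sketched in a few lines of Python. This is only an illustration: the 512 GB node size and the function name `usable_for_workloads` are assumptions made for the example, and the 10% and 20% figures are the rough range quoted above, not measurements from any particular product.

```python
def usable_for_workloads(installed_ram_gb, storage_overhead_fraction):
    """Return the RAM (in GB) left for workload VMs on one node.

    storage_overhead_fraction is the share of the node consumed by the
    storage layer, whether that layer is a VSA or runs in-kernel.
    """
    if not 0 <= storage_overhead_fraction < 1:
        raise ValueError("overhead fraction must be in the range [0, 1)")
    return installed_ram_gb * (1 - storage_overhead_fraction)


# Hypothetical node size, chosen only for illustration.
NODE_RAM_GB = 512

for fraction in (0.10, 0.20):
    usable = usable_for_workloads(NODE_RAM_GB, fraction)
    print(f"{fraction:.0%} storage overhead leaves {usable:.1f} GB for workload VMs")
```

The same calculation applies to CPU. The point is that the usable figure, not the installed figure, is the one to plan against.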

Circling back to the original point, customers don’t care whether your storage cluster is VSA or in-kernel. They do care how much of each node’s physical resources are available to run VMs. This is the RAM and CPU capacity that they want on your HCI spec sheet. Installed RAM capacity that they cannot use for workload VMs is not interesting. The whole point of HCI is to focus on the workload VMs.

What about the small number of customers for whom a few GB of RAM and a few GHz of CPU are important? A customer whose entire VM estate uses 64 GB of RAM (e.g., eight VMs with 8 GB of RAM each) is likely to be concerned with every last physical GB. If they need a three-node cluster and each node loses 8 GB of RAM to the VSA operating system, then the overhead is significant. These customers still don’t care if it’s a VSA or in-kernel. What they care about is that your storage cluster scales down its resource usage: scales it way down. This scaling is hard; making the same solution work for both ten VMs and ten thousand VMs is difficult. I would be surprised if the hyperconverged products that support thousands of VMs scaled down to do a great job of running ten VMs. Similarly, I doubt that a product designed to support a dozen VMs will effectively solve the problems of a customer with fifty dozen VMs.
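To put a number on "significant", here is a small sketch of the example above: eight 8 GB VMs on a three-node cluster where each node gives up 8 GB to the storage layer. The function name `storage_overhead_ratio` is made up for this illustration; the inputs are the figures from the paragraph.

```python
def storage_overhead_ratio(nodes, overhead_per_node_gb, workload_ram_gb):
    """Return cluster-wide storage overhead as a fraction of workload RAM."""
    total_overhead_gb = nodes * overhead_per_node_gb
    return total_overhead_gb / workload_ram_gb


# Figures from the example in the text.
vm_count = 8
ram_per_vm_gb = 8
workload_ram_gb = vm_count * ram_per_vm_gb  # 64 GB for the whole VM estate

ratio = storage_overhead_ratio(
    nodes=3, overhead_per_node_gb=8, workload_ram_gb=workload_ram_gb
)
print(f"Storage overhead equals {ratio:.1%} of the RAM the workload needs")
```

Across three nodes the overhead is 24 GB, which is more than a third of the 64 GB the workload itself needs. That is why this customer cares about how far the storage cluster scales down, rather than about where it runs.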

Customers do not care if your HCI uses an in-kernel storage cluster or a VSA. They care whether your HCI is a good solution to their problems. Can your platform run their VMs in a cost-effective way? How easy is your HCI to operate, upgrade, and maintain? Does it offer replication, backups, and DR that will solve their problems? Customers should not have to care how your HCI works; it’s a red herring. Sell your customers on what your HCI does for them.

Posted in SDDC & Hybrid Cloud

4 Comments on "VSA Resources: Smoking Gun or Red Herring?"

Alastair, I have enjoyed reading your articles. I think it very important that vendors try to win over customers on the merits of their HCI offering and not focus on semantics like how those merits are working in the background. However, I am not sure I agree with one of your core pieces of rationale for why VSA=Kernel: “Moving the same storage cluster into the kernel requires almost the same resources.” In my experience, this has not shown to be the case, but personal experiences aside, I think the time is right to perform an empirical study comparing the resource…

Hello Chip, While we would very much like to do such research, it is almost impossible to do without picking an In-Kernel or VSA that an HCI vendor is NOT using and therefore open such research to the same type of comments. Could we use VSAN and StoreVirtual as an example? Yes, we could but that is not the same as picking the internals of Nutanix, Scale, Simplivity, etc. There are not many In-Kernel that we can just install and use. We would need a stack that is A) 100% software, B) un-entangled by hardware restrictions/requirements. Now that being said,…

Perhaps it would be a good start to gather figures on most-common configurations often seen and deployed. I agree it is somewhat problematic either way, but figures don’t necessarily have to be framed in a competitive manner, merely test results from which others are free to draw their own conclusions.