IO IO it is off to Storage and IO metrics we go

By Greg Schulz, Server and StorageIO @storageio

IO, IO, it’s off to storage and IO metrics we go, shifting gears a bit from the recent four-part series around SSD topics for physical and virtual environments. A while back I did a post about why VASA is important to have in your VMware CASA, along with another piece about Windows boot IO and storage performance impact on VDI planning. Among other things, what those two pieces have in common is a theme around the importance of storage and IO metrics that matter.

There are many different types of metrics pertaining to storage and IO that can be grouped into Performance, Availability, Capacity, Economics and Energy (PACE). Every application or workload has some characteristic attributes of Performance, Availability, Capacity, Economics and Energy that can be further broken down into more detailed metrics.

  • Performance: IOPS or activity including transactions or access, bandwidth, throughput or data being moved, along with response or wait time, latency and queuing or delays.
  • Availability: Includes basic and high availability (HA), backup/restore, business continuance (BC) and disaster recovery (DR). Also included are general reliability, availability and serviceability (RAS), along with planned and unplanned downtime. Another dimension to availability is mean time between failures (MTBF), annual failure rate (AFR), mean time to repair (MTTR) and number of nines or percent of time available in a year or other interval. Endurance or duty time is another availability metric that has a bearing on NAND flash SSD, which wears out over time from repeated program/erase (P/E) cycles.
  • Capacity: Includes amount of space including disk storage, memory, processing capabilities, ports or connectivity among others. Other capacity related metrics include data footprint reduction ratios such as with compression or dedupe. Capacity can be measured as raw or unconfigured, configured, provisioned or allocated, and used or unused (free space). Another aspect of capacity is overhead due to data protection such as mirroring, snapshots and replication, along with file system or other configuration space. Apart from what is consumed by use, capacity should stay constant; however, over time usable space can decrease as errors or bad blocks are detected on magnetic media, or as P/E cycles wear out SSD. In addition to hardware, software licenses are also a part of capacity.
  • Energy and economics: Costs are the obvious economic metrics, tied to a given resource usage or configuration such as performance, availability, capacity or energy, and either capital or operating. Energy metrics can take the form of British thermal units (Btu) to measure heat produced by doing work, electrical power consumption in kilowatt-hours (kWh), and efficiency measures such as The Green Grid power usage effectiveness (PUE) among others.
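To make a few of the availability metrics above more concrete, here is a minimal Python sketch of how they relate. The function names are mine for illustration, the year is assumed to be 8,760 hours, and the AFR calculation assumes a constant failure rate (an exponential model), which is a common simplification and not the only way to do it.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: 365-day year, leap years ignored


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time in service, given MTBF and MTTR in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Hours of downtime per year implied by an availability fraction."""
    return (1.0 - avail) * HOURS_PER_YEAR


def annual_failure_rate(mtbf_hours: float) -> float:
    """AFR as a fraction, assuming a constant failure rate (assumption)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    # Walk through two, three, four and five nines of availability.
    for nines in (0.99, 0.999, 0.9999, 0.99999):
        hours = downtime_hours_per_year(nines)
        print(f"{nines:.5f} available -> {hours:.3f} hours down per year")
```

For example, three nines (99.9 percent) works out to about 8.76 hours of downtime per year, which is a lot easier to talk about with an application owner than a percentage.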

Additional metrics tied to the above include errors or events along with health and status. Then there are service related metrics including service level agreements (SLAs), service level objectives (SLOs) and quality of service (QoS), along with recovery time objective (RTO) and recovery point objective (RPO). The above and other metrics can be used as is, or combined to create compound metrics for various purposes. For example, combining different metrics to determine IOPS per watt of energy, or cost of bandwidth per watt of energy for a given availability or protection level.
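As a rough sketch of what a compound metric looks like in practice, the Python below divides one basic metric by another. The DeviceProfile class, its field names and the sample numbers are all hypothetical and made up for illustration; plug in your own measured values.

```python
from dataclasses import dataclass


@dataclass
class DeviceProfile:
    """Hypothetical container for a few basic PACE metrics."""
    name: str
    iops: float          # measured activity rate
    watts: float         # power draw while doing that work
    capacity_gb: float   # usable capacity
    cost_dollars: float  # acquisition cost

    def iops_per_watt(self) -> float:
        return self.iops / self.watts

    def dollars_per_gb(self) -> float:
        return self.cost_dollars / self.capacity_gb

    def dollars_per_iops(self) -> float:
        return self.cost_dollars / self.iops


if __name__ == "__main__":
    # Made-up numbers, only to show the arithmetic.
    devices = [
        DeviceProfile("example-hdd", iops=200, watts=8, capacity_gb=4000, cost_dollars=200),
        DeviceProfile("example-ssd", iops=50000, watts=5, capacity_gb=500, cost_dollars=400),
    ]
    for d in devices:
        print(d.name, d.iops_per_watt(), d.dollars_per_gb(), d.dollars_per_iops())
```

The point is not the specific values; it is that the same device can look great on one compound metric (cost per GByte) and poor on another (cost per IOPS), so pick the ones that match what you are trying to accomplish.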

For the past several years, a popular storage related metric has been cost per capacity such as dollar per GByte or dollar per TByte, which still applies today. However, more recently there has been a growing awareness around Input/output Operations Per Second (IOPS), which helps to expand the discussion beyond simply a cost per capacity basis. IOPS are good for discussing reads or writes, files or web pages accessed, transactions or other forms of random and sequential, big and small activity. On the other hand, IOPS do not provide the entire picture of what is going on with storage and IO from a performance perspective. That requires also looking at bandwidth or throughput, also known as data transfer rates, aka bytes read and written.

Another metric that comes into play is latency or response time, which is a measurement of how long you have to wait to get your data or work accomplished. IOPS, bandwidth and latency are all interrelated and thus there is a cause and effect to be aware of. For example, if you just installed a 10GbE network adapter on a fast server with lots of fast memory, attached to a fast network switch, accessing a fast storage device using iSCSI or NAS (NFS or CIFS), and are concerned you are not getting the throughput speed, do you have a problem? It depends! First is determining if your application is capable of generating the workload and, more importantly, whether the IOs are big or small.
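Here is a small Python sketch of how IOPS, bandwidth and latency tie together. Bandwidth is simply IOPS multiplied by IO size, and Little's Law relates the number of outstanding IOs to IOPS and latency. The function names are my own, and the 10GbE figure is the raw line rate of 10 billion bits per second with protocol overhead ignored, so treat the results as ceilings and not as what you will actually see.

```python
def bandwidth_bytes_per_sec(iops: float, io_size_bytes: int) -> float:
    """Data moved per second for a given activity rate and IO size."""
    return iops * io_size_bytes


def iops_to_fill_link(link_bits_per_sec: float, io_size_bytes: int) -> float:
    """IOPS needed to saturate a link at a given IO size (no overhead)."""
    return (link_bits_per_sec / 8) / io_size_bytes


def iops_from_latency(outstanding_ios: int, latency_sec: float) -> float:
    """Little's Law: throughput = work in flight / time per item."""
    return outstanding_ios / latency_sec


TEN_GBE_BITS_PER_SEC = 10e9  # raw line rate; assumption: overhead ignored

if __name__ == "__main__":
    for size in (512, 4096, 65536, 1048576):
        needed = iops_to_fill_link(TEN_GBE_BITS_PER_SEC, size)
        print(f"{size:>8} byte IOs: about {needed:,.0f} IOPS to fill 10GbE")
```

With 4KB IOs it takes roughly 305,000 IOPS to fill a 10GbE link, while with 1MB IOs it takes only around 1,200. If your application is doing a few thousand small IOs per second, low bandwidth utilization is exactly what you should expect.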

If you have empirical baseline metrics from the past showing that response time improved when you upgraded to 10GbE, and that more activity (transactions, IOPS, frames or packets per second) is being processed, you may not have a problem. Likewise, if the amount of data being read or written increased in proportion to the increase in activity, you probably do not have a problem other than looking at the wrong metrics. What I mean by that is if you were focused on bandwidth and your application is doing lots of small activity (IOPS, transactions, data transfers), you should not be surprised to see low bandwidth. After all, what is your focus: making the application able to do more work and enabling workers to wait less for information, or driving up bandwidth?

On the other hand, if you were just looking at IOPS and your application was doing large random or sequential activity, you should not be surprised to see low IOPS with high bandwidth, not to mention a longer response time. For example, if you bought an SSD based or enabled solution and are not seeing anywhere close to the claimed IOPS performance, perhaps the issue is an apples to oranges one. What I mean is that some SSD based benchmarks are done with very small reads of 1KB (1,024 bytes), a single 512-byte page (1/2 KB) or even smaller 64-byte IOs. If those IO sizes are representative of your environment, then they are applicable. However, if your environment is more likely to be doing 65% random reads with an average size of 4KB or 8KB along with a mix of random and sequential writes, then out of this world type benchmarks, while interesting, are probably not applicable.
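To see why a headline IOPS number at a tiny IO size does not carry over to larger IOs, here is a deliberately simplistic Python sketch. It assumes the device moves the same number of bytes per second regardless of IO size, which real devices rarely do, so use it only as a sanity check on apples to oranges comparisons and not as a prediction. The function name and the one million IOPS claim are hypothetical.

```python
def equivalent_iops(claimed_iops: float, claimed_io_bytes: int,
                    your_io_bytes: int) -> float:
    """Rescale an IOPS claim to another IO size at the same data rate.

    Assumption: bytes per second stays constant across IO sizes.
    Real devices also have per-IO overhead, so this is only a rough guide.
    """
    data_rate = claimed_iops * claimed_io_bytes  # bytes per second
    return data_rate / your_io_bytes


if __name__ == "__main__":
    claim = 1_000_000  # hypothetical vendor claim, measured at 512 bytes
    for size in (512, 4096, 8192):
        print(f"{size:>5} byte IOs: {equivalent_iops(claim, 512, size):,.0f} IOPS")
```

Under that assumption, a claim of one million IOPS at 512 bytes is the same data rate as 62,500 IOPS at 8KB. The benchmark is not wrong; it is just answering a different question than the one your workload is asking.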

In other words, the metrics that matter for your environment are those that are applicable to your needs and requirements. They might be performance, availability, capacity, energy or economic centric, or some mix or variation of them all. Don’t be scared of server, storage and IO metrics; instead, learn more about them, including what matters when, where and why, along with how to get and use them.

We are just scratching the surface here with more to cover later. So until next time, IO IO it’s off to storage and IO metrics I go.
