Who Will Debug?

There is a growing push for people to learn less about the systems in which they run their applications. It started with converged infrastructures and moved into hyperconverged, and now I see it continue to grow with Docker and other container technologies. This puzzles me. While it makes the developer’s life easier, does it really make anyone else’s life easier? Do we really need to consider the stack anymore?

In highly distributed systems, there are still several key areas that need consideration, knowledge, and aptitude: network, storage, security, application APIs, and code. I see most people concentrating on application APIs and code while leaving the other areas to be handled by someone else or ignoring them due to an “it just works” mentality. What has happened to the systems engineer of the past? And more importantly, what happens when the system goes wrong and you have to debug it?

DevOps has an answer to that: kill it, and start over. Well, that is great, but my cattle are in herds; what impacts one cow could impact the herd. So I need to call in a veterinarian to diagnose my cow’s illness and determine if the rest of the herd is infected with a problem. I wrote on this mentality in the past. I like to herd my workloads around rather than kill them and start over. Why? Mainly because I want to learn more, solve the problem, and then move on.

I think the “kill and start over” mentality sprang from Microsoft’s method of debugging Windows: “Just reboot; it will fix everything.” But that is just not true. Rebooting does not fix anything. It masks everything, but the underlying problem is still there.

For example, I was told by a group that its systems would enter read-only mode on a regular basis but that if it rebooted the hypervisor, everything would be fixed for a set amount of time. Eventually, that failure spread to the entire cluster, and reboots became more and more painful. The ultimate root cause of the problem was failing, but not yet failed, hardware. This is hard to detect unless you know exactly how the system works, what to look for, and how to read the myriad logs within the stack, from the hardware up to the hypervisor, to the guest operating system, and eventually to the application log, for all nodes in the cluster. It was, after all, a cluster issue.
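To make that kind of cross-layer log reading concrete, here is a minimal Python sketch of one way to approach it: merge the logs from every layer and node into a single timeline, then look at what happened just before the symptom. Everything in it is an assumption for illustration, not taken from the incident above: the line format (an ISO-8601 timestamp followed by a message), the layer names, the helper names, and the sample log lines are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
import heapq


@dataclass(frozen=True)
class LogEvent:
    timestamp: datetime
    node: str
    layer: str  # e.g. "hardware", "hypervisor", "guest", "application"
    message: str


def parse_line(node, layer, line):
    # Assumed line format: "<ISO-8601 timestamp> <message>"
    stamp, _, message = line.partition(" ")
    return LogEvent(datetime.fromisoformat(stamp), node, layer, message)


def merged_timeline(sources):
    """Merge per-(node, layer) logs into one time-ordered list.

    `sources` maps (node, layer) to a list of lines that is already
    in time order, which is what heapq.merge requires.
    """
    streams = [
        [parse_line(node, layer, line) for line in lines]
        for (node, layer), lines in sources.items()
    ]
    return list(heapq.merge(*streams, key=lambda event: event.timestamp))


def events_before(timeline, symptom_text, window_seconds):
    """Return events in the window leading up to the first symptom line."""
    symptom = next((e for e in timeline if symptom_text in e.message), None)
    if symptom is None:
        return []
    return [
        e for e in timeline
        if e is not symptom
        and 0 <= (symptom.timestamp - e.timestamp).total_seconds() <= window_seconds
    ]


# Hypothetical sample data, invented purely for this sketch.
SAMPLE_LOGS = {
    ("node1", "hardware"): ["2024-01-01T10:00:00 disk sda: corrected read error"],
    ("node1", "hypervisor"): ["2024-01-01T10:00:20 storage path latency high"],
    ("node1", "guest"): ["2024-01-01T10:00:45 filesystem remounted read-only"],
    ("node2", "application"): [
        "2024-01-01T09:30:00 request served",
        "2024-01-01T10:00:50 write failed",
    ],
}

if __name__ == "__main__":
    timeline = merged_timeline(SAMPLE_LOGS)
    for event in events_before(timeline, "read-only", window_seconds=60):
        print(event.timestamp.isoformat(), event.node, event.layer, event.message)
```

A merged timeline like this does not find the root cause for you. It only puts the hardware, hypervisor, guest, and application events side by side so that someone who understands the stack can see which layer complained first.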

That one problem makes me wonder if hyperconverged architectures will have the same issue and the only solution will be to reboot the environment, or just to switch out all hardware without first finding the issue. I just do not have a warm feeling that those building and selling EVO:RAIL and other HCI environments have the skills to do deep root cause analysis. At the moment, most tools that claim to do root cause analysis are not very deep and have too many false positives, or require you to be able to read the myriad logs once more to find the real answer. vRealize Operations, with its vRealize Log Insight integration, requires just that. Ultimately, it is the logs to which you refer.

When looking at Docker/containers and HCI, one question to answer is: how do you determine the true root cause of any problem? For HCI, this implies that support departments should contain senior-level, knowledgeable people who can work off a script, do the research, and provide an answer quickly and with much thought and consideration. I do not see many container or HCI support groups that include senior-level people.

So, who debugs your stack? How do you find the root cause of any problem? This is, after all, still required, and it has to happen without pointing fingers. Furthermore, do the vendors who provide HCI have the knowledge and aptitude to provide this level of support?
