Learning from What Went Wrong: The Affordable Care Act Web Portal

Those of us who work on complex computer systems know that it can be a daunting task to get all the different systems to communicate and work properly. The bigger the infrastructure gets, the more complex it becomes. Now, take the most complex system that you have designed or worked with and increase the complexity a hundredfold, and that might give you an idea of the complexity involved with the design and deployment of the Affordable Care Act web portal.

You would have to be hiding under a rock not to have at least heard about the release and grand opening of the latest major cloud offering, the Affordable Care Act marketplace. Now, I have the same knowledge as the rest of you, gleaned from what I have heard and read on the internet, but I have to admit that this deployment has really captured my attention and piqued my curiosity about what the different problems are with the code, systems, and/or design as well as what can be learned and taken away from this “difficult” deployment.

I am not sure whether a full root cause analysis will be released once the system becomes fully functional, but I have read about several problem areas, such as the code itself, the overall design, and capacity. Since the Affordable Care Act portal runs in a cloud, adding resources to handle greater load should be one of the easier fixes.
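To make the capacity point concrete, here is a minimal sketch of the kind of back-of-the-envelope sizing that cloud elasticity enables. All of the numbers, the function name, and the per-server throughput figure are illustrative assumptions on my part, not figures from the actual deployment.

```python
# Hypothetical sketch: estimating how many web servers a cloud
# deployment would need to absorb a traffic spike. The numbers are
# invented for illustration.

import math

def instances_needed(peak_requests_per_sec: float,
                     capacity_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Return the instance count required to serve the peak load,
    with a safety margin (headroom) for bursts."""
    required = peak_requests_per_sec * (1 + headroom) / capacity_per_instance
    return math.ceil(required)

# Example: a spike of 50,000 req/s against servers that each
# comfortably handle 500 req/s.
print(instances_needed(50_000, 500))  # 130 instances with 30% headroom
```

In a cloud, acting on a calculation like this is a configuration change rather than a hardware purchase, which is why capacity should be the most tractable of the reported problems.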

In any software release, there are going to be bugs and/or spots of bad code that do not get caught during the testing phase, so some of that is to be expected. Some news sites have reported that some of the code that has been examined has been, well, not quite up to par or, in other words, not done by seasoned professional programmers. This leads me to speculate that the different components were coded by silo teams (different programming teams that work separately and independently from each other) that may not have had much guidance or direction from the system architects or overall technical leads. Once the silos finished their specific tasks, the assumption may have been that the different pieces of the puzzle would simply fall into their respective places. This could be one reason why the different components of the application are having trouble working with each other.
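One lightweight way silo teams can catch this class of integration failure before deployment is a contract check: the downstream team publishes the shape of the data it expects, and the upstream team validates its output against it. The sketch below is purely hypothetical; the component roles and field names are invented, since the portal's real interfaces are not public.

```python
# Hypothetical sketch: a minimal "contract" check between two
# independently built components. Field names and component roles
# are invented for illustration.

# What a downstream "eligibility" component declares it expects:
ELIGIBILITY_CONTRACT = {
    "applicant_id": str,
    "household_income": float,
    "state": str,
}

def validate_payload(payload: dict, contract: dict) -> list:
    """Return a list of contract violations; an empty list means the
    payload conforms to what the downstream component expects."""
    errors = []
    for field, expected_type in contract.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"wrong type for {field}: "
                          f"{type(payload[field]).__name__}")
    return errors

# A payload produced by an upstream "enrollment" silo:
payload = {"applicant_id": "A123", "household_income": "45000", "state": "VA"}
print(validate_payload(payload, ELIGIBILITY_CONTRACT))
# flags household_income as the wrong type (a string, not a float)
```

Checks like this run in each silo's own test suite, so a mismatch surfaces while the teams are still working independently, not after the pieces are assembled.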

What if the problem is a flaw in the architecture or the design itself? Without a solid base, the application is doomed to failure. When all the different components of the application have been added and are failing, you are left with log files flooded with error messages, and all the noise coming from the application components makes the troubleshooting process that much harder and more complex. Now you have to assume that nothing is correct, and amid all the noise, you must figure out a way to determine what will point you in the right direction.
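When every component is logging errors at once, a first triage step is to group the noise by component and sort by first occurrence, since cascading failures tend to trace back toward whichever component failed earliest. A minimal sketch of that idea follows; the log format and component names are assumptions invented for illustration.

```python
# Hypothetical sketch: triaging a flood of error logs by grouping
# messages per component and ordering by first occurrence. The log
# format and component names are invented for illustration.

from collections import defaultdict

LOG_LINES = [
    "2013-10-01T08:00:01 identity ERROR token service timeout",
    "2013-10-01T08:00:02 frontend ERROR login failed: no token",
    "2013-10-01T08:00:02 frontend ERROR login failed: no token",
    "2013-10-01T08:00:03 plans ERROR upstream identity unavailable",
]

def summarize(lines):
    """Return (component, error_count, first_seen) tuples, ordered by
    when each component first started failing."""
    first_seen = {}
    counts = defaultdict(int)
    for line in lines:
        timestamp, component, level, *_message = line.split()
        if level != "ERROR":
            continue
        counts[component] += 1
        first_seen.setdefault(component, timestamp)
    return sorted(
        ((c, counts[c], first_seen[c]) for c in counts),
        key=lambda item: item[2],
    )

for component, count, first in summarize(LOG_LINES):
    print(f"{first}  {component}: {count} error(s)")
```

In this invented example, the identity component fails first and the frontend and plans errors are downstream symptoms, which is exactly the kind of ordering a troubleshooting team would be trying to reconstruct from real logs.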

One of the latest things I have heard is that a “surge” of technical people will be brought in to help troubleshoot and resolve the issues. I can understand bringing in people to comb through all the different logs, piece together the various issues, and determine which issues are caused by other components, but I really have to question what exactly this “surge” of people is going to do. Any time you bring a new member onto a technical team, there is a learning curve that comes with being new. Depending on the complexity of the environment the new team member is joining, it could take three to six months, or even up to a year, for that person to get fully up to speed. That is just the way it is. So, given a large number of new people who will need time to climb that learning curve, and considering how complex this application really is, I believe it will be a little while before any real breakthrough happens. It sounds like they are also bringing in new people to take over the lead architect and design roles; these people will need time to develop a complete understanding of all aspects of the application.

So, when do you give up on the troubleshooting and start working on a redesign and rework? I am sure that is a question going through the minds of the technical surge team as they start to dig into the issues at hand. Either way, I am not having any warm and fuzzy feelings that the Affordable Care Act portal will be working any time soon.

I understand that this topic can be political in nature, but that is not the purpose of this post. I simply wanted to pose these questions: What if you got a call saying you were going to be part of the surge troubleshooting the application? What would you do? What would be your plan?

When all is said and done, this will be a great case study in what not to do when deploying cloud-based applications. I would not be surprised if this deployment ends up being discussed in college classrooms around the world.
