Troubleshooting from the Trenches

This day seemed to start like any other but it seems like as soon as I was logged in to start my day issues arose.  It seems like I lost one of my VMware 3.5 ESX servers and all the virtual machines on the host were knocked offline.  This should not have been a big deal since HA was enabled but, Murphy’s Law has a way of making life really interesting.  So as I logged into the vCenter client I noticed that the host in question was in a disconnected state and all the virtual machines showed up as disconnected.  In past experiences I have seen HA, during a host failure,  recover the virtual machines in under five minutes.  So I waited and waited thinking HA should have kicked in by now.  Time for a little further investigation!!

It seems the host was still network available as I was able to connect via ssh but was not able to connect via the vCenter client directly to the host itself.  With this in mind: the vCenter agent (vmware-vpxa) and/or management daemon (hostd) was my first area to explore. After restarting the vmware-vpxa and mgmt-vmware services  it still would not connect back up with vCenter.  It seems that once I tried to restart the vmware-vpxa service it would fail on the shutdown but would start the services successfully.  This tells me that the service most likely crashed right after starting.

In the meanwhile, the SLA clock is ticking and now alerts are really going out and the IM started to really go crazy.  First things first, I needed to stop the SLA clock and get the virtual machines back up and running so I restarted the host in question and watched HA do its magic as the virtual machines started on new hosts and I was able to close the tickets for the virtual machines being down.

Now the fun begins in figuring out what went wrong.

Once the host came back up I put it in Maintenance Mode to keep all the virtual machines in the cluster off of this host and then jumped in the logs and see what I could find.  In the vmkernel log I found something interesting:

StorageMonitor: 196: vmhba1:0:2:0 status = D:0x0/H:0x7 0x0 0x0 0x0

So now we are getting somewhere, seem like there is a storage issue going on but what does the status tell me? A quick Google search bring me to a great page on the VMTN community forums to reference.

“Status Message” Consists of four decimal and hex CODE blocks :”Device Status”/”Host Status” “Sense Key” “Additional Sense Code” “Additional Sense Code Qualifier” here is the log line’s explanation (24/0 0x0 0x0 0x0) :

24/0 dec : “Reservation Conflict” / “Host_OK”
0x0 hex : “No Sense Information”
0x0 hex : “No Sense Information”
0x0 hex : “No Sense Information”

Non zero “Device Status”/”Host Status” values indicate a problem.

There is a reservation issue, but the host reports OK. First thing that comes to mind is there is an issue with the communication between the host and the SAN.  I got the SAN team to take a look at the logs for the fiber switch as well as the SAN.  The SAN team reported that there was nothing in the logs except for the reboot that I forced.  This would seem to indicate that there is a problem with the fiber card but what was really bugging me was the fact that there were two fiber channel cards in the host and if there was a problem with one of the cards then I expected the host to failover communication to the other card which did not happen.  It seems that if the fiber channel card cannot get the instruction from the SAN to failover it will just keep retrying over and over again which was pointed out in the log file entries as seen below.

lpfc0:0712:FPe:SCSI layer issued abort device Data: x2 x0

FPe:SCSI timeout Data: xb6c8488 x98 x16c9bc42 x18d
44:06:00:00.058 cpu2:1396)<4>lpfc0:0754:FPe:SCSI timeout Data: xb6c8078 x98 x16c9bc47 x18e
44:06:00:00.058 cpu2:1396)<4>lpfc0:0754:FPe:SCSI timeout Data: xb716078 x98 x16c9bc47 x18f
44:06:00:00.058 cpu2:1396)<4>lpfc0:0754:FPe:SCSI timeout Data: xb716280 x98 x16c9bc49 x190
44:06:00:00.058 cpu2:1396)<4>lpfc0:0754:FPe:SCSI timeout Data: xb716488 x98 x16c9bc49 x191
44:06:00:00.058 cpu2:1396)<4>lpfc0:0754:FPe:SCSI timeout Data: xb716690 x98 x16c9bc4f x192

Hard lesson learned that even with redundancy in place, sometimes the outcome you expect does not happen.  HA was unable to kick in because as mentioned earlier the host was still network available and as far as HA goes the host was OK and the system sat with the vCenter agent crashed as well as the SCSI timeout.  Now time to schedule the change to update the firmware as well as replace the fiber channel card.

To troubleshoot your virtual environment you must know how to read the log files, and nothing will prepare you for this except experience. Reading log files is more a art form than a science in some cases, so practice makes perfect. If you have a known good host, and then compare that to a host with issues. These same issues can occur within VMware vSphere however the trigger strings are slightly different. Know thy logfiles and you will be able to troubleshoot most problems.

While we are at it here is the legend for SCSI Error Strings just to be able to present additional information to help others troubleshoot any issues they are having that are close:

Device Status (Decimal):

0 : No Errors
2 : Check Condition
8 : Device Busy
24 : Reservation Conflict

Host Status (Decimal):

0 : Host_OK
1 : Host No_Connect
2 : Host_Busy_Busy
3 : Host_Timeout
4 : Host_Bad_Target
5 : Host_Abort
6 : Host_Parity
7 : Host_Error
8 : Host_Reset
9 : Host_Bad_INTR
10 : Host_PassThrough
11 : Host_Soft_Error

Sense Key (Hex) :

0x0 : No Sense Information
0x1 : Last command completed but used error correction
0x2 : Unit Not Ready
0x3 : Medium Error
0x4 : Hardware Error
0x5 : ILLEGAL_REQUEST (Passive SP)
0x6 : LUN Reset
0x7 : Data_Protect – Access to data is blocked
0x8 : Blank_Check – Reached an unexpected region
0xa : Copy_Aborted
0xb : Aborted_Command – Target aborted command
0xc : Comparison for SEARCH DATA unsuccessful
0xd : Volume_Overflow – Medium is full
0xe : Source and Data on Medium do not agree

ASC/ASCQ (These are always in pairs, ASCQ usually 0, Hex) :

0x4 : Unit Not Ready
0x3 : Unit Not Ready – Manual Intervention Required
0x2 : Unit Not Ready – Initializing Command Required
0x29 : Device Power on or SCSI Reset

Posted in IT as a ServiceTagged , , , , , ,