A variety of factors can make a VM unresponsive. You might be facing problems with the applications or guest OS within the VM, problems with the VM monitor or virtual devices, resource contention on the host, or issues with the underlying storage or networking infrastructure.
Your environment might be a single host or a large vSphere cluster. In a cluster, you can simply evacuate the other VMs from the affected host and restart it to try to fix the problem. If you cannot restart the host, for example because it is the only one available for that VM, there are other approaches you can try.
It is important to have accurate symptoms and an understanding of the scope of the problem.
Always start by confirming that the VM really is unresponsive. To do this, open the VM console and work through these checks:
- Verify you can open the console. If not, you might see an error. In this case, the VM is most likely having an issue, or there is a network connectivity problem.
- Check whether the VM responds from time to time or not at all. Connectivity may be intermittent.
- Check whether there is any movement on the console. The VM may be running while the guest OS is unresponsive to input.
- Test keyboard and mouse input in the console. If the guest OS responds, the VM and guest OS are functioning but might not be reachable via the network or other channels.
- Try to use a ping command from another VM running on the same host, the same vSwitch, and the same network. If the VM has two network interface cards (NICs), you might have connectivity via one NIC, but the other might have problems. In this case, it might be network driver corruption within the guest OS.
- The VM might be running from the ESXi perspective (green icon), but the underlying OS isn't responding to a network ping, and the console is black. The VM might have gone black after a blue screen of death (BSOD).
- Check whether the host is responding and whether you can connect via vSphere Client or host client. The host might be running but have intermittent purple screen of death (PSOD) issues. This happened to me once in the lab, where I had to update the firmware on one of my NICs after upgrading the ESXi host to the most recent version.
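The ping check from the list above can be scripted when a VM has more than one NIC. This is only a minimal sketch, assuming a Linux helper VM on the same host, vSwitch, and network; the function name `check_nics` and the addresses in the usage line are hypothetical.

```shell
# Ping each address passed as an argument and report whether it answers.
# Assumes the Linux iputils ping flags: -c (packet count), -W (timeout in seconds).
check_nics() {
    for ip in "$@"; do
        if ping -c 3 -W 2 "$ip" >/dev/null 2>&1; then
            echo "$ip reachable"
        else
            echo "$ip unreachable"
        fi
    done
}
```

For example, `check_nics 192.0.2.10 192.0.2.11` would test both NICs of a VM. If only one address answers, the problem is more likely tied to that NIC or its driver than to the VM as a whole.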
Softly stop a VM
If you have remote access to your ESXi host, you can try to "kill" (stop) an unresponsive VM from the command line. The steps are as follows.
First, connect to your host with an SSH client (PuTTY, for example).
Then, to get a list of all VMs running on the host, use this command:
esxcli vm process list
For each running VM, the output shows details such as the World ID, Process ID, VMX Cartel ID, UUID, Display Name, and the path to the VMX config file.
We'll need to get the World ID number.
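On a host with many VMs, picking the World ID out of the list by eye is error-prone. The following is a rough sketch that looks it up by display name. It assumes the output has one indented `World ID:` line followed later by a `Display Name:` line for each VM, and that `awk` is available in the ESXi shell; the function name `world_id_for` is hypothetical.

```shell
# Print the World ID of the VM whose display name matches $1.
# Reads the output of `esxcli vm process list` on stdin.
# Assumption: within each VM's block, "World ID:" appears before "Display Name:".
world_id_for() {
    awk -v name="$1" '
        /^[ \t]*World ID:/ { wid = $NF }
        /^[ \t]*Display Name:/ {
            sub(/^[ \t]*Display Name:[ \t]*/, "")
            if ($0 == name) print wid
        }
    '
}
```

It could be used as `esxcli vm process list | world_id_for "MyVM"`, where `MyVM` stands in for the display name of the unresponsive VM.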
Then issue the following command to try to shut down the VM gracefully first (the World ID shown is an example):
esxcli vm process kill --type=soft --world-id=4308690
The --type option accepts three values:
- soft = gives the VM process a chance to shut down cleanly
- hard = stops the VM process immediately
- force = last resort that kills the VM if neither of the above works
If the command succeeds, it returns no output.
Once the VM has shut down, you can try to start it again.
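The escalation from soft to hard to force can be sketched as a small loop that checks after each attempt whether the VM is still listed. Treat this as an illustration under stated assumptions rather than a definitive procedure: the function name `stop_vm` and the default 30-second wait are my own choices, and the check assumes the `World ID:` line format described earlier.

```shell
# Try soft, then hard, then force, stopping as soon as the VM is gone.
# $1 = World ID, $2 = seconds to wait after each attempt (assumed default: 30).
stop_vm() {
    wid="$1"
    wait_secs="${2:-30}"
    for kill_type in soft hard force; do
        esxcli vm process kill --type="$kill_type" --world-id="$wid"
        sleep "$wait_secs"
        # The VM counts as stopped once its World ID no longer appears in the list.
        if ! esxcli vm process list | grep -q "World ID: $wid\$"; then
            echo "stopped with --type=$kill_type"
            return 0
        fi
    done
    echo "VM world $wid is still running" >&2
    return 1
}
```

Calling `stop_vm 4308690` would work through the three types in order. Moving to the next type only after a check keeps the more disruptive options for cases where the gentler ones fail.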
How many VMs are affected?
You might be in a situation where several VMs are affected. In that case, look for similarities between the affected VMs to narrow the potential scope.
Check the shared infrastructure (shared storage, for example) behind the group of VMs, and verify that the affected VMs really do share it.
When a single VM experiences problems, it's most likely an OS problem, such as corruption of the guest OS due to a virus or ransomware, but not always. In some situations, the problem might be visible on a single VM, but over time, it affects other VMs as well.
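One quick way to look for a storage link between affected VMs is to list which datastore each running VM's config file lives on. The sketch below assumes the `Config File:` path in the `esxcli vm process list` output starts with `/vmfs/volumes/<datastore>/`, where the datastore usually appears as a UUID rather than a friendly name; `vm_datastores` is a hypothetical helper name.

```shell
# Print "<datastore><TAB><display name>" for every running VM, sorted by datastore.
# Reads the output of `esxcli vm process list` on stdin.
# Assumption: "Display Name:" precedes "Config File:" within each VM's block.
vm_datastores() {
    awk '
        /^[ \t]*Display Name:/ {
            sub(/^[ \t]*Display Name:[ \t]*/, "")
            name = $0
        }
        /^[ \t]*Config File:/ {
            split($0, part, "/")      # part[4] is the component after /vmfs/volumes/
            print part[4] "\t" name
        }
    ' | sort
}
```

Run as `esxcli vm process list | vm_datastores`. If all affected VMs sit next to the same datastore identifier, the shared storage is a good place to look next. This covers only the datastore holding the VMX file; virtual disks on other datastores would need a separate check.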
Intermittent problems with an unresponsive VM
In this scenario, the VM works as expected, but from time to time, you experience performance issues, connectivity issues, or the VM does not respond at all.
One of the first things to verify is whether your backup software is running a backup during the unresponsive periods.
If it is, a snapshot operation could be stunning your VM while the memory state is copied across the network or to disk. The stun can last for an extended period, depending on the size of the file to copy and the performance of the underlying storage infrastructure.
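To see whether stuns line up with the unresponsive periods, you can search the VM's vmware.log for stun durations. The following sketch assumes the log records each stun as a line containing `Checkpoint_Unstun: vm stopped for <N> us`. That format is an assumption to verify against your own log, and `long_stuns` is a hypothetical helper name.

```shell
# Print vmware.log lines whose stun time is at least $1 microseconds.
# Reads the log on stdin.
# Assumption: stun lines contain "Checkpoint_Unstun: vm stopped for <N> us".
long_stuns() {
    awk -v min="$1" '
        /Checkpoint_Unstun: vm stopped for/ {
            us = 0
            for (i = 1; i < NF; i++) {
                if ($i == "for") { us = $(i + 1); break }
            }
            if (us + 0 >= min + 0) print
        }
    '
}
```

For example, `long_stuns 1000000 < vmware.log` would list stuns of one second or longer. Comparing their timestamps with the backup schedule shows whether the two coincide.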
Final words
The reasons why your VM might experience problems vary widely. During troubleshooting, try to isolate problems and group similar ones together to find a common thread that leads you to the real underlying cause. You can always contact VMware support and provide them with the information you collected during troubleshooting.