ESXTOP is a built-in command-line tool available in the ESXi hypervisor. It is the key tool to troubleshoot performance issues since it provides real-time information about resources used on your VMs and ESXi host. If you are not yet familiar with the tool, I recommend checking out my first post ESXTOP: Analyzing vSphere performance where you can learn how to use ESXTOP and what its possibilities are.
To launch the tool, simply type the esxtop command in your Secure Shell (SSH) session. Views are available for different resource types, like CPU, memory, storage, and network. Each view shows different counters related to specific resource types.
To identify the root cause of your performance issue, you need to know more details about the most important counters and their thresholds. Today we will cover CPU and memory indicators.
The CPU view is the default one when you launch ESXTOP. You can also access it by pressing the c key anytime. It shows several global statistics on top, like the number of VMs running, the number of virtual CPUs (vCPUs) assigned to those VMs, and the physical CPU load. Refer to my first post about ESXTOP for more details.
PCPU UTIL(%) and PCPU USED(%)
These global counters show the actual load on each physical CPU (pCPU) and the average across all of them. While UTIL(%) shows how long the pCPU was busy (unhalted), USED(%) shows the so-called "effective work." The effective work can be lower for two main reasons:
- Hyper-threading: if both pCPUs in the core are busy, the overall work done is less than if one of them is idle.
- Power management: if the CPU frequency is reduced due to power savings, the effective work is also reduced.
Values close to 100% for all pCPUs would indicate that you are overcommitting CPU resources. To confirm this, you also need to check the %RDY values for VM worlds (described below). If only some pCPUs are close to 100%, but others are not, you might have set CPU Affinity settings for some VMs, and this might cause imbalance.
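The reasoning above can be sketched as a small Python check. This is only an illustration: the function name, the sample values, and the 90% cut-off for "close to 100%" are assumptions, not values reported or recommended by ESXTOP.

```python
def classify_pcpu_load(util_percent, busy_threshold=90.0):
    """Classify per-pCPU UTIL(%) readings along the lines described above.

    util_percent: list of UTIL(%) values, one per pCPU (hypothetical input).
    busy_threshold: assumed cut-off for "close to 100%", not an official value.
    """
    if not util_percent:
        raise ValueError("need at least one pCPU reading")
    busy = [value >= busy_threshold for value in util_percent]
    if all(busy):
        return "all pCPUs busy: possible CPU overcommitment, check %RDY"
    if any(busy):
        return "some pCPUs busy: possible imbalance, check CPU affinity"
    return "no pCPU near saturation"


# Hypothetical readings copied by hand from the PCPU UTIL(%) line.
print(classify_pcpu_load([97.0, 99.0, 95.0, 98.0]))
print(classify_pcpu_load([98.0, 12.0, 9.0, 96.0]))
```

The result is only a hint about where to look next; the %RDY values of the VM worlds are still needed to confirm overcommitment.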
Now let's look at VM CPU counters, which indicate possible performance issues. Please note that you should always expand the VM group using the e key followed by the group identifier (GID) value to see detailed information for all VM worlds, especially if you have VMs with multiple vCPUs.
%RDY
This is the percentage of time the world (vCPU) wants to progress, but it is waiting for the CPU scheduler to schedule it. The three main reasons are:
- ESXi host overcommitment: there are a lot of vCPUs provisioned to the VMs and not enough pCPUs. In this case, you should also see PCPU counters close to 100% as explained above.
- Excessive vSMP usage: a good example is a Microsoft SQL Server VM with multiple vCPUs assigned. SQL Server runs queries with parallelism, but the host does not have enough resources (pCPUs) to schedule them.
- CPU limit configured: there is a CPU limit configured for the VM, but the VM wants to use more resources. The %MLMTD counter would indicate this as described below.
In general, you want to have %RDY close to 0 at all times. Values above 10% (per world!) would mean you are running into CPU contention. Based on the contention reason, you might need to migrate the affected VM (or other VMs) to another ESXi host with more CPU resources or remove the limit.
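To illustrate why the threshold is applied per world, here is a minimal Python sketch. The world names and readings are made up, and the helper functions are hypothetical; only the 10% threshold comes from the text.

```python
RDY_THRESHOLD = 10.0  # percent per world, the threshold given in the text


def worlds_over_threshold(rdy_by_world, threshold=RDY_THRESHOLD):
    """Return only the worlds whose %RDY is above the threshold."""
    return {world: rdy for world, rdy in rdy_by_world.items() if rdy > threshold}


def average_rdy(rdy_by_world):
    """Average %RDY across worlds, shown only to illustrate what averaging hides."""
    return sum(rdy_by_world.values()) / len(rdy_by_world)


# Hypothetical readings for a four-vCPU VM after expanding its group with e.
sample = {"vcpu-0": 2.1, "vcpu-1": 14.3, "vcpu-2": 0.4, "vcpu-3": 11.0}

print(worlds_over_threshold(sample))  # the two worlds above 10%
print(round(average_rdy(sample), 2))  # below 10 even though two worlds are contended
```

In this made-up sample the average looks healthy while two vCPUs are over the threshold, which is why looking at each world matters.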
%MLMTD
Anything above 0 here means there is a CPU limit configured for the VM, but the VM requires more CPU cycles. In this case, you will also see high %RDY and %CSTP values. Increase or remove the configured limit.
%CSTP
This counter applies only to VMs with multiple vCPUs assigned and indicates issues with coscheduling those. Again, a good example is SQL Server, which executes queries across multiple vCPUs, but the VM is waiting for their availability.
An opposite example is a VM that has multiple vCPUs assigned but is running only a single-threaded application. If a VM has two vCPUs, and the application is using one vCPU to 100% while the other is idling, the CPU scheduler will penalize the VM.
In both cases, you might need to reduce the number of vCPUs assigned. Of course you need to do this carefully while also checking guest OS performance, that is, checking whether the reduced number of cores will be enough.
The threshold here is 3%.
Note: Snapshots are another possible cause of high %CSTP values and the resulting performance impact. With a snapshot on a VM, the VMkernel has to go through several files on every read to find the correct data, and it co-stops the CPU until the I/O operation completes. Remember: snapshots should not exist for more than a few days in a production environment.
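The CPU counters discussed in this section can be combined into one rough triage helper. The Python sketch below is an assumption about how such a check might look, not an existing tool: the function name and the has_snapshot flag are hypothetical, and the thresholds (10% for %RDY, 0 for %MLMTD, 3% for %CSTP) are the ones quoted in this article.

```python
def cpu_triage(rdy, mlmtd, cstp, has_snapshot=False):
    """Map one world's CPU counters to the checks suggested in the text.

    rdy, mlmtd, cstp: %RDY, %MLMTD and %CSTP readings for a single world.
    has_snapshot: hypothetical flag, True if the VM currently has a snapshot.
    """
    findings = []
    if mlmtd > 0:
        findings.append("CPU limit is throttling the VM: increase or remove the limit")
    if rdy > 10:
        findings.append("CPU contention: check host overcommitment and vCPU count")
    if cstp > 3:
        findings.append("co-scheduling problem: review the number of vCPUs")
        if has_snapshot:
            findings.append("snapshot present: check whether it can be removed")
    return findings


# Hypothetical readings for two worlds.
print(cpu_triage(rdy=1.0, mlmtd=0.0, cstp=0.5))
print(cpu_triage(rdy=15.0, mlmtd=4.0, cstp=5.0, has_snapshot=True))
```

The helper only restates the rules of thumb; guest OS performance still has to be checked before changing the vCPU count.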
Memory (RAM) counters
Troubleshooting memory issues is a lot easier than troubleshooting CPU issues. It is normal to run machines with more vCPUs assigned than the host actually has. With memory it's the opposite. You don't really want to assign more memory to VMs than you actually have available because this will force the host to use swapping, ballooning, or memory compression, usually resulting in drastic performance degradation. Most memory issues indicate that the host is either overcommitted or there is a memory limit set on the problematic VM.
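As a quick sanity check of memory overcommitment, the configured VM memory can be compared with the host's physical memory. The Python sketch below uses made-up numbers and a hypothetical function name, and it ignores hypervisor overhead and memory sharing, so treat the result as a rough indicator only.

```python
def memory_overcommit_ratio(vm_memory_gb, host_memory_gb):
    """Ratio of memory assigned to VMs to physical host memory.

    vm_memory_gb: list of configured memory sizes, one per VM (hypothetical input).
    A ratio above 1 means more memory is assigned than the host physically has.
    """
    if host_memory_gb <= 0:
        raise ValueError("host memory must be positive")
    return sum(vm_memory_gb) / host_memory_gb


# Hypothetical host with 128 GB RAM running four VMs.
ratio = memory_overcommit_ratio([32, 32, 48, 24], 128)
print(ratio)      # 136 GB assigned on a 128 GB host
print(ratio > 1)  # True: the host is overcommitted on paper
```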
Press the m key to access memory statistics. Here you will need to use the f key to add some counters since they do not display by default.
%SWPWT
For some reason, this indicator shows up in the CPU view, but in fact it relates to memory.
It indicates the VM is waiting for swapped memory pages to load from disk. The threshold value is 5%.
MCTLSZ
A value larger than 0 indicates that VMs are forced to balloon memory. Ballooning is when the host reclaims memory from a VM's guest operating system through the VMware Tools balloon driver so that the memory can be given to other VMs.
SWR/s and SWW/s
A value higher than 1 means the host is either reading or writing to the swap file.
ZIP/s and UNZIP/s
These indicate the host is compressing or is accessing compressed memory. The threshold is 0.
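The memory thresholds covered so far can be collected into one table-driven check. This Python sketch assumes the readings are typed in by hand under the counter names esxtop displays (%SWPWT, MCTLSZ, SWR/s, SWW/s, ZIP/s, UNZIP/s); the rule table and function are hypothetical helpers, and the thresholds are the ones quoted in this article.

```python
MEMORY_RULES = [
    # (counter, threshold, meaning when the value is above the threshold)
    ("%SWPWT", 5.0, "VM is waiting for swapped pages to load from disk"),
    ("MCTLSZ", 0.0, "balloon driver is reclaiming memory from the guest"),
    ("SWR/s", 1.0, "host is reading from the swap file"),
    ("SWW/s", 1.0, "host is writing to the swap file"),
    ("ZIP/s", 0.0, "host is compressing memory"),
    ("UNZIP/s", 0.0, "host is accessing compressed memory"),
]


def check_memory(sample):
    """Return a finding for every counter in sample that exceeds its threshold.

    sample: dict mapping counter names to readings; missing counters are skipped.
    """
    findings = []
    for counter, threshold, meaning in MEMORY_RULES:
        value = sample.get(counter)
        if value is not None and value > threshold:
            findings.append(f"{counter}={value}: {meaning}")
    return findings


# Hypothetical readings for one VM.
for finding in check_memory({"MCTLSZ": 512.0, "SWR/s": 0.5, "ZIP/s": 3.0}):
    print(finding)
```

Keeping the rules in a table makes it easy to adjust a threshold without touching the logic.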
N%L
Values below 80 mean you are breaking non-uniform memory access (NUMA) placement. For example, your host may have 2 physical CPUs with 64 GB RAM attached to each CPU (128 GB total), but your VM has 96 GB RAM. In this case, a single CPU can't access all of this memory directly and has to use an interconnect channel between the CPUs, which is much slower than direct memory access.
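The arithmetic behind the 96 GB example can be sketched in Python. The function names are hypothetical, and the calculation assumes that memory is split evenly across NUMA nodes and that nothing else uses the node, so it gives an upper bound rather than what ESXTOP would actually report.

```python
def numa_node_size_gb(host_memory_gb, numa_nodes):
    """Memory per NUMA node, assuming an even split across nodes."""
    return host_memory_gb / numa_nodes


def max_local_percent(vm_memory_gb, host_memory_gb, numa_nodes):
    """Upper bound on the share of VM memory that can sit on a single node."""
    node_gb = numa_node_size_gb(host_memory_gb, numa_nodes)
    return min(100.0, node_gb / vm_memory_gb * 100.0)


# The example from the text: 2 CPUs, 128 GB total, one VM with 96 GB.
print(numa_node_size_gb(128, 2))               # 64.0 GB per node
print(round(max_local_percent(96, 128, 2), 1))  # 66.7, so at least a third is remote
print(max_local_percent(48, 128, 2))           # 100.0, the VM fits in one node
```

A VM sized at or below the node size can in principle keep all of its memory local, which is the reason to size large VMs with the NUMA node in mind.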
Troubleshooting poor VM performance is not always an easy task. There might be an issue with the application, the guest operating system, or the host. After reading this article, you should be able to diagnose the most common performance issues: CPU and memory overcommitment.