This article offers tips on performance monitoring of virtual servers and demonstrates how to watch performance counters with Smart-X ControlUp.
This post was sponsored by Smart-X Software Solutions.
Performance monitoring and management on virtualized computers presents several challenges to sys admins:
- In large organizations, virtual machines tend to spawn at a barely controllable rate, and hardware capacity planning often cannot keep pace with the resulting growth.
- Keeping an eye on performance counters on multiple machines is a lot of work. Although numerous methods and tools are available, sometimes there’s a need for a multi-target “Task Manager” to oversee a large environment.
- Performance degradation might come and go, making it even harder to pinpoint the root cause. That’s where visibility and real-time control are crucial.
Although possible root causes are countless, three suspects account for the most common performance issues in nearly every enterprise:
Suspect #1 – CPU Stress
Perhaps the most basic and trivial measure of performance is CPU utilization, which often spikes and results in slow response times for users. The challenge here is twofold – locating the CPU-intensive machine and pinpointing the application that is responsible for the high processor load.
In this example, we use ControlUp’s information grid to sort machines by CPU usage, which quickly brings to our attention those machines currently under CPU stress.
OK, now what? Traditional tools, such as Task Manager, allow the admin to locate the process that is currently consuming excessive processor cycles and… well, here the options become limited. Killing the problematic process will surely end the crisis, but how do we gain some insight into the issue before blindly terminating its cause?
With ControlUp, you can request a screenshot of the user session in which the process runs, thus answering the question “What is happening exactly?”
ControlUp sessions screenshot
For example, a systems administrator overseeing a large Terminal Server farm was concerned about high CPU usage by a ubiquitous process named IEXPLORE.EXE. Before acting on the issue, the admin wanted to know what exactly the users were doing that caused Internet Explorer to devour CPU power. A requested screenshot of the users’ activity revealed that the high CPU usage was due to work-unrelated Flash-based websites.
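If you want a scripted equivalent of Task Manager’s process view, the following minimal sketch shows one way to do it in Python with the third-party psutil library. This is not part of ControlUp, just an illustration of the same idea, and it covers only one machine at a time:

```python
# A minimal sketch using the third-party psutil library: list the top
# CPU-consuming processes, roughly what Task Manager's process tab shows.
import time
import psutil

def top_cpu_processes(n=5, interval=1.0):
    """Return the n processes that consumed the most CPU over `interval` seconds."""
    procs = list(psutil.process_iter(['pid', 'name']))
    for p in procs:
        try:
            p.cpu_percent(None)  # prime the counter; the first call always returns 0.0
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
    time.sleep(interval)  # measure over a real time window
    samples = []
    for p in procs:
        try:
            samples.append((p.cpu_percent(None), p.info['pid'], p.info['name']))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
    return sorted(samples, reverse=True)[:n]

for cpu, pid, name in top_cpu_processes():
    print(f'{cpu:6.1f}%  pid={pid:<6}  {name}')
```

Run on a busy terminal server, this immediately shows whether IEXPLORE.EXE (or anything else) is the culprit.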
Suspect #2 – Storage / Storage communication
Whether your hosts are connected to the storage using Fibre Channel or iSCSI, and whether you use SSD, SAS, or SATA drives, you probably face storage performance issues from time to time.
But before opening a support ticket with your storage vendor, you should gather some evidence that will make the support engineer actually look for the problem, rather than dismiss you by saying “The storage works perfectly”.
The best way to substantiate storage problems is to measure the ‘Avg. Disk Queue Length’ counter on your virtual machines. To pinpoint whether the bottleneck affects disk writes or reads (or possibly both), take a look at the ‘Avg. Disk Read Time’ and ‘Avg. Disk Write Time’ counters.
However, since the shortage may be localized or system-wide, you must look at this counter on as many virtual machines as possible concurrently. While troubleshooting performance issues, many useful insights come from answering the following question: “What is common to all machines currently experiencing the issue?”
The answer may point to a faulty cache disk on one of the storage arrays, a virtualization host gone bad or a memory leak in some esoteric application.
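If you want to spot-check a few machines without a dedicated tool, here is a rough Python sketch built on pywin32’s win32pdh module. It assumes Windows, the pywin32 package, and an environment that permits remote performance data collection; the host names are hypothetical:

```python
# A rough sketch using pywin32's win32pdh: sample 'Avg. Disk Queue Length'
# on several machines concurrently. Host names below are hypothetical, and
# remote performance data collection must be allowed in your environment.
import time
import win32pdh

MACHINES = ['TS01', 'TS02', 'TS03']  # hypothetical host names

query = win32pdh.OpenQuery()
counters = {
    machine: win32pdh.AddCounter(
        query, f'\\\\{machine}\\PhysicalDisk(_Total)\\Avg. Disk Queue Length')
    for machine in MACHINES
}

# PDH derives this counter from two samples, so collect twice.
win32pdh.CollectQueryData(query)
time.sleep(1)
win32pdh.CollectQueryData(query)

for machine, handle in counters.items():
    _, value = win32pdh.GetFormattedCounterValue(handle, win32pdh.PDH_FMT_DOUBLE)
    print(f'{machine}: Avg. Disk Queue Length = {value:.2f}')

win32pdh.CloseQuery(query)
```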
In order to demonstrate some typical conditions, we will use ControlUp, which is especially handy when you need to compare performance counters from multiple machines.
An obvious I/O performance issue
Another look at the metrics above makes it easy to tell that in this case, disk read times are generally higher than disk write times. That means these servers are demanding data from disks that are too slow to deliver it in time. The bottleneck here is rather severe: if these are terminal servers, users will experience slow logins; if they are workstations, users will wait forever for their favorite apps to launch.
Disk performance monitoring is complex, and different vendors or experts may recommend different counters to watch. Generally, though, if the average read/write times are high, there is no excuse other than degraded storage performance (or a genuinely massive demand for storage resources, which is also something you would want to know about).
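To make the measurement concrete, here is a minimal sketch that approximates average per-operation read and write times from psutil’s cumulative disk I/O counters. It is a rough stand-in for the perfmon counters above, not ControlUp’s implementation:

```python
# A minimal sketch: approximate average read/write latency per operation
# from psutil's cumulative disk I/O counters (read_time and write_time
# are reported in milliseconds).
import time
import psutil

def avg_disk_latency(interval=5.0):
    before = psutil.disk_io_counters()
    time.sleep(interval)
    after = psutil.disk_io_counters()
    reads = after.read_count - before.read_count
    writes = after.write_count - before.write_count
    avg_read = (after.read_time - before.read_time) / reads if reads else 0.0
    avg_write = (after.write_time - before.write_time) / writes if writes else 0.0
    return avg_read, avg_write

r, w = avg_disk_latency()
print(f'avg read: {r:.1f} ms/op, avg write: {w:.1f} ms/op')
```

If reads consistently take tens of milliseconds while writes stay low (or vice versa), you have exactly the kind of evidence that gets a storage vendor’s attention.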
Suspect #3 – Memory shortage on the host
Physical memory is becoming cheaper, and new servers ship with ever larger amounts of RAM. However, physical memory is still a bottleneck, especially on multi-user systems. Since Windows Server 2008, Task Manager always seems to show a memory shortage, even when the system is functioning perfectly well. The reason is that the operating system uses memory more efficiently: it caches everything it might need and does not evict those caches as long as enough RAM remains available.
Therefore, looking at the %Free RAM counter doesn’t say much. The system only starts to slow down when it has to go to the hard disk and page memory in from there.
The counter that best reflects these paging operations, known as hard page faults, is ‘Pages/sec’. When a computer reports a large number of hard page faults, the entire system experiences delays.
An obvious memory shortage
Notice that the “Memory Utilization” metric supports our hypothesis: almost all memory is in use.
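To watch this yourself outside of ControlUp, a small sketch with win32pdh (Windows and pywin32 assumed) can track Pages/sec alongside the remaining available memory:

```python
# A sketch (Windows + pywin32 assumed): watch Memory\Pages/sec, which
# reflects hard page faults, alongside the remaining available memory.
import time
import win32pdh

query = win32pdh.OpenQuery()
pages = win32pdh.AddCounter(query, r'\Memory\Pages/sec')
avail = win32pdh.AddCounter(query, r'\Memory\Available MBytes')

win32pdh.CollectQueryData(query)  # the first sample only establishes a baseline
for _ in range(10):
    time.sleep(1)
    win32pdh.CollectQueryData(query)
    _, pps = win32pdh.GetFormattedCounterValue(pages, win32pdh.PDH_FMT_DOUBLE)
    _, mb = win32pdh.GetFormattedCounterValue(avail, win32pdh.PDH_FMT_LONG)
    print(f'Pages/sec: {pps:8.1f}   Available MBytes: {mb}')

win32pdh.CloseQuery(query)
```

Sustained high Pages/sec combined with low available memory is the signature of the shortage described above.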
When you encounter this kind of problem, add physical RAM to the affected machines. If the host itself has run out of RAM, move some virtual machines to other hosts or upgrade the host’s memory.
With virtualized environments, the number of factors that affect performance grows to include storage solutions, network performance, VM density, and various optimization mechanisms. All of these make comparisons of resource consumption between systems at different sites potentially unreliable.
Although some rules of thumb and estimated thresholds are given here for detecting performance bottlenecks, these numbers should never be treated as absolutes. When monitoring complex systems, the key concept is the baseline. Without knowing how your machines perform under typical load, at peak as well as off-peak times, you will never know which thresholds to establish in order to detect deviations from the normal state of affairs.
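As a simple illustration of the baseline idea, the sketch below learns a threshold from historical samples of a counter and flags deviations; the sample values are made up for the example:

```python
# A simple illustration of baselining: derive a threshold from historical
# samples of a counter, then flag readings that deviate from the norm.
from statistics import mean, stdev

def make_threshold(samples, k=3.0):
    """Threshold = baseline mean plus k standard deviations."""
    return mean(samples) + k * stdev(samples)

# Hypothetical Pages/sec samples collected under typical load:
baseline = [120, 95, 160, 110, 140, 105, 130, 150, 100, 125]
limit = make_threshold(baseline)

current = 900  # a hypothetical new reading
if current > limit:
    print(f'Pages/sec {current} exceeds baseline threshold {limit:.0f}, investigate')
```

However you collect the samples, the point is the same: the threshold comes from your own environment’s normal behavior, not from a universal magic number.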