It might be surprising, but even with modern server hardware, super-fast CPUs and all the great virtualization features, the most common bottleneck you see in RDS and XenApp environments is still the good old CPU bottleneck.
This post was sponsored by Smart-X. Author: Yoni Avital, CTO Smart-X
Memory is quite cheap today, and you can easily have 2GB of RAM for each user even when running 100 concurrent users on a single server (196GB of RAM is a common configuration these days). I/O is usually not an issue in RDS/XenApp environments either, especially since SSD drives became affordable and all the fuss about VDI storage issues raised awareness of IOPS and their importance when virtualizing Windows desktops. So we are left with CPU bottlenecks and peaks, especially in environments where the end users can browse the Internet directly from the RDS/XenApp session.
One of the problems with a CPU bottleneck on an RDS server is that a 100% CPU peak immediately affects all users on the server and renders the server unresponsive. There are tools like AppSense Performance Manager and RES Workspace Manager that can optimize CPU usage and prevent 100% CPU peaks, and you should definitely consider these if you are facing constant CPU issues. The aim of this article, however, is to show you how to troubleshoot a CPU peak and find its root cause.
Traditionally, most XenApp and RDS monitoring tools are able to identify when a server is crossing a CPU threshold and send an alert to the system admin for further investigation. Unfortunately, these tools don’t usually show which process or processes caused the high CPU peak and by the time the sys admin logs on to the RDS server and opens Task Manager, the offending process might no longer be seen. This is where ControlUp comes into the picture. ControlUp is a real-time management console specifically designed for RDS and XenApp environments, containing three important features that can help troubleshoot CPU peaks and find the root cause of the issue:
The first feature, ControlUp real-time performance view, allows you to quickly pinpoint the process or application screen that causes the high CPU usage. Consider the following example:
The sys admin receives an alert about high CPU usage on server CUXEN65TS14 (a XenApp 6.5 server). The admin then switches to the ControlUp Computers View and sees the following screen:
We can quickly see the server is currently utilizing 90% CPU. In order to investigate this issue, we can simply double-click this server and drill down to the Sessions view:
Now we can see that user “Administrator” is consuming a lot of CPU cycles. Let’s double-click this session in order to drill down to the Processes View:
Here we can see that iexplore.exe is the culprit. However, this information is usually insufficient, because Internet Explorer processes can be used to launch internal web applications as well as to browse public Internet sites. To complete the investigation, let’s right-click on iexplore.exe and launch ControlUp’s “Get Session Screenshot” action to take a closer look at what’s actually happening inside the user’s session:
So now we know the end user wanted to watch the Robocop 2014 official trailer, thus causing a major CPU peak on our RDS server.
We all know that Flash-based web sites are CPU hogs in RDS environments, and this is a great way to find exactly which sites are causing issues in your farm. The exact same logic applies to other applications, allowing you to correlate specific application screens with CPU peaks.
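The drill-down above (server, then session, then process) boils down to two small aggregations over per-process samples. Here is a minimal sketch of that logic in Python. It is not how ControlUp is implemented; the sample data, the tuple layout and the function names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical process samples for one server: (session user, process name, CPU %).
# The values are invented for illustration and are not taken from ControlUp.
samples = [
    ("Administrator", "iexplore.exe", 71.0),
    ("Administrator", "explorer.exe", 1.5),
    ("jsmith", "winword.exe", 4.0),
    ("jsmith", "outlook.exe", 3.0),
]

def cpu_by_session(rows):
    """Sum CPU % per session user (the 'Sessions view' step)."""
    totals = defaultdict(float)
    for user, _process, cpu in rows:
        totals[user] += cpu
    return dict(totals)

def top_process(rows, user):
    """Return the (process, cpu) pair using the most CPU in one session."""
    candidates = [(proc, cpu) for u, proc, cpu in rows if u == user]
    return max(candidates, key=lambda pair: pair[1])

sessions = cpu_by_session(samples)
busiest_user = max(sessions, key=sessions.get)
print(busiest_user, top_process(samples, busiest_user))
```

The point of the two-step approach is that the busiest session is found first, so the process list you then inspect is already narrowed down to one user.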
We will now cover two additional ControlUp features that can help find the root cause of CPU issues in your environment. This time, we’ll focus on analyzing CPU peaks that occurred in the past and were captured by ControlUp when you weren’t looking at the real-time display.
The first feature, ControlUp Reporter, is a free utility that allows users to analyze the CSV log files exported from ControlUp using the “Scheduled Export” feature. Using the “Scheduled Export” feature, you can export every ControlUp view to a file folder based on a pre-configured interval:
Once your export rules are configured correctly, you can launch the ControlUp Reporter, point it to the relevant file folder and start creating historical reports. Let’s review two reports that can help analyze CPU issues:
The ‘CPU Usage by Computer’ report will produce a graph of the CPU usage over time:
We can then use the ‘Computer Processes by Sample’ report to see the process list at a certain point in time:
As you can see, the CPU peak in this case was mainly caused by the HP InsightServer Agent. Interestingly enough, this is a virtual machine that was migrated using a P2V tool, and the customer simply forgot to remove the physical hardware agents from the virtual image… In this case, simply uninstalling the unneeded software agent saved the customer a substantial amount of precious CPU cycles.
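If you prefer to script this kind of analysis yourself, the same two questions (when was the peak, and which processes were running at that moment) can be answered directly from exported CSV data. The sketch below is only an illustration: the column names (Timestamp, Computer, Process, CPU) and the sample rows are assumptions, not ControlUp’s actual export schema, so adjust them to match your own files.

```python
import csv
import io

# Hypothetical export layout: the column names below are assumptions for
# illustration, not ControlUp's actual CSV schema.
SAMPLE_CSV = """Timestamp,Computer,Process,CPU
2013-10-01 10:00:00,SRV01,agent.exe,62.0
2013-10-01 10:00:00,SRV01,iexplore.exe,12.5
2013-10-01 10:00:00,SRV01,winword.exe,3.0
2013-10-01 10:05:00,SRV01,agent.exe,5.0
2013-10-01 10:05:00,SRV01,iexplore.exe,8.0
"""

def load_rows(text):
    """Parse CSV text into dicts, converting the CPU column to float."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["CPU"] = float(row["CPU"])
    return rows

def peak_sample(rows):
    """Return the timestamp whose summed process CPU is highest."""
    totals = {}
    for row in rows:
        totals[row["Timestamp"]] = totals.get(row["Timestamp"], 0.0) + row["CPU"]
    return max(totals, key=totals.get)

def top_processes(rows, timestamp, n=3):
    """Return the n heaviest (process, cpu) pairs at one timestamp."""
    at_time = [(r["Process"], r["CPU"]) for r in rows if r["Timestamp"] == timestamp]
    return sorted(at_time, key=lambda pair: pair[1], reverse=True)[:n]

rows = load_rows(SAMPLE_CSV)
peak = peak_sample(rows)
print(peak, top_processes(rows, peak))
```

To run it against a real export, read the file contents into a string and pass that to load_rows instead of SAMPLE_CSV.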
The second feature, to be included in ControlUp 2.5, scheduled for beta release during Q4 2013, is called ControlUp Follow-up Actions. This feature will allow the end user to configure follow-up actions that run after an alert threshold is breached. For example, you can configure an alert trigger that fires whenever the CPU usage on an RDS server is higher than 90% for more than 60 seconds. Once this threshold is crossed, the new “Follow-up Action” feature can dump the relevant ControlUp views to disk:
ControlUp views are saved in CSV format and can be opened in MS Excel. This allows you to do a post-mortem analysis of the issue and find exactly which processes caused the CPU peak.
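The trigger condition in the example above (CPU higher than 90% for more than 60 seconds) is worth spelling out, because it is what separates a sustained peak from a momentary spike. Here is a minimal sketch of that check. It is an illustration of the idea only, not ControlUp’s implementation; the function name, its parameters and the sample readings are hypothetical.

```python
def sustained_breach(samples, threshold=90.0, min_seconds=60):
    """Return the time at which CPU has stayed above threshold for more
    than min_seconds, or None if that never happens.

    samples: iterable of (seconds, cpu_percent) pairs in ascending time order.
    """
    breach_start = None
    for t, cpu in samples:
        if cpu > threshold:
            if breach_start is None:
                breach_start = t  # first reading of a new breach
            if t - breach_start > min_seconds:
                return t  # breach has lasted long enough to trigger
        else:
            breach_start = None  # CPU dropped back, so reset the timer
    return None

# Hypothetical readings: (seconds since start, CPU %).
brief_spike = [(0, 50), (10, 95), (20, 97), (30, 40), (40, 96)]
long_peak = [(0, 95), (30, 96), (60, 99), (70, 98)]
print(sustained_breach(brief_spike))
print(sustained_breach(long_peak))
```

Resetting the timer whenever a reading falls back under the threshold is a design choice: it keeps short bursts from triggering a dump, at the cost of ignoring peaks that dip briefly below 90%.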
To summarize, CPU peaks are still quite common in RDS and XenApp (soon XenDesktop 7) environments. Due to the nature of most CPU issues, it is important to understand which Internet sites or application screens are causing the CPU spikes and to be able to analyze these issues in real time. In this article we have shown how ControlUp can assist in troubleshooting and ultimately preventing CPU peaks, thus leading to a better user experience.