Too often, IT administrators rely purely on ticket submissions to determine whether Windows clients work reliably. What I want to show you here is how you can use PowerShell to monitor system stability on multiple computers by tracking reliability indicators such as application crashes, hanging applications, and blue screens of death (BSODs).

Micah Rairdon

I am a PowerShell fanatic currently working in Windows client management at Haworth based out of Holland, MI. I tinker around and document my findings at tiberriver256.github.io and you can find me on Twitter @tiberriver256.

With the little PowerShell script discussed below, you can remotely retrieve reliability information and visualize the data with the free PowerBI Desktop tool.

Visualizing system stability with PowerBI Desktop

However, before I detail the PowerShell solution, let's look at the method that admins typically use when they want to monitor the stability of a Windows computer. This will help us to understand what kind of reliability data is available.

Reliability Monitor ^

Reliability Monitor is a handy little tool built into Windows since Vista. The tool contains a whole lot of helpful information when it comes to troubleshooting a Windows computer. It can be a bit overwhelming when you look at it the first time. I will break it down here for you briefly.

Reliability Monitor

Blue line across the top (top row): This is your system stability index. It is basically a scoring system based on how often your computer experiences failures. The scoring system ranges from 1 to 10. The more often your computer fails, the lower your score. The longer you go without a system or application failure, the higher your score will be.

Application failures: Every time you have an application failure, which can be an application crashing or hanging, it will show a red "x" in that column.

Windows failures: This column will get a red "x" when you have a BSOD.

Miscellaneous failures: These occur when the system unexpectedly loses power. For example, someone forced a shutdown with the power button, or the battery ran out completely.

Warnings: These do not impact your stability score but provide good information. They will show when an application installation/removal, Windows Update, or driver update was unsuccessful.

Information: This column will get a blue "i" when there are system changes you should be aware of. Driver installation, Windows Updates, and software installations will all appear in this column. This information can be very handy when troubleshooting what caused a failure.

Clicking on any of the columns will give you more detailed information about the abovementioned events. This information is great. The problem is that I cannot remotely log in to every computer in an enterprise environment and check every one of these PCs. I could try to figure out the events that trigger these reliability records and pull them in, but I would have to recreate the scoring system.

The solution? Read on.

Win32_ReliabilityStabilityMetrics, Win32_ReliabilityRecords ^

A couple of WMI classes store all the scorings and records discussed above. You will want to collect the following properties from the WMI classes:

Win32_ReliabilityStabilityMetrics

  • TimeGenerated: The system calculates the stability index score every hour the computer is on and will record the associated timestamp in this property
  • SystemStabilityIndex: The calculated stability index score

Win32_ReliabilityRecords

  • EventIdentifier: The ID for the event in the Windows Event Log
  • Message: The body of the Windows event associated with the failure or change
  • ProductName: The product name or executable associated with the failure
  • SourceName: This designates what type of event we are looking at and will always be one of the following:
    • Application Error: Application stops responding and crashes
    • Application Hang: Application stops responding but recovers
    • Application-Add-On-Event-Provider: Add-ons were enabled for Internet Explorer
    • EventLog: The only event I have seen from this source is "The system was shut down unexpectedly"
    • Microsoft-Windows-Setup: Occurs when Windows is first installed
    • Microsoft-Windows-StartupRepair: Windows failed to boot and a startup repair was attempted
    • Microsoft-Windows-UserPnp: Driver-related events
    • Microsoft-Windows-WER-SystemErrorReporting: Blue screen of death
    • Microsoft-Windows-WindowsUpdateClient: Windows Updates
    • MsiInstaller: Application installations and removals
  • TimeGenerated: See above
  • User: The user account active during the event
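
To get a feel for these properties before collecting anything remotely, you can query both classes on the local computer. The following is a minimal sketch, not the script from this article: the function name Get-LocalReliabilitySample and its First parameter are hypothetical, while the class and property names are the ones listed above.

```powershell
# Hypothetical helper: returns the newest entries from both reliability classes
# on the local computer. Class and property names are the ones listed above.
function Get-LocalReliabilitySample {
    param(
        # How many of the most recent entries to return per class
        [int]$First = 5
    )

    [pscustomobject]@{
        # Most recent stability index samples
        Metrics = Get-CimInstance -ClassName Win32_ReliabilityStabilityMetrics |
            Sort-Object -Property TimeGenerated -Descending |
            Select-Object -Property TimeGenerated, SystemStabilityIndex -First $First

        # Most recent failure, warning, and information records
        Records = Get-CimInstance -ClassName Win32_ReliabilityRecords |
            Sort-Object -Property TimeGenerated -Descending |
            Select-Object -Property TimeGenerated, SourceName, ProductName, EventIdentifier, User, Message -First $First
    }
}
```

Calling (Get-LocalReliabilitySample).Records | Format-Table should then give a quick look at the latest events on your own machine.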

Enterprise client management systems such as Microsoft's System Center Configuration Manager (SCCM) or Symantec's Symantec Management Platform can inventory these classes. However, if such tools are not available in your environment, you can use PowerShell, a CSV file on a network share, and PowerBI Desktop to collect and analyze the data. You could easily adapt the reporting process to use a database as a source in place of the CSV.

Collecting system stability data with PowerShell ^

You can use the simple script below to collect data from a list of computers over the last 30 days.

The script uses the Get-CimInstance cmdlet to query the WMI classes remotely on the computers stored in an array. It then exports the reliability records and the stability metrics into two CSV files in the Documents folder.
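
The script itself is not reproduced in this copy of the article, so here is a minimal sketch of what such a collection script could look like, based only on the description above. The function name Export-ReliabilityData, its parameters, and the two CSV file names are assumptions for illustration. Remote Get-CimInstance calls use WinRM by default, so the sketch also assumes WinRM is enabled on the target computers.

```powershell
# Hypothetical sketch: queries both reliability classes on a list of computers
# and writes two CSV files. Function, parameter, and file names are assumptions.
function Export-ReliabilityData {
    [CmdletBinding()]
    param(
        # Computers to query (Get-CimInstance accepts an array here)
        [Parameter(Mandatory)]
        [string[]]$ComputerName,

        # Target folder; defaults to the current user's Documents folder
        [string]$OutputPath = [Environment]::GetFolderPath('MyDocuments'),

        # Only keep entries newer than this many days
        [int]$Days = 30
    )

    $cutoff = (Get-Date).AddDays(-$Days)

    # Hourly stability index scores
    $metrics = Get-CimInstance -ClassName Win32_ReliabilityStabilityMetrics -ComputerName $ComputerName |
        Where-Object { $_.TimeGenerated -ge $cutoff } |
        Select-Object -Property PSComputerName, TimeGenerated, SystemStabilityIndex

    # Failure, warning, and information records
    $records = Get-CimInstance -ClassName Win32_ReliabilityRecords -ComputerName $ComputerName |
        Where-Object { $_.TimeGenerated -ge $cutoff } |
        Select-Object -Property PSComputerName, EventIdentifier, Message, ProductName, SourceName, TimeGenerated, User

    $metrics | Export-Csv -Path (Join-Path $OutputPath 'StabilityMetrics.csv') -NoTypeInformation
    $records | Export-Csv -Path (Join-Path $OutputPath 'ReliabilityRecords.csv') -NoTypeInformation
}
```

With placeholder computer names, a call would look like Export-ReliabilityData -ComputerName @('Computer1','Computer2','Computer3'). Computers that cannot be reached produce an error but do not stop the query against the others.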

Building the PowerBI Report ^

Next, you have to download PowerBI Desktop, the report builder portion of the PowerBI product. You can use this for free without registering for an account. After installing it, you can then import both CSVs created from your PowerShell commands using Get Data in PowerBI Desktop.

Now you can start creating charts based on the data you have collected. I will walk you through creating a couple of easier ones I have found useful. The procedure becomes even more useful when you relate this data to hardware and operating system inventory information.

Average of System Stability Index

I use the System Stability Index to spot major drops in stability in the environment and watch it to make sure "fixes" pushed to clients are making a difference.
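
If you want to sanity-check the chart values outside of PowerBI, the same average can be computed from the stability metrics CSV in PowerShell. This is a rough sketch under a few assumptions: the function name Get-DailyAverageStability is hypothetical, the rows come from Import-Csv (so all values are strings), and the timestamps and numbers were written in a format the current culture can parse.

```powershell
# Hypothetical helper: averages SystemStabilityIndex per calendar day.
# Expects rows as produced by Import-Csv (TimeGenerated and index as strings).
function Get-DailyAverageStability {
    param(
        [Parameter(Mandatory)]
        [object[]]$Metric
    )

    $Metric |
        # Bucket the hourly samples by date
        Group-Object -Property { ([datetime]::Parse($_.TimeGenerated)).ToString('yyyy-MM-dd') } |
        ForEach-Object {
            $avg = ($_.Group | Measure-Object -Property SystemStabilityIndex -Average).Average
            [pscustomobject]@{
                Date         = $_.Name
                Samples      = $_.Count
                AverageIndex = [math]::Round($avg, 2)
            }
        } |
        Sort-Object -Property Date
}
```

A call could look like Get-DailyAverageStability -Metric (Import-Csv .\StabilityMetrics.csv), where the file path is a placeholder for wherever you stored the metrics export.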

Average of System Stability

Trending Events

Trending Events are useful for correlating changes to failures.
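
The data behind such a trend is simply the number of events per day and source. As a hedged sketch of that aggregation in PowerShell (the function name Get-DailyEventTrend is hypothetical, and the rows are assumed to come from Import-Csv on the reliability records export):

```powershell
# Hypothetical helper: counts reliability records per day and SourceName.
# Expects rows as produced by Import-Csv (TimeGenerated as a string).
function Get-DailyEventTrend {
    param(
        [Parameter(Mandatory)]
        [object[]]$Record
    )

    $Record |
        # Group by calendar day first, then by the event source
        Group-Object -Property { ([datetime]::Parse($_.TimeGenerated)).ToString('yyyy-MM-dd') }, SourceName |
        ForEach-Object {
            [pscustomobject]@{
                Date       = $_.Values[0]
                SourceName = $_.Values[1]
                Events     = $_.Count
            }
        } |
        Sort-Object -Property Date, SourceName
}
```

Lining up the MsiInstaller rows against the Application Error rows by date gives a rough text version of the correlation that the chart makes visible.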

Trending Events

With just three PCs it's hard to tell, but I will show you an example from the dashboard I have built for our environment. We recently rolled out an update to SnagIt that is causing issues. The correlation of failures to the installation events from MsiInstaller is pretty clearly visible.

Trending SnagIt issue root cause

We also use this indicator to rule out updates as a possible cause of crashing. The chart shows that Excel was crashing just as often before the latest update installation events as it was after.

Trending issue Excel root cause

Conclusion ^

It is important to keep an eye on reliability and stability indicators to fix problems before end users start reporting them. With the help of PowerShell, you can quickly get an overview of the troubles that are building up in your network.

13 Comments
  1. Matt D. 8 months ago

    Great article...can't wait to give this a try and be more proactive!

    • Author
      Micah Rairdon 8 months ago

      Thanks Matt! I'm really curious about what reliability score other people are averaging at. We seem to be between 7.5 and 8.

  2. […] Monitoring Windows system stability with PowerShell […]

  3. William Batzle 8 months ago

    YouTube the heck out of this. Trying to use it myself now and it is working really well.

    • Author
      Micah Rairdon 8 months ago

      Thanks for the comment William. Glad it's working out well for you! I have not really done much on YouTube but I'm curious to know what else you would like to see.

      • William Batzle 8 months ago

        I haven't gotten this to work the way you did but man is this an awesome script and BI is really cool to use. I can't wait to learn more on how to get it working in the same manner.

  4. CMI 8 months ago

    Hello Micah,

    Excellent article, very informative. Thank you for writing this. Do you have any thoughts on how this scales? Would it scale to around 200 machines or more?

    Thank you.

    CMI

     

    • Author
      Micah Rairdon 8 months ago

      Hey CMI,

      Great question! It will definitely scale. The queries are pretty low impact on the clients they are requesting info from and the reporting of course is pretty simple data.

      We moved to using our client management solution, Symantec's SMP, to collect the information from the WMI classes because we needed to scale it globally from 5k clients and collect info when off the corporate network. SCCM would also let you inventory the two WMI classes you need.

      As long as the machines are pingable from the computer you are running this script from it should scale fine to at least 200.

       

  5. Matt 8 months ago

    @('Computer1','Computer2','Computer3') is an array, not a hash table.

    Other than being nitpicky about that, I want to say thanks for sharing this.

    • Michael Pietroforte 8 months ago

      I think this was my bad. I corrected the article. Thanks for the hint.

  6. Paolo Maffezzoli 8 months ago

    Nice article. I have to try it.

  7. Aries 2 months ago

    This is fabulous!!

    My various W10 machines only seem to collect 30 days worth of data but my older OS machines collect a year of data. Is that configurable?

    • Author
      Micah Rairdon 2 months ago

      Hi Aries,

      Glad you liked it! I have not been able to find any documentation on how to change how long it keeps that data, if you find it let me know though!

      We ended up just working with the 30 days of data to monitor existing issues rather than trend over time. If management gets more interested, the plan was to begin archiving the data off somewhere so we could trace it over time better.

© 4sysops 2006 - 2017
