Keeping all IT systems updated (patched) is not only a crucial part of a secure and operational environment but also a never-ending task for IT administrators. It is often quite a challenging and cumbersome process, especially in larger environments. The purpose of this article is to provide tips based on the experiences I have had in organizations of different sizes.

There are updates from hardware vendors, operating system vendors, antivirus vendors, and so on and so forth. Furthermore, there are line-of-business (LOB) application updates. Some are less important and can be applied once in a while; some are more important and should be applied on a regular (monthly) basis. Emergency patches should be applied out-of-band as soon as possible. If not managed properly, patching can not only become a real pain but can also have a significant business impact, for example, when untested updates break crucial applications.

Document the patch management process

First, you should have your own documentation for patch management. Each organization is a bit different and has its own procedures, policies, and needs. The documentation should include all the steps performed in your patch release cycle, testing scenarios, contact information for the responsible persons, and a rollback scenario.

Create and manage IT equipment inventory

You can't update something that you don't know about. I remember a case when WannaCry appeared, and all systems had to be patched immediately. Our colleagues thought they had done the job, but the infection was still there. In the end, we found out that there was a Windows XP computer inside a PLC-controlled plastic injection molding machine that nobody (from IT) knew about.

The inventory should include all servers, workstations, storage devices, routers, and so on. A simple Excel sheet might serve this purpose well.
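A simple sheet can even be generated and kept up to date with a short script. The sketch below is purely illustrative (the column names, asset rows, and `write_inventory` helper are my own assumptions, not a standard schema); it writes a CSV that opens directly in Excel.

```python
import csv
import io

# Hypothetical minimal inventory schema: one row per asset. In practice,
# this data would come from a discovery tool or be maintained by hand.
FIELDS = ["hostname", "type", "os", "owner", "location", "last_patched"]

assets = [
    {"hostname": "srv-web01", "type": "server", "os": "Windows Server 2022",
     "owner": "web-team", "location": "DC1", "last_patched": "2023-05-14"},
    {"hostname": "plc-line3", "type": "embedded", "os": "Windows XP",
     "owner": "plant-it", "location": "Plant 3", "last_patched": "never"},
]

def write_inventory(assets, stream):
    """Write the asset list as CSV so it can be opened in Excel."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(assets)

buf = io.StringIO()
write_inventory(assets, buf)
print(buf.getvalue())
```

The point is less the tooling and more the habit: every asset, including the odd embedded box on the factory floor, gets a row with an owner and a last-patched date.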

Categorize by risk and priority

IT systems have different priorities and pose different risks. A server exposed to the Internet poses a higher risk than a server located in a secure network. A server running a production application has a higher priority than a file share. You should know which systems are more or less critical and focus on them accordingly.

In addition, patches have different priorities. A critical patch for an ESXi host has a higher priority than a standard Windows Server cumulative update. An emergency security patch (let me use WannaCry as an example again) has an even higher priority and should be applied out of band.
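The two dimensions above (system exposure and patch severity) can be combined into a simple scoring rule. This is only a sketch with made-up weights and bucket names, not an industry standard; tune it to your own risk appetite.

```python
# Toy scoring: multiply system exposure by patch severity to pick a
# scheduling bucket. The weights and thresholds are illustrative only.
SYSTEM_RISK = {"internet-facing": 3, "internal": 2, "isolated": 1}
PATCH_SEVERITY = {"emergency": 3, "critical": 2, "standard": 1}

def patch_priority(exposure, severity):
    score = SYSTEM_RISK[exposure] * PATCH_SEVERITY[severity]
    if score >= 6:
        return "out-of-band"      # patch immediately, outside the cycle
    if score >= 3:
        return "next monthly cycle"
    return "quarterly"

print(patch_priority("internet-facing", "emergency"))  # out-of-band
print(patch_priority("isolated", "standard"))          # quarterly
```

Even a crude rule like this beats ad hoc decisions, because it makes the "why did we patch this first?" answer repeatable.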

Define the patch release cycle

Ideally, patches should first be applied in a non-production (test lab, development) environment to verify that they will not break anything. The cases where a patch causes third-party application issues or system instability are endless. The non-production environment should mirror your production environment.

The common patch release cycle for medium-to-large companies is:

  • Week 1 – Test and development environment
  • Week 2 – Pre-live environment
  • Week 3 – Production environment
  • Week 4 – Disaster recovery environment
Patch release cycle

Using such a cycle allows the system and application owners to evaluate possible impacts and solve any issues before applying the patches to production. Of course, if you don't have all these environments, you have to choose a different approach.

I know this might be tricky for smaller companies with few servers and desktops, but having a virtual machine or spare desktop where patches can be tested is recommended. It is always better to spend some time and resources on such tests rather than breaking your systems.

Test and evaluate system stability

After each release cycle step described above, the applications should be properly tested. Some issues might not be visible at first glance. For example, a .NET-based application may show unstable behavior after a .NET Framework update. The test process may be manual or automated using scripts and other tools.
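An automated post-patch check can be as simple as a list of named smoke tests that either pass or raise. The harness below is a minimal sketch; the two placeholder checks stand in for real probes (service status queries, test logins, HTTP health checks) that you would write for your own applications.

```python
# Minimal post-patch smoke-test harness: each check is a function that
# returns True on success or raises/returns False on failure.
def check_service_running():
    # Placeholder: e.g., query the service manager on the target host.
    return True

def check_app_responds():
    # Placeholder: e.g., perform a test login or an HTTP health probe.
    return True

CHECKS = {
    "service running": check_service_running,
    "application responds": check_app_responds,
}

def run_smoke_tests(checks):
    """Run every check, treating any exception as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results

results = run_smoke_tests(CHECKS)
for name, ok in results.items():
    print(f"{'PASS' if ok else 'FAIL'}: {name}")
```

The same harness can run after each stage of the release cycle, so a regression introduced in week 1 never reaches production in week 3.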

In articles about this topic, you often find the claim that the security team has to test patch stability. Such a statement is far from reality, however. Companies that have their own dedicated IT security departments are, in most cases, very large companies with dozens of servers, systems, and applications. The security team's responsibilities usually cover firewalls, antivirus, IPS systems, and so on. The security team usually informs system owners (server admins) about vulnerabilities. It is impossible for IT security team members to know each and every application in the company and thus be able to evaluate whether a patch has any impact on it.

It is the responsibility of the system and application owners to perform such tests. Each application should have its own test scenario based on its needs. Make sure you have this included in your documentation, at least for mission-critical applications.

Backups of production systems

Backing up any important data and systems is definitely a must. Some guides will tell you to create a full system backup before applying updates. This is also a bit misleading. Backups should be run regularly, and any issues should be handled when they occur. Also, backups are usually handled by different teams, and system owners often don't even have access to the backup systems. In all the corporations I have worked in, our system team never cared about backups before running updates, as it was simply not our responsibility.

Of course, this is different if you manage everything on your own. Anyway, if you know that a Friday-night backup of your databases was successful, I see no reason to run another full (or even system) backup on Sunday before running the updates.
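If you do manage everything yourself and have read access to your backup tool's database, a quick query can confirm the last backup status before you patch. The sketch below uses an in-memory SQLite table as a stand-in; real products (Veeam, NetBackup, etc.) each have their own schema, so the table and column names here are purely hypothetical.

```python
import sqlite3

# Illustrative stand-in for a backup product's job-history table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE backup_jobs (server TEXT, finished TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO backup_jobs VALUES (?, ?, ?)",
    [("srv-db01", "2023-06-02 23:10", "success"),
     ("srv-db02", "2023-06-02 23:40", "failed")],
)

# Servers whose most recent backup did not succeed should be investigated
# (or excluded) before the update run.
rows = conn.execute(
    "SELECT server FROM backup_jobs WHERE status != 'success'"
).fetchall()
for (server,) in rows:
    print(f"no good backup for {server} - investigate before patching")
```

A read-only account on the backup database is usually enough for this kind of check, and the query can be scheduled to run right before the patch window.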

Configuration management

Any changes to the production environment should be properly documented. If you have a configuration management tool (like HPSM or ServiceNow), make sure to have a change ticket created for each round of updates. This can help you track any issues that may occur.

In the case of smaller companies without a ticketing system, include this information in your general IT equipment inventory document.

Roll out patches to production

Once you have tested and validated everything, you are ready to roll out the patches to the production systems. This is usually done outside of business hours (weekends) to prevent downtime and to have enough time to verify that everything went well.

If you have a virtualized environment, it might also make sense to create a virtual machine snapshot before applying the updates. This is extremely useful if you have systems that require a very quick rollback if the update fails. If you do that, don't forget the rule of thumb: snapshots should only exist for a short time in the production environment. Make sure you delete them.

Verify and report the update status

When updates are applied to hundreds of servers, usually with patching tools, some servers may fail to apply the update. In such cases, the server requires manual intervention, such as another reboot. It is common practice to have a script that checks the server status (e.g., for a pending reboot).
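The post-run verification can be boiled down to aggregating per-server results from your patching tool and flagging everything that is not a clean success. The status values and the `needs_attention` helper below are my own assumptions; adapt them to whatever your tooling actually reports.

```python
# Sketch: aggregate per-server patch results (e.g., exported from a
# patching tool) and list servers needing manual follow-up.
results = [
    {"server": "srv-app01", "status": "success"},
    {"server": "srv-db02", "status": "pending-reboot"},
    {"server": "srv-file03", "status": "failed"},
]

def needs_attention(results):
    """Return the servers whose update did not complete cleanly."""
    return [r["server"] for r in results if r["status"] != "success"]

for server in needs_attention(results):
    print(f"manual follow-up required: {server}")
```

The resulting list is exactly what belongs in the change ticket or inventory update mentioned below: which servers were patched cleanly and which ones needed intervention.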

Once verified, update your change ticket or IT equipment inventory with the results.

Final words

As you can see in this article, applying updates can be quite complicated. I wrote this post based on my experience updating hundreds of Windows Server and VMware vSphere systems. As already mentioned, IT networks vary, and the patch management process might differ accordingly.

4 Comments
  1. ioannis (Rank 2) 2 years ago

    Very useful piece of information. Thank you.
    Since it is mentioned in various articles, guides, etc., I have to comment on the existence of a test environment. Yes, having a test environment is a must, and it should be used exactly as mentioned above, BUT having a test environment similar to your production one is simply impossible. A server in the test environment will never be even close to a "live" one. Even if you clone the server and make the copy available in the test environment, the administrator cannot make sure that the server is used like the live one or even test the server himself.
    We are unaware of the specific details of the applications that run on a server, and only a real user or an app admin can really tell the impact (if there was any) that an update had. Real users don't connect to our test environment. They connect to the production one.
    What are we testing then? Mostly server stability (BSODs, reboots, etc.) or known issues. What kind of precautions do we take? We take snapshots or make sure that a snapshot exists. We also make sure that we take backups (or check that they exist) of crucial applications (such as databases) that are not covered by a snapshot.
    I also believe that no matter how big an organization is, the administrator who performs the patch management should always be aware of the backup procedures and always be in contact with the backup team.

    • Author
      Leos Marek (Rank 4) 2 years ago

      BUT having a test environment similar to your production one is simply impossible.

      Can't agree on this one. It's very easy to have identical servers/applications. What differs is the load. You can have automated tests that simulate standard user behavior. With LOB applications, it's common to have key users who do the tests in the test/pre-prod environment. That's the only way a functional update can be validated.
      When we patched a business-critical application (not only from the OS point of view, but also the application itself), we always had a key user who did the app tests and a local IT support guy who did the OS-related tests.
      Yes, system admins don't always know the app itself; that's why I said it's the responsibility of system AND app owners 🙂

      With the backups, it's not that easy. It's the same as the system/app owner relation… In large organizations, Windows admins have no access to backup systems (Tivoli, etc.). You can't ask the backup team whether the 3000 servers you are going to patch were backed up successfully.
      Cheers

  2. ioannis (Rank 2) 2 years ago

    I guess it all comes down to our experiences so far. I still find it impossible to automate user tasks on a server. Maybe it has to do with the diversity of the applications that are already installed. Keep in mind that if an administrator takes that road, he must also test/reconfigure/update the automation processes that he has already created. Yes, some simple login tests can be automated, but you still add extra workload… multiply that by 1000+…

    Concerning the pre-prod environment (when one is present), the key users also don't want to spend extra time testing the updates on your test environment. For them, it is your test environment, not theirs. "Their" pre-prod is even more valuable than the live one… and you should treat it like the production one as well :-/ I would also like to mention that sometimes the pre-prod environment does not exist in the corporate network. More than 50% of the applications that run on our servers come directly from external partners that just install them and update them, sometimes without even letting you know. Imagine how difficult it is to keep track of the documentation, the changes, and the automated user tasks you mentioned before.

    With the backups themselves, it depends on the administrator and how far he is willing to go. My approach is the following (considering that I am solely responsible if something goes wrong):

    Every backup system has a database. All that an administrator needs is a read-only account on that database. Then, with a simple SQL query, you can see whether there is a successful backup for your servers. Simple and fast (I am already doing that with Sesam, Veeam, and NetBackup). Another solution would be to ask for a daily report. It is not that big a deal to receive the backup report as well. I am still doing the SQL thing, though, because I automate it with PS.

    Thank you for your reply. It is always nice to have a solid conversation.

    • Author
      Leos Marek (Rank 4) 2 years ago

      Of course, as you say, it all comes down to experience, the organization's setup, and then your will. And many other factors.
      From my experience in the past 10 years, working for very large companies (50K+ employees), I can tell you this:
      Testing:
      Our team was responsible for VMware, Windows, and a production application from SAP (technically speaking). This app was located in nearly 200 plants all over the world. The plants produced items that were scanned by RF guns during the process, and the app did some logic during production. Each plant was a bit different and required different modifications in the app. This was developed in house. When a plant requested a modification, they were responsible for testing it and saying GO to put it into production (so other plants could use it as well). So, there was our team's own test env, then the development env, where the coding was done. Then there was the pre-prod env, where the plant users validated the mods. Then it went to production. There was even a special validation team, which had really sophisticated scripts to test all the basic functionalities.
      So, from our admin point of view, we applied the patch in our own test env and checked whether things worked well. Then we updated the dev env, where the devs validated that it worked well. Then the pre-prod was updated, and we and the key users validated… Then it went to prod, and again a key user validation was required; otherwise, the patch was reverted. So, some organizations (usually the really big ones, or those where a system outage means a financial loss of dozens of thousands of euros per hour) really do care about such a testing process.

      Backups (and other stuff):
      If I am a Windows (VMware, app, whatever) admin team member, it's my responsibility to manage Windows. I have time to do my job; I don't have time to check whether other teams do their jobs properly. That's not my role; that's a management role. Also, rights are usually segregated. One team can't be responsible for everything. If I were expected to check backup reports, why shouldn't I also check VMware reports, or other reports? Then I would end up checking everything for everyone before I could do my own job.

      From another point of view:
      What would you do if you were ready to patch your 1000 Windows servers, but the Friday morning backup report said that 10 servers had failed to back up? And you have no response from the backup team? Would you exclude those 10 servers from your (usually automated) patching process? Then you need to tell your customers (app owners) that you can't do the update and that you need another downtime window. What if such a thing happens again next month? Then you postpone and postpone again, and you will never get out of the circle.

      So, it really depends… on many things 🙂
      Cheers

© 4sysops 2006 - 2023
