Physical memory failures do not happen often, but when they occur they are fatal. Sometimes they won't just crash the operating system but will also cause hard disk corruption. In the worst case, a destroyed database is the end result. This is just because one of the billions of memory cells confused a zero with a one. ECC (Error-Correcting Code) memory improves fault tolerance, but it is expensive and so are the supporting motherboards. And, of course, ECC doesn't guarantee that RAM failures won't occur.
Memory diagnostic tools exist for Windows. However, testing RAM while a full-blown operating system is loaded doesn't make much sense because the memory occupied by the operating system and its running programs can't be probed that way. Thus, it is important to use a standalone memory diagnostic tool that allows you to boot from a CD or a USB stick.
Of course, you can also use the memory diagnostic tool of the BIOS. You can typically choose between a quick and a thorough memory test. However, even if the BIOS confirms that your memory is okay, this does not guarantee that all memory cells are working properly.
For one, BIOS memory diagnostic tools usually can't find intermittent memory problems, that is, problems that only occur under specific conditions, such as when two adjacent memory cells affect each other. Moreover, old memory modules in particular sometimes fail only at certain temperatures. Frequent temperature changes make silicon brittle, causing micro fractures on the chip. Those micro fractures often produce problems only at very specific temperatures because the chip expands unevenly as its temperature changes.
Hence, I recommend starting the memory diagnostic tool when the machine is still cold and then running the test for 20 minutes or so until the computer has reached its operating temperature. Unfortunately, even then you can't be 100% sure that all memory cells are working properly. Since memory diagnostic tools use different algorithms to probe memory cells, it doesn't hurt to get a second opinion from another tool.
The ideal way for diagnosing memory is to write a certain value (1 or 0) to a memory cell, then write the opposite value to all adjacent cells and probe the original cell to see if it still holds the right value. This method ensures that writing to a memory cell doesn't affect adjacent cells, which is often the cause of intermittent errors.
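To make the idea concrete, here is a minimal Python sketch of that neighbor test. It runs against a simulated memory, not real RAM, and everything in it is an assumption for illustration: the `FaultyMemory` class, the `neighbor_test` function, and the coupling fault in which writing to one "aggressor" cell also overwrites a "victim" cell.

```python
class FaultyMemory:
    """Simulated one-bit-per-cell memory with an optional coupling fault."""

    def __init__(self, size, aggressor=None, victim=None):
        self.cells = [0] * size
        self.aggressor = aggressor  # hypothetical: writes here leak ...
        self.victim = victim        # ... into this cell

    def write(self, addr, bit):
        self.cells[addr] = bit
        # Simulated fault: writing the aggressor forces the same bit
        # into the victim cell.
        if addr == self.aggressor and self.victim is not None:
            self.cells[self.victim] = bit

    def read(self, addr):
        return self.cells[addr]


def neighbor_test(mem, size):
    """Return the addresses that lost their value after neighbor writes."""
    failures = set()
    for addr in range(size):
        for bit in (0, 1):
            mem.write(addr, bit)                # value under test
            for n in (addr - 1, addr + 1):      # logically adjacent cells
                if 0 <= n < size:
                    mem.write(n, 1 - bit)       # write the opposite value
            if mem.read(addr) != bit:           # probe the original cell
                failures.add(addr)
    return sorted(failures)


print(neighbor_test(FaultyMemory(8), 8))                         # healthy
print(neighbor_test(FaultyMemory(8, aggressor=3, victim=4), 8))  # coupled
```

The healthy memory yields an empty list, while the coupled one reports cell 4. Note that the sketch simply treats neighboring addresses as physical neighbors, which real hardware does not promise.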
The problem is that different chip designs make it difficult to determine adjacent memory cells. Memory diagnostic tools thus work with strategies that approximate this testing method. Usually they fill the memory with certain patterns, verify that the pattern has been written correctly, and then do the same with the pattern's complement. This still doesn't guarantee that a memory chip is flawless, but it increases the likelihood considerably.
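The pattern-and-complement strategy can be sketched in a few lines of Python. Again, this is a simulation under stated assumptions and not how any particular diagnostic tool is implemented: `StuckBitMemory`, `pattern_test`, and the chosen patterns are hypothetical, and the only fault modeled is a bit that is permanently stuck at 0.

```python
class StuckBitMemory:
    """Simulated byte-wide memory in which some bits of one cell are stuck at 0."""

    def __init__(self, size, stuck_addr=None, stuck_mask=0x00):
        self.data = bytearray(size)
        self.stuck_addr = stuck_addr  # hypothetical faulty address
        self.stuck_mask = stuck_mask  # bits that can never hold a 1

    def write(self, addr, value):
        if addr == self.stuck_addr:
            value &= ~self.stuck_mask & 0xFF  # stuck bits stay 0
        self.data[addr] = value

    def read(self, addr):
        return self.data[addr]


def pattern_test(mem, size, patterns=(0x55, 0x33, 0x0F)):
    """Fill memory with each pattern and its complement, then verify.

    Returns a list of (address, expected, actual) tuples.
    """
    errors = []
    for pattern in patterns:
        for value in (pattern, pattern ^ 0xFF):  # pattern, then complement
            for addr in range(size):
                mem.write(addr, value)
            for addr in range(size):
                actual = mem.read(addr)
                if actual != value:
                    errors.append((addr, value, actual))
    return errors


# Bit 1 of address 5 is stuck at 0. The pattern 0x55 writes a 0 to that
# bit anyway, so only the complement 0xAA exposes the fault.
mem = StuckBitMemory(16, stuck_addr=5, stuck_mask=0x02)
for addr, expected, actual in pattern_test(mem, 16, patterns=(0x55,)):
    print(f"address {addr}: expected {expected:#04x}, read {actual:#04x}")
```

The example shows why the complement pass matters: a stuck bit goes unnoticed as long as the pattern happens to write the value the bit is stuck at.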
If the memory diagnostic tool finds an error, it is sometimes unclear which memory module contains the corrupt cells. In this case, the best way to find the faulty module is to test all modules independently by inserting them one by one into the computer. If this isn't possible, for instance because the board requires an even number of modules, you can rotate the modules and see if the memory diagnostic tool reports the error at different addresses. Another option is to replace a single module and then check if the error recurs.
If you are uncertain whether a memory module has flaws or not, replacing it is usually the best option. It certainly depends on the importance of the corresponding machine, but in most cases the crash of a production server is more expensive than buying new RAM. In any case, I recommend testing the memory of old machines every now and then, for instance when you have to reboot the server anyway. In my next post, I will review a free memory diagnostic tool that can be used for this purpose.