Memory Error Investigation
Page Retirement: a feature implemented through the fixes for bug IDs 4484338, 4504686, 4880360, and 4915531.
A memory DIMM that is experiencing repeated correctable errors (CE, single-bit) might have an increased PROBABILITY of experiencing an uncorrectable error (UE, multi-bit). Likewise, the probability of a memory error condition that could result in system downtime also increases.
To help address this, new features have been implemented for UltraSPARC II-based, UltraSPARC III-based, and UltraSPARC IV-based systems. These features attempt to PROACTIVELY PREDICT which memory components (DIMMs) have an increased probability of experiencing an uncorrectable error, and then remove this memory from future use once it is no longer in use by the kernel or any processes.
CE Categories:
Intermittent/Transient Soft Error: a CE is considered intermittent if the error is not detected upon a reread of the affected memory word.
Persistent/Temporary Soft Error: a CE is considered persistent if the error is detected upon reread, but the scrubbing operation corrected it.
Sticky/Stuck-at Hard Error: a CE is considered sticky if the error is still present after scrubbing.
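The three CE categories above amount to a two-step decision: reread the word, then scrub and check again. The following is a minimal sketch of that decision only, not the Solaris implementation. The names CEClass, classify_ce, error_on_reread, and error_after_scrub are hypothetical and were chosen for illustration.

```python
from enum import Enum


class CEClass(Enum):
    """Hypothetical labels for the three CE categories described above."""
    INTERMITTENT = "intermittent"  # transient soft error
    PERSISTENT = "persistent"      # temporary soft error
    STICKY = "sticky"              # stuck-at hard error


def classify_ce(error_on_reread: bool, error_after_scrub: bool) -> CEClass:
    """Classify a correctable error from the two observations in the text.

    error_on_reread:   the error was detected again when the word was reread.
    error_after_scrub: the error was still present after the scrub operation.
    Both parameters are assumptions made for this sketch.
    """
    if not error_on_reread:
        # Not reproduced on reread: intermittent.
        return CEClass.INTERMITTENT
    if not error_after_scrub:
        # Reproduced on reread, but scrubbing corrected it: persistent.
        return CEClass.PERSISTENT
    # Still present after scrubbing: sticky.
    return CEClass.STICKY
```

For example, classify_ce(True, False) describes an error that was seen again on reread but was cleared by scrubbing, so it falls into the persistent category.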
NOTE:
The CPU receives notification that a correctable memory error has occurred through the trap mechanism (refer to pg. 14 of Solaris OS Availability Features for detailed information).
These errors are detected by memory scrubbers, which run every 12 hours by default. Memory scrubbers find memory faults.
Solaris 8 Kernel Update patch 117000-03/Solaris 9 Kernel Update patch 112233-12: these patches implement a more aggressive method of page retirement that succeeds in retiring pages under a greater range of conditions.
Page Retirement: the page retirement feature enables a page of memory to be removed from use by Solaris in response to repeated ECC errors within a memory page on a DIMM.
The OS distinguishes between pages that have a CE and those that have a UE. A page with a UE that might be able to be cleared is marked as TOXIC. Pages mapped to a DIMM that has experienced multiple correctable errors are marked as FAILING.
If a page is marked as TOXIC, the OS attempts to clean any errors from the page using a SCRUBBING algorithm when page_free() is invoked on that page. If it can verify that there are no errors on the page after SCRUBBING, it allows that page to be returned to the freelist. This ensures that a single error does not cause a page to be removed from the system. If the SCRUBBING is unsuccessful, the page is marked as FAILING and is immediately retired.
If a page is marked as FAILING, no attempt is made to clean the page by SCRUBBING. It is retired immediately once it is no longer in use by other threads. A retired page is not returned to the freelist, so it will not be used again until reboot, and the amount of available memory is decremented.
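The TOXIC and FAILING handling described above can be summarized as a small decision routine that runs when a marked page is freed. This is a minimal sketch under stated assumptions, not kernel code: PageState, Outcome, handle_freed_page, and the scrub_clears_errors callback are hypothetical names, and the real page_free() path in Solaris is not modeled here.

```python
from enum import Enum
from typing import Callable


class PageState(Enum):
    TOXIC = "toxic"      # page had a UE that might be clearable
    FAILING = "failing"  # page maps to a DIMM with multiple CEs


class Outcome(Enum):
    RETURNED_TO_FREELIST = "returned_to_freelist"
    RETIRED = "retired"  # not reused until reboot


def handle_freed_page(state: PageState,
                      scrub_clears_errors: Callable[[], bool]) -> Outcome:
    """Decide what happens to a marked page once it is freed.

    scrub_clears_errors is a hypothetical callback that scrubs the page
    and returns True if no errors remain afterwards.
    """
    if state is PageState.TOXIC:
        if scrub_clears_errors():
            # Scrub verified the page is clean: keep it in service.
            return Outcome.RETURNED_TO_FREELIST
        # Scrub failed: the page is treated as FAILING and retired.
        return Outcome.RETIRED
    # FAILING pages are retired without any scrub attempt.
    return Outcome.RETIRED
```

Passing the scrub step as a callback keeps the sketch testable and makes the asymmetry visible: the callback is consulted only for TOXIC pages and is never invoked for FAILING ones.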
Aggressive Page Retirement: (Solaris 8 Kernel Update patch 117000-03/Solaris 9 Kernel Update patch 112233-12) a new algorithm that successfully retires pages that are locked, dirty, or in COPY_ON_WRITE status.
Current Status
cediag/cestat, a Sun utility for Solaris that diagnoses system memory errors, was installed on xpressdev1.
cediag is scheduled to run at midnight on xpressdev1. No significant errors were found by cediag on 10/18/05 and 10/19/05.
cediag will be installed on xpress10 on 10/21/05 after trading hours.
No memory errors have been found on xpress10 since 10/16/05.
No memory errors have been found on xpressdev1 since 10/05/05.
Sources:
1)
2) Solaris OS Availability Features
3)