4405ch04 Continuous availability and manageability.fm Draft Document for Review September 2, 2008 5:05 pm
106 IBM Power 570 Technical Overview and Introduction
non-critical error is detected or if the error occurs in a resource that can be removed
from the system configuration, the booting process is designed to proceed to
completion. The errors are logged in the system nonvolatile random access memory
(NVRAM). When the operating system completes booting, the information is passed
from the NVRAM into the system error log where it is analyzed by error log analysis
(ELA) routines. Appropriate actions are taken to report the boot time error for
subsequent service if required.
One important Service Processor improvement allows the system administrator or service
representative dynamic access to the Advanced Systems Management Interface (ASMI)
menus. In previous generations of servers, these menus were accessible only when the
system was in standby power mode. Now, the menus are available from any Web
browser-enabled console attached to the Ethernet service network concurrent with normal
system operation. A user with the proper access authority and credentials can now
dynamically modify service defaults, interrogate Service Processor progress and error logs,
set and reset guiding light LEDs, and access all other Service Processor functions without
having to power down the system to the standby state.
The Service Processor also manages the interfaces for connecting Uninterruptible Power
Source (UPS) systems to the POWER6 processor-based systems, performing Timed
Power-On (TPO) sequences, and interfacing with the power and cooling subsystem.
Error checkers
IBM POWER6 processor-based systems contain specialized hardware detection circuitry that
is used to detect erroneous hardware operations. Error checking hardware ranges from parity
error detection coupled with processor instruction retry and bus retry, to ECC correction on
caches and system buses. All IBM hardware error checkers have distinct attributes:
Continual monitoring of system operations to detect potential calculation errors.
Attempts to isolate physical faults based on run-time detection of each unique failure.
The ability to initiate a wide variety of recovery mechanisms designed to correct the problem.
The POWER6 processor-based systems include extensive hardware and firmware
recovery logic.
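The two ends of the range described above, parity detection coupled with retry and ECC correction, can be illustrated with a small Python sketch. This is not IBM's circuitry: the function names are hypothetical, the retry loop is a software stand-in for hardware instruction or bus retry, and the Hamming(7,4) code is a textbook single-error-correcting code chosen for brevity, since the text does not say which ECC codes the POWER6 systems use.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit for word: 1 when the number of set bits is odd."""
    return bin(word).count("1") & 1


def read_with_retry(read_fn, max_retries: int = 3):
    """Call read_fn until data and stored parity agree.

    read_fn is a hypothetical callable returning (data, stored_parity).
    Returns (data, retries_used); raises if the error persists.
    """
    for attempt in range(max_retries + 1):
        data, stored_parity = read_fn()
        if parity_bit(data) == stored_parity:
            return data, attempt
    raise RuntimeError("parity error persisted after retries")


def hamming74_encode(nibble: int) -> list:
    """Encode 4 data bits into 7 bits (positions 1..7, parity at 1, 2, 4)."""
    code = [0] * 8  # index 0 unused so list indices match bit positions
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(bits: list):
    """Correct at most one flipped bit; return (nibble, error_position or 0)."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    if syndrome:
        code[syndrome] ^= 1  # the syndrome is the position of the flipped bit
    data = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(data)), syndrome
```

A single parity bit can detect a one-bit error but cannot locate it, which is why the sketch pairs it with retry. The Hamming code locates the flipped bit through the syndrome and corrects it in place.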
Fault Isolation Registers
Error checker signals are captured and stored in hardware Fault Isolation Registers (FIRs).
The associated Who’s on First logic circuitry is used to limit the domain of an error to the first
checker that encounters the error. In this way, run-time error diagnostics can be deterministic:
for every check station, the unique error domain for that checker is defined and
documented. Ultimately, the error domain becomes the Field Replaceable Unit (FRU) call,
and manual interpretation of the data is not normally required.
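The idea of latching the first checker and mapping it to an FRU call can be modeled in a few lines of Python. This is only a software analogy under stated assumptions: the class name, the checker names, and the FRU names are all hypothetical, and real FIRs are hardware registers rather than objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FaultIsolationRegister:
    """Toy model of a FIR with 'Who's on First' capture (names are illustrative)."""

    fru_by_checker: Dict[str, str]                  # documented error domain per checker
    fired: List[str] = field(default_factory=list)  # every checker that fired, in order
    first: Optional[str] = None                     # latched once, never overwritten

    def capture(self, checker: str) -> None:
        """Record a checker signal; only the first one becomes the error domain."""
        if checker not in self.fru_by_checker:
            raise KeyError(f"unknown checker: {checker}")
        if self.first is None:
            self.first = checker
        if checker not in self.fired:
            self.fired.append(checker)

    def fru_call(self) -> Optional[str]:
        """Return the FRU for the first checker, or None if nothing fired."""
        return None if self.first is None else self.fru_by_checker[self.first]


# Hypothetical checker-to-FRU table for the example.
fir = FaultIsolationRegister({
    "memory_ecc": "memory DIMM",
    "cache_parity": "processor card",
    "bus_parity": "I/O adapter",
})
fir.capture("memory_ecc")    # the fault originates here
fir.capture("cache_parity")  # downstream checkers fire as the error propagates
print(fir.fru_call())
```

Because the first checker is latched and later signals do not overwrite it, the lookup always resolves to the same FRU for the same fault, which is the deterministic behavior the text describes.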
First Failure Data Capture (FFDC)
First Failure Data Capture (FFDC) is an error isolation technique that ensures that when a
fault is detected in a system through error checkers or other types of detection methods, the
root cause of the fault will be captured without the need to recreate the problem or run an
extended tracing or diagnostics program.
For the vast majority of faults, a good FFDC design means that the root cause will be
detected automatically without intervention of a service representative. Pertinent error data
related to the fault is captured and saved for analysis. In hardware, FFDC data is collected
from the Fault Isolation Registers and Who’s on First logic. In firmware, this data consists of
return codes, function calls, and similar information.
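For the firmware side, the following Python sketch shows one way to keep a short trace of function calls and return codes and to freeze it at the first failure, so that the problem does not have to be recreated. It is an assumption-laden illustration, not IBM firmware: the `FFDCBuffer` class, the convention that a nonzero return code means failure, and the `init_bus` and `probe_device` functions are all hypothetical.

```python
import functools
from collections import deque


class FFDCBuffer:
    """Keep the most recent calls and snapshot them when the first failure occurs."""

    def __init__(self, depth: int = 16):
        self.trace = deque(maxlen=depth)  # rolling record of (name, args, rc)
        self.first_failure = None         # frozen copy of the trace, set once

    def traced(self, fn):
        """Decorator that records each call and its return code."""
        @functools.wraps(fn)
        def wrapper(*args):
            rc = fn(*args)
            self.trace.append((fn.__name__, args, rc))
            # Assumption: nonzero return code signals a failure.
            if rc != 0 and self.first_failure is None:
                self.first_failure = list(self.trace)
            return rc
        return wrapper


ffdc = FFDCBuffer(depth=4)


@ffdc.traced
def init_bus(bus_id):
    return 0  # hypothetical step that always succeeds


@ffdc.traced
def probe_device(slot):
    return -5 if slot == 3 else 0  # hypothetical fault in slot 3


init_bus(0)
probe_device(1)
probe_device(3)  # first failure: the snapshot is taken here
probe_device(3)  # later failures do not overwrite the captured data
for entry in ffdc.first_failure:
    print(entry)
```

The snapshot holds the failing call together with the calls that led up to it, which is the kind of context a later analysis step would need.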