After the host restarts, no filesystem corruption is reported and disk SMART tests pass. Which action will BEST address the root cause and return the server to a stable state?
Replace the DIMM in Channel 1 Slot 0, then rerun memory diagnostics before returning the host to production
Update the RAID/HBA firmware and force a full array consistency check
Disable all CPU C-states and Intel SpeedStep (or AMD Cool'n'Quiet) in UEFI/BIOS
Rebuild the current kernel's initramfs image and regenerate GRUB entries
The machine-check exception (MCE) reports an uncorrectable multi-bit ECC error on a specific memory channel and DIMM. ECC can correct single-bit errors transparently, but a multi-bit error cannot be repaired, so the memory controller signals it to the CPU as a machine check and the kernel panics rather than continue with corrupted data. The clean filesystem check and passing SMART tests rule out storage as the cause. Rebuilding the initramfs, updating RAID/HBA firmware, or changing CPU power states would leave the faulty hardware in place. The correct remediation is to replace the DIMM identified by the error (reseating and retesting is a reasonable first step, but a module that keeps failing must be replaced) and to confirm the fix with hardware diagnostics such as memtest86+ or the vendor's memory test suite. Once the bad module is replaced and memory tests pass, the host should operate normally without further kernel panics.
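As a complement to offline memory diagnostics, the kernel's EDAC counters can be checked once the host is back up. The following is a minimal sketch, not part of the original question: it assumes an EDAC driver is loaded for the platform's memory controller and that the counters appear under `/sys/devices/system/edac/mc`, as described in the kernel's EDAC documentation. The function names `read_edac_counts` and `has_uncorrectable` are hypothetical helpers chosen for illustration.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller counters in sysfs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for each mc* directory.

    "ce" is the corrected-error count, "ue" the uncorrected-error count.
    An empty dict means no EDAC controllers were found under root.
    """
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # EDAC driver not loaded, or platform unsupported
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, name in (("ce", "ce_count"), ("ue", "ue_count")):
            counter = mc / name
            # Treat a missing counter file as zero instead of failing.
            entry[key] = int(counter.read_text().strip()) if counter.is_file() else 0
        counts[mc.name] = entry
    return counts


def has_uncorrectable(counts):
    """True if any controller has logged at least one uncorrected error."""
    return any(c["ue"] > 0 for c in counts.values())


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found")
    for mc, c in counts.items():
        print(f"{mc}: corrected={c['ce']} uncorrected={c['ue']}")
    if has_uncorrectable(counts):
        print("uncorrectable errors logged: investigate the affected DIMM")
```

These counters start from zero when the driver loads, so a clean reading right after boot proves little. They are more useful when watched over days of production load: a rising corrected-error count on one controller is often an early warning that a DIMM is degrading.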