A 2U virtualization host suddenly shuts itself off after running CPU-intensive workloads for about ten minutes. When power is restored, the server's management controller shows entries similar to the following:
15:42:07 CPU1_Temp 100 °C - critical threshold exceeded
15:42:08 System action: Thermal trip - power shutdown
The ambient rack temperature is a steady 22 °C, firmware is current, and no OS errors are logged. Which hardware issue is the MOST likely root cause of these unexpected shutdowns?
The processor's cooling fan has stopped working, severely reducing airflow through the heatsink
The motherboard's CMOS battery is nearing end-of-life and no longer holds BIOS settings
A DIMM is intermittently throwing correctable ECC errors under heavy memory load
The RAID controller's cache battery is degraded, forcing the array into write-through mode
The management controller reports that the CPU temperature exceeded its critical limit, which triggered the hardware's thermal-protection shutdown. With the rack ambient temperature at a steady 22 °C, the fault most likely to let a processor reach 100 °C after about ten minutes of heavy load is a failed or stalled fan (or fan cage) that normally moves air across the CPU heatsink. Without adequate airflow the heatsink cannot shed heat, so the server's safeguards first throttle the processor and then power the system off to prevent permanent damage.
The other options produce different symptoms: a depleted CMOS battery loses BIOS settings, a degraded RAID-cache battery forces write-through caching and slows writes, and a DIMM with intermittent ECC faults logs correctable memory errors. None of these conditions will drive the processor package to a critical thermal reading within minutes of heavy load.
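As a rough illustration of the reasoning above, the following Python sketch scans log lines for a critical temperature reading that is followed by a thermal-trip action. The regular expressions and the function name `find_thermal_trips` are assumptions modeled only on the two sample entries in the question; real management controllers format their event logs differently, so the patterns would need adjusting.

```python
import re

# Hypothetical patterns based on the two sample entries in the question;
# they are not taken from any specific vendor's log format.
TEMP_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}) (\S+) (\d+)\s*°C - critical threshold exceeded"
)
TRIP_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}) System action: Thermal trip")


def find_thermal_trips(lines):
    """Return (sensor, temp_c, alarm_time, trip_time) tuples for each
    critical temperature reading that is followed by a thermal trip."""
    trips = []
    pending = None  # most recent critical reading not yet paired with a trip
    for line in lines:
        line = line.strip()
        m = TEMP_RE.match(line)
        if m:
            pending = (m.group(2), int(m.group(3)), m.group(1))
            continue
        m = TRIP_RE.match(line)
        if m and pending:
            trips.append(pending + (m.group(1),))
            pending = None
    return trips


sample = [
    "15:42:07 CPU1_Temp 100 °C - critical threshold exceeded",
    "15:42:08 System action: Thermal trip - power shutdown",
]
print(find_thermal_trips(sample))
# → [('CPU1_Temp', 100, '15:42:07', '15:42:08')]
```

A match on a CPU temperature sensor, with no accompanying OS or memory errors, points toward a cooling fault rather than the battery or DIMM options.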