During an overnight stress test of a newly racked 2U application server, the system powers itself off roughly three minutes after CPU utilization reaches 100 percent. The BMC (baseboard management controller) logs show two events immediately before each shutdown: "Processor Thermal Trip" and "System fans set to maximum." After power is restored the machine boots normally, but the next stress test ends in the same shutdown. Which underlying hardware problem is the most likely cause of the unexpected shutdowns?
A newly installed 10 GbE PCIe network adapter is drawing more power than the slot budget allows.
The CPU heatsink is not fully seated or lacks proper thermal paste, preventing adequate heat transfer.
One of the redundant power-supply modules is developing an internal fault that causes brief brownouts under load.
The motherboard's CMOS battery is nearing end-of-life and can no longer retain RTC settings.
The log entries point to a CPU that is exceeding its thermal limit. Modern processors assert a thermal-trip signal when they reach a critical temperature, and the mainboard responds by shutting the server down to prevent damage. If the chassis fans have already ramped to full speed yet the CPU temperature still spikes, airflow is not the bottleneck. The most plausible explanation is poor heat transfer between the CPU lid and its cooler, commonly caused by a heatsink that is not fully latched or has insufficient thermal interface material (TIM).
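The effect of a poor thermal interface can be illustrated with the standard steady-state relation: CPU temperature is approximately ambient temperature plus dissipated power multiplied by the thermal resistance of the cooling path. The sketch below is only illustrative; the ambient temperature, package power, trip threshold, and both thermal-resistance values are assumed numbers chosen for the example, not figures from the scenario or from any specific processor.

```python
def steady_state_cpu_temp(ambient_c, power_w, theta_c_per_w):
    """Approximate steady-state temperature: ambient + power * thermal resistance."""
    return ambient_c + power_w * theta_c_per_w


# All values below are assumptions for illustration only.
AMBIENT_C = 25.0       # assumed inlet air temperature
CPU_POWER_W = 150.0    # assumed package power at 100 percent load
TRIP_LIMIT_C = 100.0   # assumed thermal-trip threshold; real limits vary by CPU

THETA_SEATED = 0.3     # assumed C/W for a well-seated heatsink with good TIM
THETA_LOOSE = 0.8      # assumed C/W for a loose heatsink or missing TIM

seated_temp = steady_state_cpu_temp(AMBIENT_C, CPU_POWER_W, THETA_SEATED)
loose_temp = steady_state_cpu_temp(AMBIENT_C, CPU_POWER_W, THETA_LOOSE)

for label, temp in (("seated", seated_temp), ("loose", loose_temp)):
    status = "would trip" if temp >= TRIP_LIMIT_C else "stays below limit"
    print(f"{label}: about {temp:.0f} C, {status}")
```

With these assumed numbers, the same load stays well under the limit with a properly seated cooler but exceeds it when the thermal resistance rises, even though fan speed (and therefore ambient airflow) is unchanged.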
A dying CMOS battery usually manifests as lost date/time or BIOS settings, not as thermal trips under load. A marginal power-supply unit can cause random power loss, but the BMC would log power-rail events rather than processor thermal faults. An add-in NIC that draws excess current could trip a PCIe power alarm or the PSU, again without triggering a CPU thermal warning. Therefore, an improperly seated heatsink (or missing thermal interface material) best matches the symptoms.
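The elimination reasoning above can be sketched as a small triage function that maps BMC event text to a coarse suspect category. This is a hypothetical helper, not a vendor tool: the function name, the category labels, and the keyword matching are all assumptions, and real BMC event strings differ between vendors.

```python
def classify_shutdown(events):
    """Return a coarse suspect category from BMC event descriptions.

    Hypothetical keyword matching; real event strings vary by vendor.
    """
    text = " | ".join(e.lower() for e in events)
    thermal_trip = "thermal trip" in text
    fans_maxed = "fans" in text and "maximum" in text
    power_fault = any(k in text for k in ("power supply", "voltage", "power rail"))

    if thermal_trip and fans_maxed:
        # Fans are already at full speed, so suspect heatsink seating or TIM.
        return "cpu_heat_transfer"
    if thermal_trip:
        # Thermal trip without a fan response: check fans and airflow first.
        return "cooling_or_airflow"
    if power_fault:
        return "power_delivery"
    return "unknown"


scenario = ["Processor Thermal Trip", "System fans set to maximum"]
print(classify_shutdown(scenario))
```

For the two events in the scenario, the function returns the heat-transfer category, mirroring the conclusion that the heatsink or its thermal interface material is the first thing to inspect.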