After running a 24-hour stress test on a DGX node, which two key metrics should the administrator verify to confirm system stability?
A 24-hour stress test (using tools like HPL or the NCCL tests) is designed to push the thermal and electrical limits of a DGX system. To verify a 'Pass,' the administrator must confirm that the hardware held its performance targets without degradation. Consistent GPU utilization >95% confirms that the workload saturated the compute cores for the entire duration. Crucially, the absence of thermal throttling events (verified via nvidia-smi -q -d PERFORMANCE) shows that the system's cooling solution (fans and heatsinks) is adequate for the environment. If throttling had occurred, the GPUs would have lowered their clocks to protect themselves, which points to a potential cooling failure or an environmental heat issue. Power consumption (Option D) and CPU usage (Option A) are worth recording, but they are not the primary indicators of stability under extreme AI training loads. Here, stability means the ability to sustain peak speeds for the full test duration without hardware-level interventions or slowdowns.
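The two checks described above can be sketched as a small log evaluator. This is a minimal sketch, not an official NVIDIA procedure. It assumes samples were collected during the test with something like `nvidia-smi --query-gpu=timestamp,index,utilization.gpu,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.sw_thermal_slowdown --format=csv,noheader,nounits -l 60`. The exact field names depend on the driver version (newer drivers may expose them as `clocks_event_reasons.*`), so check `nvidia-smi --help-query-gpu` on the node first. The function name `evaluate_samples` and the returned dictionary layout are hypothetical.

```python
import csv
import io


def evaluate_samples(csv_text, min_util=95.0):
    """Evaluate logged nvidia-smi samples against the two pass criteria.

    Each row is assumed to hold five fields:
    timestamp, GPU index, utilization (%), HW thermal slowdown, SW thermal slowdown.
    The slowdown fields are assumed to read "Active" or "Not Active".
    """
    low_util = []   # samples where utilization did not exceed the threshold
    throttled = []  # samples where a thermal slowdown was reported
    samples = 0

    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) != 5:
            continue  # skip blank or malformed lines
        timestamp, gpu, util, hw_thermal, sw_thermal = (f.strip() for f in row)
        samples += 1
        # The criterion is utilization strictly above the threshold (>95%).
        if float(util) <= min_util:
            low_util.append((timestamp, gpu, float(util)))
        # Tuple membership is an exact match, so "Not Active" does not count.
        if "Active" in (hw_thermal, sw_thermal):
            throttled.append((timestamp, gpu))

    return {
        "samples": samples,
        "low_util": low_util,
        "throttled": throttled,
        # An empty log is not treated as a pass.
        "passed": samples > 0 and not low_util and not throttled,
    }


if __name__ == "__main__":
    # Illustrative sample data, not output from a real DGX node.
    log = (
        "2024/01/01 00:00:00.000, 0, 98, Not Active, Not Active\n"
        "2024/01/01 00:01:00.000, 0, 80, Active, Not Active\n"
    )
    print(evaluate_samples(log))
```

Brief utilization dips at job start or between benchmark iterations are expected, so in practice the threshold check may need to ignore warm-up samples. That filtering is left out here to keep the sketch short.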