A data science team retrains a convolutional neural network every Monday using an AWS p4d.24xlarge instance (8 × A100 GPUs). The current job runs in 3 hours at the on-demand rate of about US$ 32.77 per hour, so each weekly training run costs roughly US$ 98. The VP of Finance requires that the compute bill for this job be cut by at least 50 %, while the ML lead insists that wall-clock training time must drop below 2 hours and model accuracy must not change. The training code saves checkpoints every five minutes and can resume automatically if the instance is reclaimed.
Which adjustment to the training pipeline best meets all of the new constraints with the least engineering effort?
Switch to a shallower CNN (e.g., ResNet-34 instead of ResNet-152) with early stopping on the current on-demand p4d setup.
Keep using the on-demand p4d instance but switch to mixed-precision training with gradient accumulation.
Run the job on p4d spot instances and enable mixed-precision training.
Replace the p4d with a c6i.32xlarge CPU-only instance and keep single-precision training.
Using p4d spot capacity typically lowers the hourly rate by 50-70 %. Combining that discount with mixed-precision (FP16/FP32) training, which commonly shortens GPU training time by 30-60 %, cuts the total bill from ~US$ 98 (3 h × US$ 32.77) to about US$ 23-39. For example, a 40 % time reduction (to 1.8 h) and a 60 % spot discount (to ~US$ 13 per hour) would result in a final cost of about US$ 24. This easily exceeds the 50 % savings target while finishing in well under 2 hours and leaving model architecture unchanged.
Keeping the on-demand instance and only enabling mixed-precision reduces runtime to ~1.8 hours, but the bill would still be roughly US$ 59, which fails to meet the 50% savings goal. Moving to a large CPU instance may be cheaper per hour but training would take far longer than 2 hours. Replacing the current network with a shallower model plus early stopping could hit the time goal, yet it risks accuracy changes and demands significant re-engineering. Therefore, choosing spot capacity together with mixed-precision training is the most effective, low-effort solution.
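The arithmetic behind this comparison can be checked with a short Python sketch. It uses only figures from the scenario and the explanation (US$ 32.77 per hour, a 3-hour baseline, a 40 % time reduction, a 60 % spot discount). The function name `evaluate` and its parameters are illustrative, and the sketch assumes that spot interruptions add no extra runtime, which may not hold in practice.

```python
# Figures taken from the scenario; everything else is an illustrative assumption.
ON_DEMAND_RATE = 32.77   # US$ per hour for p4d.24xlarge (on-demand)
BASELINE_HOURS = 3.0     # current wall-clock training time
MAX_HOURS = 2.0          # ML lead's limit (must be strictly below)
MIN_SAVINGS = 0.50       # VP of Finance's minimum cost reduction


def evaluate(time_reduction, spot_discount=0.0):
    """Return cost, runtime and constraint checks for one option.

    time_reduction: fraction of runtime removed (e.g. 0.4 for mixed precision)
    spot_discount:  fraction taken off the hourly rate (0.0 means on-demand)
    Assumes no time is lost to spot interruptions and restarts.
    """
    hours = BASELINE_HOURS * (1 - time_reduction)
    rate = ON_DEMAND_RATE * (1 - spot_discount)
    cost = hours * rate
    savings = 1 - cost / (BASELINE_HOURS * ON_DEMAND_RATE)
    return {
        "hours": hours,
        "cost": round(cost, 2),
        "savings": savings,
        "meets_time": hours < MAX_HOURS,
        "meets_cost": savings >= MIN_SAVINGS,
    }


# Mixed precision only, still on-demand: fast enough, but not cheap enough.
on_demand_mp = evaluate(time_reduction=0.4)

# Spot capacity plus mixed precision: meets both constraints.
spot_mp = evaluate(time_reduction=0.4, spot_discount=0.6)

for name, result in [("on-demand + MP", on_demand_mp), ("spot + MP", spot_mp)]:
    print(f"{name}: {result['hours']:.1f} h, US$ {result['cost']:.2f}, "
          f"savings {result['savings']:.0%}, "
          f"time ok={result['meets_time']}, cost ok={result['meets_cost']}")
```

With these inputs, the on-demand mixed-precision option saves only about 40 %, while the spot option saves about 76 %. Changing `time_reduction` or `spot_discount` shows how sensitive the result is to the assumed ranges.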