During an early design iteration, your team is fine-tuning a 250-million-parameter Transformer on a single 24 GB GPU. When you raise the mini-batch size from 16 to 64, training fails with an out-of-memory (OOM) error, and the budget does not allow additional hardware. You have one day to rerun the experiment and want to keep the architecture and hyperparameter search results unchanged. Which change to the training configuration is the most appropriate way to satisfy the resource constraint while minimizing impact on model accuracy and development time?
Enable mixed-precision (FP16/bfloat16) training with automatic loss scaling.
Pad every input sequence to exactly 512 tokens so tensor shapes are consistent across batches.
Double the model's hidden dimension but freeze all even-numbered layers to reduce gradient updates.
Replace the AdamW optimizer with standard SGD without momentum to eliminate optimizer state.
Enabling mixed-precision (FP16 or bfloat16) training stores activations and gradients, the tensors that dominate peak memory at a batch size of 64, in 16-bit formats, roughly halving their footprint while typically preserving accuracy when automatic loss scaling guards against FP16 underflow. The change is a few lines (often a single flag) in modern frameworks, so it can be implemented quickly.
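In PyTorch, for example, the change amounts to wrapping the forward pass in an autocast context and routing the backward pass through a gradient scaler. The sketch below uses a small stand-in model and synthetic data rather than the scenario's 250M-parameter Transformer:

```python
import torch
import torch.nn as nn

device = "cuda"  # mixed precision as sketched here targets a CUDA GPU

# Stand-in model and data; in the scenario these would be the 250M-parameter
# Transformer and its real training batches.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=4
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()            # automatic loss scaling for FP16

x = torch.randn(64, 128, 512, device=device)    # (batch, seq_len, d_model)

for step in range(3):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        out = model(x)                     # forward runs in 16-bit; activations shrink ~2x
        loss = out.float().pow(2).mean()   # placeholder loss
    scaler.scale(loss).backward()          # scale loss so small FP16 gradients don't underflow
    scaler.step(optimizer)                 # unscales gradients, skips step if inf/nan found
    scaler.update()                        # adapt the scale factor for the next iteration
```

On hardware that supports it, bfloat16 can be used instead of float16; because bfloat16 has the same exponent range as FP32, the loss-scaling step is then unnecessary.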
Replacing AdamW with plain SGD without momentum would remove Adam's two extra moment tensors and save some memory, but the gain is smaller than the roughly 2× reduction in activation and gradient memory from mixed precision, and the optimizer switch would likely require new hyperparameter tuning, risking schedule delays and accuracy loss.
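A quick back-of-the-envelope calculation (assuming FP32 optimizer state, as standard AdamW keeps) shows how much memory the switch would reclaim:

```python
# Rough memory arithmetic for a 250M-parameter model with FP32 optimizer state.
params = 250_000_000
bytes_fp32 = 4

adamw_state = params * bytes_fp32 * 2      # exp_avg + exp_avg_sq moment tensors
print(f"AdamW moments: {adamw_state / 2**30:.1f} GiB")   # ~1.9 GiB

# The activation and gradient savings from mixed precision grow with batch size
# and sequence length, so at batch size 64 they typically exceed this figure.
```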
Padding all sequences to a fixed 512-token length increases, rather than decreases, memory usage because every shorter sequence now consumes the maximum length.
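A small illustration (assuming a hidden size of 1024 and a batch whose sequences are mostly shorter than 512 tokens) of how much larger the fixed-length tensor becomes compared with padding to the longest sequence in each batch:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical batch: 64 sequences of varying lengths, all shorter than 512 tokens.
lengths = torch.randint(low=32, high=256, size=(64,))
seqs = [torch.randn(int(n), 1024) for n in lengths]    # (seq_len, hidden) per example

dynamic = pad_sequence(seqs, batch_first=True)         # padded to the longest in the batch
fixed = torch.zeros(64, 512, 1024)                     # padded to a fixed 512 tokens
fixed[:, : dynamic.shape[1], :] = dynamic

print(dynamic.shape, fixed.shape)
print(f"fixed-512 tensor is {fixed.numel() / dynamic.numel():.1f}x larger")
```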
Doubling the hidden dimension, even with every even-numbered layer frozen, roughly quadruples the size of the attention and feed-forward weight matrices and enlarges every activation tensor, increasing peak memory; freezing layers only skips their gradient updates, not the forward activations that trigger the OOM.
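A minimal sketch (assuming PyTorch and a CUDA device; the sizes are illustrative, not the scenario's 250M-parameter model) of why freezing does not relieve the activation pressure:

```python
import torch
import torch.nn as nn

# Small stand-in Transformer stack.
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True) for _ in range(8)]
).cuda()

# "Freeze all even-numbered layers": this only stops their parameter updates.
for i, layer in enumerate(layers):
    if i % 2 == 0:
        layer.requires_grad_(False)

x = torch.randn(64, 256, 512, device="cuda")
torch.cuda.reset_peak_memory_stats()

h = x
for layer in layers:
    h = layer(h)            # frozen layers still run and still produce full-size activations
loss = h.pow(2).mean()
loss.backward()             # backward through the trainable layers still needs those activations

print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```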
Therefore, mixed-precision training is the most effective, low-risk remedy for the GPU memory constraint in this scenario.