An enterprise team must adapt a 13-billion-parameter transformer-based large language model to its proprietary support-ticket corpus. Requirements are:
Keep all original model weights frozen for compliance review.
Add and train no more than about 1 % extra parameters to minimize GPU memory during training.
Once fine-tuning is complete, incur zero additional inference latency because any extra parameters will be merged into the base weights.
Which parameter-efficient adaptation technique best satisfies all three of these constraints by inserting trainable low-rank matrices into each transformer layer during fine-tuning?
Low-Rank Adaptation (LoRA)
Prefix tuning
Knowledge distillation into a smaller student model
Dynamic token pruning
Low-Rank Adaptation (LoRA) freezes the pretrained transformer weights and learns a pair of small rank-decomposed matrices that approximate the weight update for each targeted linear projection. Only these low-rank matrices are trained, so the number of new parameters is typically far below 1 % of the base model's parameter count, meeting the memory constraint. During deployment the low-rank update can be merged back into the frozen weight matrix, eliminating any runtime overhead and satisfying the zero-latency requirement.
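As a minimal illustration (the module name, rank, and scaling values below are assumptions for the sketch, not part of the question), a LoRA-wrapped linear layer in PyTorch might look like this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W + (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # keep pretrained weights frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.scale = alpha / r
        # Low-rank factors: A projects down to rank r, B projects back up.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init -> no change at start

    def forward(self, x):
        # Frozen path plus the scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        """Fold the low-rank update into the frozen weight for zero-overhead inference."""
        self.base.weight += self.scale * (self.lora_B @ self.lora_A)
        return self.base
```

Only lora_A and lora_B receive gradients during fine-tuning, and calling merge() before deployment folds the update into the original weight matrix, so inference runs through a single linear layer exactly as the explanation above describes.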
Prefix tuning also leaves the base weights untouched, but it prepends learnable key/value vectors that must still be processed at every inference step, adding a (small) latency penalty. Knowledge distillation trains a separate, smaller student network on the teacher's outputs rather than adapting the original model, so far more than 1 % of the parameters are trained and the deployed weights are no longer the compliance-reviewed originals. Dynamic token pruning accelerates inference by discarding low-importance tokens and does not adapt the model to new knowledge at all. Consequently, LoRA is the only choice that fulfills every stated requirement.
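A quick back-of-the-envelope count shows how comfortably LoRA fits the roughly 1 % budget; the hidden size, layer count, rank, and number of targeted projections below are illustrative assumptions for a 13B-class model, not figures from the question:

```python
hidden = 5120          # assumed hidden size for a 13B-class transformer
layers = 40            # assumed number of transformer layers
r = 8                  # LoRA rank
targets_per_layer = 2  # e.g. query and value projections

params_per_projection = 2 * hidden * r                 # A (r x hidden) + B (hidden x r)
lora_params = params_per_projection * targets_per_layer * layers
print(lora_params)                                     # 6,553,600 trainable parameters
print(lora_params / 13_000_000_000 * 100)              # ~0.05 % of 13 billion
```

Even with a larger rank or more targeted projections, the trainable fraction stays orders of magnitude below a full fine-tune, which is why LoRA satisfies the memory requirement so easily.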