Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method for large neural networks that injects trainable low-rank matrices into existing weight layers while freezing the original weights. Instead of updating the full d×k weight matrix, LoRA learns a low-rank decomposition ΔW = A·B with rank r ≪ min(d,k). This drastically reduces trainable parameters, GPU memory, and optimizer state while preserving the base model’s knowledge and stability. LoRA is commonly applied to attention and feed-forward projections in decoder-only Transformers, and can be combined with quantization (QLoRA) to fine-tune models stored in 4/8-bit while training small FP16 adapters. The approach enables multi-tenant adapters, rapid iteration on domain or task skills, and safe rollback by swapping or disabling adapters without touching the base checkpoint.
What is Low-Rank Adaptation (LoRA)?
LoRA reframes adaptation as learning a constrained update to frozen weights. For a frozen weight matrix W0, the forward pass uses W = W0 + (α/r)·A·B, where A ∈ R^{d×r}, B ∈ R^{r×k}, the rank r is small, and α scales the update. One of the two factors is initialized to zero so the adapted model starts out identical to the base, and only A and B receive gradients. Typical placements are the query/key/value and output projections and the MLP up/down matrices. Rank r (e.g., 4–64), learning rate, target layers, and scaling are tuned to trade off accuracy and cost. LoRA composes: multiple adapters can be stacked or merged, and a trained adapter can be folded into the base weights for deployment.
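The forward pass above can be sketched in a few lines of numpy. This is an illustrative toy (dimensions, seed, and names are arbitrary), not a training loop, but it shows the frozen base weight, the low-rank factors, the α/r scaling, and the parameter savings:

```python
import numpy as np

# Minimal LoRA forward sketch. W0 is frozen; only A and B would train.
d, k, r, alpha = 512, 512, 8, 16
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d, k))        # frozen base weight (d x k)
A = rng.standard_normal((d, r)) * 0.01  # trainable low-rank factor
B = np.zeros((r, k))                    # zero-init so the update starts at 0

def lora_forward(x):
    # x: (batch, d) -> (batch, k); base output plus scaled low-rank update
    delta = (alpha / r) * (A @ B)
    return x @ (W0 + delta)

x = rng.standard_normal((2, d))
# With B = 0, the adapted layer reproduces the frozen base exactly.
assert np.allclose(lora_forward(x), x @ W0)

full_params = d * k          # dense update would train 262,144 params
lora_params = d * r + r * k  # LoRA trains only 8,192 here (32x fewer)
```

Note the memory win scales with r: trainable parameters grow as r·(d+k) rather than d·k, which is why small ranks suffice to stay cheap.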
Why it matters and where it’s used
LoRA enables cost-effective, rapid fine-tuning on modest hardware, supporting on-prem, regulated, or multi-tenant scenarios. Teams ship many domain-specific variants (legal, customer support, coding) by swapping adapters, keep base models stable, and reduce drift risk. Paired with QLoRA, it unlocks 7–70B class models on a few GPUs.
Examples
- Domain adaptation: add a medical adapter for clinical Q&A without altering the base.
- Style/brand tuning: adapters that enforce tone and formatting for customer replies.
- Feature flags: enable/disable adapters per tenant or task at runtime.
- Merge for export: fold A·B into W0 to deploy a single dense checkpoint.
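The merge-for-export step in the last bullet is plain linear algebra: fold the scaled product A·B into W0 once, then ship the single dense matrix. A small numpy sketch (shapes and values are hypothetical) showing that the merged weight matches base-plus-adapter exactly:

```python
import numpy as np

# Fold a trained adapter into the base weight for deployment.
d, k, r, alpha = 64, 32, 4, 8
rng = np.random.default_rng(1)
W0 = rng.standard_normal((d, k))       # frozen base weight
A = rng.standard_normal((d, r))        # trained adapter factors
B = rng.standard_normal((r, k))

W_merged = W0 + (alpha / r) * (A @ B)  # single dense checkpoint

x = rng.standard_normal((3, d))
# Merged weights reproduce base + adapter output exactly, so the
# adapter files can be dropped after export.
assert np.allclose(x @ W_merged, x @ W0 + (alpha / r) * (x @ A @ B))
```

The trade-off: merging removes the runtime adapter-swap flexibility, so it suits single-task deployments rather than multi-tenant serving.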
FAQs
- How does LoRA compare to full fine-tuning? Similar quality at far lower memory/compute on many tasks, with easier rollback and versioning.
- What rank should I choose? Start small (r=8–16) and scale with task difficulty and data size; monitor validation.
- Does LoRA work with quantization? Yes—QLoRA stores frozen weights in 4/8-bit and trains FP16 adapters.
- Can I target only some layers? Yes; adapting the attention projections, optionally together with the MLP layers, often gives strong returns at lower cost.
- How do I deploy? Either load base + adapters at runtime or merge adapters into the weights for a single model.
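The "load base + adapters at runtime" option in the last answer amounts to keeping one frozen weight and a registry of (A, B) pairs, selecting per request. A hedged numpy sketch (tenant names, shapes, and the registry structure are all hypothetical) of that pattern, including disable-as-rollback:

```python
import numpy as np

# Per-tenant adapter swapping at inference time: one frozen base
# weight, a dict of (A, B) adapter pairs; no tenant means base behavior.
d, k, r, alpha = 64, 64, 4, 8
rng = np.random.default_rng(2)
W0 = rng.standard_normal((d, k))

adapters = {
    "legal":   (rng.standard_normal((d, r)), rng.standard_normal((r, k))),
    "support": (rng.standard_normal((d, r)), rng.standard_normal((r, k))),
}

def forward(x, tenant=None):
    if tenant is None:  # adapter disabled: safe rollback to the base
        return x @ W0
    A, B = adapters[tenant]
    return x @ W0 + (alpha / r) * (x @ A @ B)

x = rng.standard_normal((1, d))
assert np.allclose(forward(x), x @ W0)  # base untouched
assert not np.allclose(forward(x, "legal"), forward(x, "support"))
```

Because the base checkpoint is never modified, rollback is just routing requests with `tenant=None`, and new domain adapters are added by extending the registry.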
