Fine-Tuning LLMs on Custom Data: A Developer's Practical Guide
Fine-tuning is not always the answer, but when it is, the gains in output consistency, latency, and domain accuracy are substantial. This guide covers the decision framework, data preparation, and training pipeline.
When to Fine-Tune vs Prompt Engineer
Prompt engineer when: you need general capabilities, your task can be described in instructions, or you have fewer than 100 examples.
Fine-tune when: you need consistent output format, domain-specific terminology, reduced latency (shorter prompts), or the model consistently fails despite well-crafted prompts.
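Before training, data has to be put into a format the training framework accepts. As a hedged sketch: one common convention is JSONL with one chat-style record per line (the field names below follow the OpenAI-style "messages" format; adjust to whatever your framework expects). This also illustrates the "consistent output format" case above, since every assistant turn is strict JSON:

```python
import json

# Hypothetical example record for an invoice-extraction fine-tune.
# Field names ("messages", "role", "content") follow the common
# chat-format convention; your framework's schema may differ.
record = {
    "messages": [
        {"role": "system", "content": "Extract invoice fields as JSON."},
        {"role": "user", "content": "Invoice #4821, due 2024-03-01, total $950."},
        {"role": "assistant", "content": json.dumps(
            {"invoice_id": "4821", "due_date": "2024-03-01", "total_usd": 950})},
    ]
}

# One training example per line in a .jsonl file.
line = json.dumps(record)
print(line)
```

Keeping every assistant turn machine-parseable (here, valid JSON) is what teaches the model the rigid output format that prompting alone often fails to enforce.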
LoRA: Efficient Fine-Tuning
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load any causal LM as the base model (the name here is a placeholder).
base_model = AutoModelForCausalLM.from_pretrained("your-base-model")

lora_config = LoraConfig(
    r=16,              # rank of the low-rank update matrices
    lora_alpha=32,     # scaling factor; updates are scaled by alpha / r
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports the trainable fraction (~0.1% of total)
LoRA freezes the base weights and trains only the small adapter matrices, typically a fraction of a percent of total parameters (the exact share depends on the rank and on how many modules you target). A 7B model fine-tunes on a single GPU in hours instead of days, and the adapter checkpoint is only tens of megabytes, which makes it easy to version and deploy.
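The parameter count is easy to verify by hand. As illustrative arithmetic (assuming Llama-2-7B-style dimensions: 32 layers, hidden size 4096, and the r=16, four-target-module config above; with fewer target modules the fraction drops accordingly):

```python
# Assumed dimensions for a Llama-2-7B-style model; adjust for your model.
layers, hidden, r, n_proj = 32, 4096, 16, 4

# Each adapted (hidden x hidden) projection gains two low-rank matrices:
# A with shape (r, hidden) and B with shape (hidden, r).
lora_params = layers * n_proj * (r * hidden + hidden * r)
total_params = 6.7e9  # approximate parameter count of a 7B model

print(f"{lora_params:,} trainable params")           # 16,777,216
print(f"{lora_params / total_params:.2%} of total")  # 0.25%
print(f"{lora_params * 2 / 1e6:.0f} MB at fp16")     # 34 MB
```

So even with four target modules per layer, the adapter is on the order of tens of millions of parameters and tens of megabytes on disk, against billions of frozen base weights.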
Deployment with vLLM
Serve fine-tuned models with vLLM for production throughput. Its PagedAttention memory manager delivers up to 24x the throughput of naive HuggingFace Transformers inference, per the vLLM paper's benchmarks. Merge the LoRA weights into the base model (e.g. with PEFT's merge_and_unload) for zero adapter overhead at inference time.
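Why merging costs nothing at inference time: the adapter update (alpha / r) * B @ A is itself just a dense matrix, so folding it into the frozen weight once yields a single matmul per layer, identical in cost to the base model. A toy NumPy demonstration of that equivalence:

```python
import numpy as np

# Toy stand-in for one adapted projection layer; dimensions are arbitrary.
rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4
W = rng.standard_normal((d, d))   # frozen base weight
A = rng.standard_normal((r, d))   # LoRA down-projection
B = rng.standard_normal((d, r))   # LoRA up-projection
x = rng.standard_normal(d)

# Unmerged serving: base path plus adapter path on every forward pass.
adapter_out = W @ x + (alpha / r) * (B @ (A @ x))

# One-time merge: fold the adapter into the weight, then serve as usual.
W_merged = W + (alpha / r) * (B @ A)
merged_out = W_merged @ x

assert np.allclose(adapter_out, merged_out)  # identical outputs, one matmul
```

The same identity is what PEFT applies per layer when merging, after which the merged checkpoint is an ordinary model that vLLM can load with no LoRA-specific handling.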