Fine-Tuning LLMs on Custom Data: A Developer's Practical Guide
Fine-tuning is not always the answer, but when it is, the gains in output consistency, latency, and domain accuracy are substantial. This guide covers the decision framework, data preparation, and training pipeline.
When to Fine-Tune vs Prompt Engineer
Prompt engineer when: you need general capabilities, your task can be described in instructions, or you have fewer than 100 examples.
Fine-tune when: you need consistent output format, domain-specific terminology, reduced latency (shorter prompts), or the model consistently fails despite well-crafted prompts.
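Before training, data has to be put into a format the training framework accepts. As a hedged sketch: one common convention is JSONL with one chat-style record per line (the field names below follow the OpenAI-style "messages" format; adjust to whatever your framework expects). This also illustrates the "consistent output format" case above, since every assistant turn is strict JSON:

```python
import json

# Hypothetical example record for an invoice-extraction fine-tune.
# Field names ("messages", "role", "content") follow the common
# chat-format convention; your framework's schema may differ.
record = {
    "messages": [
        {"role": "system", "content": "Extract invoice fields as JSON."},
        {"role": "user", "content": "Invoice #4821, due 2024-03-01, total $950."},
        {"role": "assistant", "content": json.dumps(
            {"invoice_id": "4821", "due_date": "2024-03-01", "total_usd": 950})},
    ]
}

# One training example per line in a .jsonl file.
line = json.dumps(record)
print(line)
```

Keeping every assistant turn machine-parseable (here, valid JSON) is what teaches the model the rigid output format that prompting alone often fails to enforce.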
LoRA: Efficient Fine-Tuning
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load any causal LM as the base model (the name here is a placeholder).
base_model = AutoModelForCausalLM.from_pretrained("your-base-model")

lora_config = LoraConfig(
    r=16,              # rank of the low-rank update matrices
    lora_alpha=32,     # scaling factor; updates are scaled by alpha / r
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports the trainable fraction (~0.1% of total)
LoRA freezes the base weights and trains only the small adapter matrices, typically a fraction of a percent of total parameters (the exact share depends on the rank and on how many modules you target). A 7B model fine-tunes on a single GPU in hours instead of days, and the adapter checkpoint is only tens of megabytes, which makes it easy to version and deploy.
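The parameter count is easy to verify by hand. As illustrative arithmetic (assuming Llama-2-7B-style dimensions: 32 layers, hidden size 4096, and the r=16, four-target-module config above; with fewer target modules the fraction drops accordingly):

```python
# Assumed dimensions for a Llama-2-7B-style model; adjust for your model.
layers, hidden, r, n_proj = 32, 4096, 16, 4

# Each adapted (hidden x hidden) projection gains two low-rank matrices:
# A with shape (r, hidden) and B with shape (hidden, r).
lora_params = layers * n_proj * (r * hidden + hidden * r)
total_params = 6.7e9  # approximate parameter count of a 7B model

print(f"{lora_params:,} trainable params")           # 16,777,216
print(f"{lora_params / total_params:.2%} of total")  # 0.25%
print(f"{lora_params * 2 / 1e6:.0f} MB at fp16")     # 34 MB
```

So even with four target modules per layer, the adapter is on the order of tens of millions of parameters and tens of megabytes on disk, against billions of frozen base weights.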
Deployment with vLLM
Serve fine-tuned models with vLLM for production throughput. Its PagedAttention memory manager delivers up to 24x the throughput of naive HuggingFace Transformers inference, per the vLLM paper's benchmarks. Merge the LoRA weights into the base model (e.g. with PEFT's merge_and_unload) for zero adapter overhead at inference time.
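Why merging costs nothing at inference time: the adapter update (alpha / r) * B @ A is itself just a dense matrix, so folding it into the frozen weight once yields a single matmul per layer, identical in cost to the base model. A toy NumPy demonstration of that equivalence:

```python
import numpy as np

# Toy stand-in for one adapted projection layer; dimensions are arbitrary.
rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4
W = rng.standard_normal((d, d))   # frozen base weight
A = rng.standard_normal((r, d))   # LoRA down-projection
B = rng.standard_normal((d, r))   # LoRA up-projection
x = rng.standard_normal(d)

# Unmerged serving: base path plus adapter path on every forward pass.
adapter_out = W @ x + (alpha / r) * (B @ (A @ x))

# One-time merge: fold the adapter into the weight, then serve as usual.
W_merged = W + (alpha / r) * (B @ A)
merged_out = W_merged @ x

assert np.allclose(adapter_out, merged_out)  # identical outputs, one matmul
```

The same identity is what PEFT applies per layer when merging, after which the merged checkpoint is an ordinary model that vLLM can load with no LoRA-specific handling.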