Fine-Tuning LLMs on Custom Data: A Developer’s Practical Guide

Fine-tuning is not always the answer — but when it is, the results are transformative. This guide covers the decision framework, data preparation, and training pipeline.

When to Fine-Tune vs Prompt Engineer

Choose prompt engineering when: you need general capabilities, the task can be described in instructions, or you have fewer than roughly 100 examples.

Choose fine-tuning when: you need consistent output format, domain-specific terminology, reduced latency (shorter prompts), or the model consistently fails despite well-crafted prompts.
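If you do fine-tune, most training stacks expect the data as JSON Lines: one example per line. A minimal sketch of serializing records in a common prompt/completion shape (the field names vary by framework, and the example texts here are purely illustrative):

```python
import json

# Hypothetical training examples; a real dataset needs hundreds or more.
examples = [
    {"prompt": "Summarize: The quarterly report shows...",
     "completion": "Revenue grew 12% quarter over quarter."},
    {"prompt": "Classify the ticket: 'App crashes on login'",
     "completion": "bug"},
]

def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

jsonl = to_jsonl(examples)
print(jsonl.splitlines()[0])
```

Whatever schema your framework uses, keep it identical across every record; format inconsistencies in the training data show up directly as format inconsistencies in the fine-tuned model.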

LoRA: Efficient Fine-Tuning

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model (model name is illustrative)
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # scaling factor (effective scale = alpha / r)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params

With settings like these, LoRA trains well under 1% of the model's parameters. A 7B model fine-tunes on a single GPU in hours instead of days, and the adapter weights are only tens of megabytes, which makes them easy to version and deploy.
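The trainable fraction follows directly from the shapes: each adapted d×d projection gains two low-rank matrices, A (r×d) and B (d×r), so it contributes 2·r·d trainable parameters instead of d². A back-of-the-envelope check with Llama-7B-like dimensions (hidden size 4096, 32 layers, the four attention projections above; the numbers are illustrative, not exact for any specific checkpoint):

```python
# Back-of-the-envelope LoRA parameter count, using dimensions that roughly
# match a Llama-7B-style architecture (illustrative, not exact).
hidden = 4096      # model hidden size d
layers = 32        # transformer layers
modules = 4        # q_proj, v_proj, k_proj, o_proj
r = 16             # LoRA rank

# Each adapted d x d projection adds A (r x d) and B (d x r): 2 * r * d params.
lora_params = layers * modules * 2 * r * hidden
total_params = 7e9  # ~7B base parameters

fraction = lora_params / total_params
print(f"{lora_params / 1e6:.1f}M trainable params, {fraction:.2%} of the base model")
```

At fp16 that is roughly 2 bytes per parameter, which is where the "tens of megabytes" adapter size comes from.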

Deployment with vLLM

Serve fine-tuned models with vLLM for production throughput; its PagedAttention memory management delivers up to 24x higher throughput than naive Hugging Face inference in the vLLM team's benchmarks. Merge the LoRA weights into the base model so the adapter adds zero overhead at inference time.
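Merging is possible because the adapter is purely an additive update: the served weight is W' = W + (alpha/r) · B @ A, so after the addition no extra matrix multiply remains at inference. A tiny numeric sketch of that merge in pure Python (toy sizes; in practice this is a single library call such as peft's merge_and_unload):

```python
def matmul(X, Y):
    """Naive matrix multiply for small toy matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def merge_lora(W, A, B, alpha, r):
    """Fold the low-rank update into the base weight: W + (alpha/r) * B @ A."""
    BA = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy 2x2 base weight with a rank-1 adapter (r=1, alpha=2, so scale = 2.0).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]          # r x d
B = [[1.0], [3.0]]        # d x r
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # -> [[3.0, 4.0], [6.0, 13.0]]
```

Keep the unmerged adapter around as well: the merged checkpoint is what you serve, but the small adapter file is what you version and re-merge against future base models.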
