Fine-Tuning LLMs on Custom Data: A Developer’s Practical Guide

Large language models are impressive out of the box, but they struggle with domain-specific terminology, company-internal knowledge, and specialized output formats. Fine-tuning LLMs with custom data bridges that gap, transforming a general-purpose model into one that speaks your domain fluently. This guide walks you through when to fine-tune versus prompt-engineer, how to prepare training data, and how to train efficiently with LoRA and QLoRA techniques that don't require a GPU cluster.

When to Fine-Tune vs Prompt Engineering

Before investing weeks in fine-tuning, understand when it actually helps. Prompt engineering (few-shot examples, system prompts, chain-of-thought) solves 70% of customization needs. Moreover, it’s faster to iterate and doesn’t require training infrastructure. Fine-tuning is the right choice when you need consistent output formatting (structured JSON, specific templates), domain-specific language the base model gets wrong (medical codes, legal citations, internal jargon), latency reduction (a smaller fine-tuned model can outperform a larger prompted model), or cost reduction (shorter prompts because the model already knows context).

A practical decision framework: start with prompt engineering. If you’re spending more than 500 tokens on instructions and examples in every request, fine-tuning will save money and reduce latency. Additionally, if the model consistently fails on domain-specific tasks despite good prompts, fine-tuning is the answer.

# Decision: prompt engineering vs fine-tuning
# Rule of thumb: calculate your per-request overhead

# Prompt engineering approach — 600 token system prompt every call
system_prompt = """
You are a medical coding assistant. Use ICD-10 codes.
Format: {"code": "...", "description": "...", "confidence": 0.0-1.0}
Examples:
- "chest pain" -> {"code": "R07.9", "description": "Chest pain, unspecified", "confidence": 0.95}
- "type 2 diabetes" -> {"code": "E11.9", "description": "Type 2 diabetes mellitus", "confidence": 0.98}
... (20 more examples)
"""
# Cost: 600 tokens * 10,000 requests/day = 6M tokens/day in overhead

# Fine-tuned approach — model already knows the format and domain
# system_prompt = "Extract ICD-10 codes from clinical notes."
# Cost: 50 tokens * 10,000 requests/day = 500K tokens/day
# Savings: ~90% reduction in prompt tokens
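
The arithmetic above generalizes into a quick break-even sketch. The per-token price and request volume below are illustrative assumptions, not quotes for any particular provider:

```python
# Hypothetical break-even sketch: daily cost of prompt overhead.
# Price and volume are illustrative assumptions.

def daily_prompt_overhead(tokens_per_request: int, requests_per_day: int,
                          price_per_million_tokens: float) -> float:
    """Dollars per day spent on instruction/example tokens alone."""
    return tokens_per_request * requests_per_day * price_per_million_tokens / 1_000_000

prompted = daily_prompt_overhead(600, 10_000, 0.50)  # long system prompt
tuned = daily_prompt_overhead(50, 10_000, 0.50)      # short prompt after fine-tuning
print(f"Prompted: ${prompted:.2f}/day, fine-tuned: ${tuned:.2f}/day")
print(f"Daily savings: ${prompted - tuned:.2f}")
```

Plug in your own traffic and pricing; if the savings exceed your training and hosting costs within a few months, fine-tuning pays for itself.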

Data Preparation: Quality Over Quantity

Training data quality determines fine-tuning success more than anything else. You need 500-5,000 high-quality examples for most use cases — not millions. Each example should follow a consistent format with clear input-output pairs. Furthermore, diverse examples covering edge cases matter more than repetitive examples of common cases.

import json
import random

# Structure: each example is a conversation with system, user, assistant
training_data = []

# Load your domain-specific examples
raw_examples = load_from_database()  # Your labeled data

for example in raw_examples:
    training_data.append({
        "messages": [
            {"role": "system", "content": "You are a medical coding assistant. Return ICD-10 codes as JSON."},
            {"role": "user", "content": example["clinical_note"]},
            {"role": "assistant", "content": json.dumps(example["expected_codes"])}
        ]
    })

# Quality checks before training
def validate_dataset(data):
    issues = []
    for i, item in enumerate(data):
        msgs = item["messages"]
        # Check structure
        if len(msgs) < 2:
            issues.append(f"Example {i}: too few messages")
        # Check assistant response is valid JSON
        assistant_msg = next((m for m in msgs if m["role"] == "assistant"), None)
        if assistant_msg:
            try:
                json.loads(assistant_msg["content"])
            except json.JSONDecodeError:
                issues.append(f"Example {i}: invalid JSON in response")
        # Check user message is substantive
        user_msg = next((m for m in msgs if m["role"] == "user"), None)
        if user_msg and len(user_msg["content"].strip()) < 10:
            issues.append(f"Example {i}: user message too short")
    return issues

# Run the quality checks before splitting
issues = validate_dataset(training_data)
if issues:
    raise ValueError(f"{len(issues)} dataset problems, e.g. {issues[:3]}")

# Split: 90% train, 10% validation
random.shuffle(training_data)
split = int(len(training_data) * 0.9)
train_set = training_data[:split]
val_set = training_data[split:]

# Save both splits in JSONL format (one JSON object per line)
for path, subset in [("train.jsonl", train_set), ("val.jsonl", val_set)]:
    with open(path, "w") as f:
        for item in subset:
            f.write(json.dumps(item) + "\n")

print(f"Training: {len(train_set)}, Validation: {len(val_set)}")

Common data preparation mistakes include inconsistent formatting across examples, training on incorrect or ambiguous labels, insufficient diversity in inputs, and forgetting to validate output format. Consequently, always have domain experts review a sample of your training data before starting a training run.
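
One of those mistakes, duplicated or near-duplicated inputs, is cheap to catch automatically. A minimal sketch using a normalized content hash follows; the whitespace-and-case normalization rule is an assumption, so adapt it to your data:

```python
import hashlib

def find_duplicates(data):
    """Flag examples whose normalized user message already appeared.

    Returns (first_index, duplicate_index) pairs.
    """
    seen = {}
    dupes = []
    for i, item in enumerate(data):
        user = next(m for m in item["messages"] if m["role"] == "user")
        # Normalize: lowercase and collapse whitespace, then hash
        normalized = " ".join(user["content"].lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key in seen:
            dupes.append((seen[key], i))
        else:
            seen[key] = i
    return dupes
```

For fuzzier near-duplicates (paraphrases), embedding similarity works better, but an exact hash catches the most common copy-paste problems for free.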


LoRA and QLoRA: Efficient Fine-Tuning

Full fine-tuning updates all model parameters. For a 7B-parameter model, the weights alone take roughly 14GB in fp16, and gradients plus Adam optimizer states push the total far beyond a single consumer GPU. LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices, reducing trainable parameters by roughly 99%. QLoRA goes further by quantizing the base model to 4-bit, letting you fine-tune a 7B model on a single 24GB GPU.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

# QLoRA: 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True  # Nested quantization saves more memory
)

# Load base model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

# LoRA config: which layers to adapt
lora_config = LoraConfig(
    r=16,                    # Rank — higher = more capacity, more memory
    lora_alpha=32,           # Scaling factor (usually 2x rank)
    target_modules=[         # Which layers to add adapters to
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

# Check trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")
# With LoRA, typically well under 1% of parameters are trainable

The key LoRA hyperparameters are rank (r) and alpha. A rank of 8-16 works for most tasks. Higher ranks capture more complexity but use more memory. Alpha should typically be 2x the rank. Additionally, targeting all attention and MLP projection layers gives the best results for instruction-following tasks.
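
The memory cost of rank is easy to estimate: an adapter on a linear layer of shape (d_in, d_out) adds r × (d_in + d_out) parameters. A quick sketch, using Mistral-7B's published dimensions (hidden size 4096, 8 KV heads so k/v project to 1024, MLP width 14336, 32 layers):

```python
def lora_params(r: int, layer_shapes, num_layers: int) -> int:
    """Total adapter parameters: r * (d_in + d_out) per adapted linear layer."""
    return num_layers * sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)

# (d_in, d_out) for q, k, v, o, gate, up, down projections in Mistral-7B
shapes = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096),
          (4096, 14336), (4096, 14336), (14336, 4096)]

print(lora_params(16, shapes, 32))  # ~42M adapter params, <1% of 7.2B
print(lora_params(8, shapes, 32))   # halving r halves the adapter size
```

Because adapter size scales linearly with r, doubling the rank doubles adapter memory but says nothing about quality; tune r empirically against your validation set.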


Training Strategies and Hyperparameters

Training a fine-tuned model isn't like training from scratch. You're adjusting an already-capable model, so aggressive learning rates destroy existing capabilities. Start with a learning rate around 1e-4 to 2e-4 for LoRA (closer to 2e-5 for full fine-tuning), train for 2-5 epochs, and use a cosine learning rate scheduler. More importantly, monitor validation loss: if it starts increasing while training loss decreases, you're overfitting.

# Training configuration for QLoRA fine-tuning
training_config = SFTConfig(
    output_dir="./fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # Effective batch size: 16
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    fp16=True,
    max_seq_length=2048,
    packing=True,                      # Pack short examples together
)

trainer = SFTTrainer(
    model=model,
    args=training_config,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
)

trainer.train()

# Save the LoRA adapter (small — typically 50-200MB)
trainer.save_model("./fine-tuned-adapter")
# The base model stays unchanged — you only deploy the adapter
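
Deploying the adapter then looks roughly like the sketch below. The paths are placeholders; `merge_and_unload` is peft's call for folding adapter weights back into the base model so serving frameworks see a plain checkpoint:

```python
def load_for_inference(base_id: str, adapter_dir: str, merge: bool = True):
    """Load a base model, attach a trained LoRA adapter, optionally merge it.

    Imports are local so this sketch can be defined without a GPU present.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype=torch.float16, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    model = PeftModel.from_pretrained(base, adapter_dir)
    if merge:
        # Fold adapter weights into the base model: no adapter
        # indirection at serving time, at the cost of a full-size checkpoint
        model = model.merge_and_unload()
    return model, tokenizer

# Usage (downloads the base model, so run on a GPU machine):
# model, tok = load_for_inference("mistralai/Mistral-7B-v0.3", "./fine-tuned-adapter")
```

Keeping the adapter separate instead of merging lets you serve many task-specific adapters on top of one shared base model.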

Evaluation: Measuring Fine-Tuning Success

Loss curves alone don't tell you if fine-tuning worked. You need task-specific evaluation metrics. For classification tasks, measure accuracy and F1 score. For generation tasks, use human evaluation alongside automated metrics like ROUGE or BERTScore. Furthermore, always compare against the base model with your best prompt to quantify the actual improvement from fine-tuning.

# Evaluation framework for fine-tuned models
def evaluate_model(model, tokenizer, test_cases):
    results = {"correct": 0, "total": 0, "errors": []}

    for case in test_cases:
        input_text = case["input"]
        expected = case["expected_output"]

        # Generate prediction
        inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy decoding for reproducible evals
        prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Compare using task_match, a comparator you define for your domain
        if task_match(prediction, expected):
            results["correct"] += 1
        else:
            results["errors"].append({
                "input": input_text[:100],
                "expected": expected,
                "got": prediction
            })
        results["total"] += 1

    accuracy = results["correct"] / results["total"]
    print(f"Accuracy: {accuracy:.2%}")
    print(f"Errors: {len(results['errors'])} / {results['total']}")
    return results
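
The `task_match` comparator above is left task-specific on purpose. For the ICD-10 example running through this guide, a hedged sketch might compare the extracted code sets; the JSON shape is an assumption carried over from the earlier examples:

```python
import json
import re

def task_match(prediction: str, expected: str) -> bool:
    """Compare predicted and expected ICD-10 outputs by their code sets."""
    def extract_codes(text: str) -> set:
        # Pull the first JSON object or array out of the raw model output
        match = re.search(r"[\[{].*[\]}]", text, re.DOTALL)
        if not match:
            return set()
        try:
            parsed = json.loads(match.group())
        except json.JSONDecodeError:
            return set()
        items = parsed if isinstance(parsed, list) else [parsed]
        return {item.get("code") for item in items if isinstance(item, dict)}

    # Equal, non-empty code sets count as a match
    return extract_codes(prediction) == extract_codes(expected) != set()

# Formatting noise around the JSON is tolerated; a wrong code is not:
# task_match('{"code": "R07.9"}', 'Answer: {"code": "R07.9"}')  -> True
# task_match('{"code": "R07.9"}', '{"code": "E11.9"}')          -> False
```

Matching on the parsed code set rather than the raw string keeps the metric from penalizing harmless formatting differences.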

In conclusion, fine-tuning LLMs with custom data is a powerful technique when prompt engineering hits its limits. Start with high-quality data preparation, use QLoRA for cost-effective training on consumer hardware, and always measure improvement against a prompted baseline. The key insight is that 1,000 carefully curated examples often outperform 100,000 noisy ones.
