Prompt Caching and Optimization Techniques for LLM Applications

Prompt caching optimization is one of the most impactful techniques for reducing costs and latency in production LLM applications. As teams move from prototypes to production, API costs often become the biggest line item — and the majority of tokens sent to LLMs are repetitive system prompts, few-shot examples, and context documents that rarely change. Caching these effectively can reduce API spend by 80-90%.
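As a rough sanity check on the 80-90% figure, here is a back-of-envelope sketch. All volumes and prices are illustrative, and it assumes cache reads are billed at 10% of the base input-token rate:

```python
def monthly_input_cost(requests, tokens_per_request, price_per_mtok,
                       cached_fraction=0.0, cached_price_ratio=0.1):
    """Input-token cost in dollars, with a fraction of tokens served
    from cache at a discounted rate."""
    total_tokens = requests * tokens_per_request
    cached = total_tokens * cached_fraction
    fresh = total_tokens - cached
    return (fresh * price_per_mtok
            + cached * price_per_mtok * cached_price_ratio) / 1e6

# 1M requests/month, 5K input tokens each, $3 per million input tokens
baseline = monthly_input_cost(1_000_000, 5_000, 3.00)
cached = monthly_input_cost(1_000_000, 5_000, 3.00, cached_fraction=0.9)

print(f"baseline ${baseline:,.0f}/mo, with caching ${cached:,.0f}/mo, "
      f"savings {1 - cached / baseline:.0%}")
```

With 90% of tokens served from cache at a 90% discount, input spend drops roughly 80%, which is where the headline savings number comes from.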

This guide covers every layer of the caching stack, from provider-level KV cache reuse to application-level semantic caching, prefix matching, and prompt compression. You will learn practical techniques that work with Claude, GPT-4, and open-source models alike.

Understanding LLM Caching Layers

LLM caching operates at multiple levels, each offering different cost and latency benefits. Understanding these layers helps you design an optimal caching strategy for your specific use case.

At the lowest level, providers like Anthropic and OpenAI offer KV cache reuse — when the prefix of your prompt matches a recent request, the provider skips recomputing attention over those tokens. OpenAI applies this automatically for sufficiently long prompts, while Anthropic requires explicit cache breakpoints (covered below); in both cases it only helps when matching prefixes arrive within a short time window.

Above that, application-level caching stores complete responses keyed by the prompt (exact or semantic match). This layer is where you have the most control and can achieve the largest cost savings. Finally, prompt compression techniques reduce token count before sending, which benefits both cached and uncached requests.

[Figure: Multi-layer caching architecture for LLM applications]

Provider-Level Prompt Caching

Anthropic’s prompt caching feature (introduced in 2024) lets you explicitly mark sections of your prompt as cacheable. Cache reads are billed at 10% of the base input-token rate (cache writes carry a small premium), and requests that hit the cache also see significantly lower time-to-first-token latency.

import anthropic

client = anthropic.Anthropic()

# The system prompt and context are cached across requests
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal contract analyst...",
            "cache_control": {"type": "ephemeral"}  # Cache this block
        },
        {
            "type": "text",
            "text": LARGE_LEGAL_CONTEXT,  # 50K tokens of legal references
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Analyze clause 4.2 of the attached contract."}
    ]
)

# Check cache performance
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")

The key insight is to structure your prompts so that the cacheable portion (system prompt, reference documents, few-shot examples) comes first, followed by the variable user input. This maximizes prefix overlap and cache hit rates.
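One practical consequence: the static blocks must be byte-identical across calls, because even a whitespace change or a reordered document breaks the prefix match and forces a cache re-write. A minimal sketch of keeping the prefix stable (`STATIC_SYSTEM` and `build_request` are illustrative names; the model name comes from the example above):

```python
# Keep the cacheable prefix in one module-level constant so every
# request sends the exact same bytes; only the user message varies.
STATIC_SYSTEM = [
    {
        "type": "text",
        "text": "You are a legal contract analyst...",
        "cache_control": {"type": "ephemeral"},
    },
]

def build_request(user_input):
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "system": STATIC_SYSTEM,  # identical object every call
        "messages": [{"role": "user", "content": user_input}],  # varies
    }
```

Building requests through a single helper like this makes it hard for one code path to accidentally drift the prefix and silently lose cache hits.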

Application-Level Semantic Caching

Exact-match caching only helps when users ask the identical question. Semantic caching extends this by finding responses to similar questions using embedding similarity. This is particularly effective for customer support bots and FAQ systems where users ask the same questions in different ways.

import hashlib
import numpy as np
from redis import Redis
from sentence_transformers import SentenceTransformer

class SemanticCache:
    """Embedding-based response cache.

    Responses are stored in Redis with a TTL; the embedding index is
    kept in process memory, so it is lost on restart, and entries that
    expire in Redis simply miss on lookup.
    """

    def __init__(self, similarity_threshold=0.92):
        self.redis = Redis(host='localhost', port=6379, db=0)
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = similarity_threshold
        self.embeddings = []
        self.keys = []

    def _get_embedding(self, text):
        return self.encoder.encode(text, normalize_embeddings=True)

    def get(self, prompt):
        """Check if a semantically similar prompt was cached."""
        query_emb = self._get_embedding(prompt)

        if not self.embeddings:
            return None

        # Compute cosine similarities
        similarities = np.dot(self.embeddings, query_emb)
        max_idx = np.argmax(similarities)
        max_sim = similarities[max_idx]

        if max_sim >= self.threshold:
            cached = self.redis.get(self.keys[max_idx])
            if cached:
                return cached.decode('utf-8')
        return None

    def set(self, prompt, response, ttl=3600):
        """Cache a response with its embedding."""
        key = hashlib.sha256(prompt.encode()).hexdigest()
        self.redis.setex(key, ttl, response)
        self.embeddings.append(self._get_embedding(prompt))
        self.keys.append(key)


# Usage
cache = SemanticCache(similarity_threshold=0.92)

def ask_llm(prompt):
    # Check cache first
    cached = cache.get(prompt)
    if cached:
        print("Cache hit!")
        return cached

    # Call LLM
    response = call_claude(prompt)
    cache.set(prompt, response)
    return response

# These would hit the same cache entry:
ask_llm("How do I reset my password?")
ask_llm("I forgot my password, how to reset it?")
ask_llm("password reset process")

Choosing the Right Similarity Threshold

The similarity threshold is critical. Set it too low (e.g., 0.80) and you will serve cached responses to questions that merely look similar; set it too high (e.g., 0.98) and the cache rarely hits. In practice, 0.90-0.95 works well for most applications. You should also track cache hit rates and sample cached responses weekly to confirm quality stays high.
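A minimal sketch of that monitoring loop — `CacheMetrics` is a hypothetical helper that counts hits and misses and retains a small random sample of served cache entries for a weekly quality review:

```python
import random
from collections import Counter

class CacheMetrics:
    """Track cache hit rate and keep a small random sample of
    (prompt, cached_response) pairs for manual quality review."""

    def __init__(self, sample_rate=0.01, max_samples=200):
        self.counts = Counter()
        self.sample_rate = sample_rate
        self.max_samples = max_samples
        self.samples = []

    def record(self, prompt, response, hit):
        self.counts["hit" if hit else "miss"] += 1
        # Reservoir-lite sampling: keep a capped random subset of hits.
        if (hit and len(self.samples) < self.max_samples
                and random.random() < self.sample_rate):
            self.samples.append((prompt, response))

    def hit_rate(self):
        total = self.counts["hit"] + self.counts["miss"]
        return self.counts["hit"] / total if total else 0.0
```

If the hit rate drifts down, the threshold may be too strict; if sampled responses look wrong for their prompts, it is too loose.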

[Figure: Semantic similarity matching for intelligent response caching]

Prompt Compression Techniques

Even with caching, reducing the token count of uncached prompts directly reduces costs. Several techniques help compress prompts without losing important information.

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)

# Original prompt: 2000 tokens
original_prompt = """
Given the following customer support conversation history...
[long context with 50 messages]
...
Please summarize the key issues and suggest resolutions.
"""

# Compressed prompt: ~600 tokens (70% reduction)
compressed = compressor.compress_prompt(
    original_prompt,
    rate=0.3,  # Keep 30% of tokens
    force_tokens=["summarize", "issues", "resolutions"],  # Never remove these
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']:.2f}")

# Send compressed prompt to LLM
response = call_claude(compressed['compressed_prompt'])

Prompt Caching Optimization in Production Architecture

A production caching system combines multiple layers. Here is a reference architecture that handles thousands of requests per second while minimizing LLM API costs:

import hashlib

from redis import Redis

# TokenBucketLimiter and CacheResult are application-specific helpers
# (rate limiting and a response wrapper), assumed to be defined elsewhere.

class ProductionLLMGateway:
    def __init__(self):
        self.exact_cache = Redis(host='cache-host', port=6379, db=0)
        self.semantic_cache = SemanticCache(similarity_threshold=0.93)
        self.rate_limiter = TokenBucketLimiter(tokens_per_sec=50000)

    def _hash_prompt(self, prompt):
        return hashlib.sha256(prompt.encode()).hexdigest()

    async def complete(self, request):
        # Layer 1: Exact match (fastest)
        exact_key = self._hash_prompt(request.prompt)
        cached = self.exact_cache.get(exact_key)
        if cached:
            return CacheResult(cached, source="exact")

        # Layer 2: Semantic match
        semantic = self.semantic_cache.get(request.prompt)
        if semantic:
            return CacheResult(semantic, source="semantic")

        # Layer 3: Provider with prompt caching headers
        await self.rate_limiter.acquire(request.estimated_tokens)
        response = await self._call_provider(request)

        # Store in both caches
        self.exact_cache.setex(exact_key, 3600, response.text)
        self.semantic_cache.set(request.prompt, response.text)

        return response

When NOT to Use Prompt Caching

Prompt caching is not appropriate for all LLM use cases. Avoid caching when responses must reflect real-time data — stock prices, live inventory, or breaking news. Cached responses become stale immediately in these scenarios. Additionally, creative writing tasks where variety is desired should bypass caching since users expect different outputs for the same prompt.

Semantic caching specifically should be avoided for high-stakes decisions (medical, legal, financial) where even a 5% semantic difference could lead to a meaningfully wrong cached response. In these domains, always call the LLM fresh and use exact-match caching only for identical inputs.
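A minimal exact-match cache along those lines, which hashes the full request (prompt, model, and sampling parameters) so that only byte-identical requests share an entry — an in-memory dict stands in for Redis here:

```python
import hashlib
import json

class ExactMatchCache:
    """Exact-match response cache for high-stakes domains: there is no
    similarity threshold, so a 'close enough' answer can never be
    served. Swap the dict for Redis setex/get in production."""

    def __init__(self):
        self.store = {}

    def _key(self, prompt, model, **params):
        # Canonical JSON so key order and whitespace cannot vary.
        payload = json.dumps(
            {"prompt": prompt, "model": model, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, prompt, model, **params):
        return self.store.get(self._key(prompt, model, **params))

    def set(self, prompt, model, response, **params):
        self.store[self._key(prompt, model, **params)] = response
```

Including the model and parameters in the key matters: the same prompt sent to a different model, or at a different temperature, should not reuse the cached answer.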

[Figure: Production LLM gateway with multi-layer caching for cost optimization]

Key Takeaways

  • Prompt caching optimization can reduce LLM API costs by 80-90% by eliminating redundant token processing
  • Structure prompts with static content first (system prompts, context) and variable content last to maximize provider-level cache hits
  • Semantic caching with embedding similarity (threshold 0.90-0.95) catches paraphrased questions that exact matching misses
  • Prompt compression using LLMLingua can reduce token counts by 50-70% without meaningful quality loss
  • Layer multiple caching strategies: exact match first, then semantic match, then compressed LLM call
