Spring AI with RAG and Vector Search: Building Intelligent Java Applications

Spring AI brings retrieval-augmented generation (RAG) and vector search to the Java ecosystem. Spring AI 1.0 provides first-class support for connecting language models with your proprietary data through vector databases, making it straightforward to add AI features to existing Spring Boot applications.

This guide walks you through building a production RAG pipeline — from ingesting documents and generating embeddings to querying vector stores and augmenting LLM prompts with relevant context. If your team already runs Spring Boot in production, Spring AI is the fastest path to adding intelligent search and Q&A capabilities.

Understanding the RAG Architecture

RAG solves a fundamental limitation of language models: they only know what they were trained on. By retrieving relevant documents from your data and injecting them into the prompt, you get accurate answers grounded in your specific content.

[Figure: RAG pipeline overview. Ingest documents, generate embeddings, retrieve context, augment prompts.]
RAG Pipeline Flow:

1. INGESTION (offline)
   Documents → Chunking → Embedding Model → Vector Database

2. RETRIEVAL (runtime)
   User Query → Embedding → Vector Similarity Search → Top-K Documents

3. AUGMENTATION (runtime)
   System Prompt + Retrieved Documents + User Query → LLM → Response
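
The similarity search in step 2 ranks stored chunks by how close their embedding vectors are to the query's embedding, most commonly by cosine similarity. A toy illustration of that scoring (hand-made 3-dimensional vectors stand in for the 1536-dimensional embeddings a real model produces; class and method names here are illustrative, not Spring AI API):

```java
// Illustration only: how a vector store ranks chunks in step 2.
// Real embeddings are model-generated (e.g. 1536 floats for
// text-embedding-3-small); these tiny vectors just show the math.
public class CosineSimilarityDemo {

    // Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] query = {1.0, 0.0, 1.0};
        double[] docA  = {0.9, 0.1, 0.8};   // points in a similar direction
        double[] docB  = {0.0, 1.0, 0.0};   // orthogonal: unrelated content

        System.out.printf("docA score: %.3f%n", cosine(query, docA)); // close to 1
        System.out.printf("docB score: %.3f%n", cosine(query, docB)); // 0
    }
}
```

The `similarity-threshold` you will see later in SearchRequest cuts off results whose score falls below a floor like 0.7, so unrelated chunks such as docB never reach the prompt.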

Setting Up Spring AI with PGVector

<!-- pom.xml dependencies -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-openai</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
</dependency>
# application.yml
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o
          temperature: 0.3
      embedding:
        options:
          model: text-embedding-3-small
    vectorstore:
      pgvector:
        dimensions: 1536
        index-type: HNSW
        distance-type: COSINE_DISTANCE
  datasource:
    url: jdbc:postgresql://localhost:5432/aiapp
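
One prerequisite the configuration above leaves implicit: PGVector needs the `vector` extension and an embeddings table to exist before the store can write to it. Spring AI can create both at startup if you opt in via the `initialize-schema` property. The fragment below is an addition to the `pgvector` block above; verify the property against the Spring AI version you are running:

```yaml
spring:
  ai:
    vectorstore:
      pgvector:
        initialize-schema: true  # create the vector extension and table at startup
```

In production environments where the application user lacks DDL privileges, leave this off and have a DBA run `CREATE EXTENSION vector` and the table creation as a migration instead.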

Document Ingestion Pipeline

@Service
public class DocumentIngestionService {

    private static final Logger log =
        LoggerFactory.getLogger(DocumentIngestionService.class);

    private final VectorStore vectorStore;
    private final TokenTextSplitter textSplitter;

    public DocumentIngestionService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
        this.textSplitter = new TokenTextSplitter(
            800,    // target chunk size (tokens)
            200,    // min chunk size (characters)
            5,      // min chunk length to embed
            10000,  // max number of chunks per document
            true    // keep separators
        );
    }

    public void ingestDocuments(List<Resource> resources) {
        var documentReader = new TikaDocumentReader(resources);
        var documents = documentReader.get();

        // Split into chunks with metadata
        var chunks = textSplitter.apply(documents);

        // Add custom metadata for filtering
        chunks.forEach(chunk -> {
            chunk.getMetadata().putIfAbsent("source", "unknown");
            chunk.getMetadata().put("ingested_at",
                Instant.now().toString());
        });

        // Store embeddings in PGVector
        vectorStore.add(chunks);
        log.info("Ingested {} chunks from {} documents",
            chunks.size(), documents.size());
    }

    // Ingest from various sources
    public void ingestPDF(String path) {
        ingestDocuments(List.of(new FileSystemResource(path)));
    }

    public void ingestWebPage(String url) {
        var reader = new JsoupDocumentReader(url);
        var docs = textSplitter.apply(reader.get());
        vectorStore.add(docs);
    }
}
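
To build intuition for what the chunking step does, here is a deliberately naive word-window splitter with overlap, a common RAG chunking strategy. This is not how TokenTextSplitter works internally (Spring AI counts real model tokens, and TokenTextSplitter exposes different knobs); the class name and logic here are illustrative only:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Intuition-building sketch only. Splits on whitespace "tokens" with a
// fixed window size and overlap; real splitters count model tokens.
public class NaiveWindowSplitter {

    public static List<String> split(String text, int windowSize, int overlap) {
        String[] tokens = text.trim().split("\\s+");
        List<String> chunks = new ArrayList<>();
        int step = windowSize - overlap;                 // how far each window advances
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + windowSize, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) break;             // last window reached the end
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Overlap keeps context that straddles a chunk boundary retrievable
        System.out.println(split("a b c d e f g h", 4, 2));
        // [a b c d, c d e f, e f g h]
    }
}
```

Notice how tokens `c d` and `e f` each appear in two chunks: overlap trades storage for recall, so a sentence that spans a boundary can still be retrieved whole.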

Building the RAG Query Service

@Service
public class RagQueryService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public RagQueryService(ChatClient.Builder builder,
                           VectorStore vectorStore) {
        this.chatClient = builder
            .defaultSystem("""
                You are a helpful assistant that answers questions
                based on the provided context. If the context doesn't
                contain relevant information, say so honestly.
                Always cite which document your answer comes from.
                """)
            .build();
        this.vectorStore = vectorStore;
    }

    public String query(String userQuestion) {
        // Retrieve relevant documents
        var searchRequest = SearchRequest.builder()
            .query(userQuestion)
            .topK(5)
            .similarityThreshold(0.7)
            .build();

        var relevantDocs = vectorStore.similaritySearch(searchRequest);

        // Build context from retrieved documents
        String context = relevantDocs.stream()
            .map(doc -> "Source: %s%nContent: %s".formatted(
                doc.getMetadata().get("source"),
                doc.getText()))
            .collect(Collectors.joining("\n\n---\n\n"));

        // Augmented prompt with retrieved context
        return chatClient.prompt()
            .user(u -> u.text("""
                Context:
                {context}

                Question: {question}

                Answer based on the context above:
                """)
                .param("context", context)
                .param("question", userQuestion))
            .call()
            .content();
    }
}

[Figure: Monitoring RAG pipeline performance and retrieval quality.]

Advanced Patterns

Metadata Filtering

// Filter by metadata during retrieval
var b = new FilterExpressionBuilder();
var searchRequest = SearchRequest.builder()
    .query(question)
    .topK(5)
    .filterExpression(b.and(
            b.eq("department", "engineering"),
            b.gte("ingested_at", "2026-01-01"))
        .build())
    .build();

Streaming Responses

@GetMapping(value = "/api/ask", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> streamAnswer(@RequestParam String question) {
    var context = retrieveContext(question);
    return chatClient.prompt()
        .user(buildPrompt(context, question))
        .stream()
        .content();
}

[Figure: Production RAG application with Spring AI serving real-time queries.]

When NOT to Use RAG

RAG adds complexity and latency. Skip it when the LLM already knows the answer (general knowledge), when your data changes faster than you can re-index, or when exact keyword search (e.g. Elasticsearch) is sufficient. Before committing, evaluate whether a plain prompt or a fine-tuned model would serve your use case better.

Key Takeaways

Spring AI brings RAG-powered intelligent search to Java applications with minimal boilerplate, and the PGVector integration means you can add vector search to an existing PostgreSQL database. Start with a small document corpus, measure retrieval quality, and expand as you validate the approach with real users.
