
Building RAG Applications with Semantic Kernel and Azure OpenAI

January 31, 2026

Retrieval-Augmented Generation (RAG) lets you build AI applications that work with private data by retrieving relevant context at query time instead of fine-tuning. Let's build a production RAG system using Semantic Kernel and Azure OpenAI — the stack where .NET teams find the most success.

The RAG Pattern

RAG has three phases: Ingestion (chunk documents → embed → store in vector DB), Retrieval (embed query → search for similar chunks), and Generation (inject chunks into prompt → LLM generates grounded response).

User Query → [Embed] → [Vector Search] → [Top-K Chunks]
                                              ↓
                                    [Prompt Template + LLM]
                                              ↓
                                      Grounded Response
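The ingestion phase isn't shown in the endpoint code below, so here is a minimal sketch. The `DocumentChunk` model and `IngestionService` are illustrative; the attributes come from the Microsoft.Extensions.VectorData abstractions that Semantic Kernel's vector store connectors build on, and names may shift between preview versions:

```csharp
using Microsoft.Extensions.VectorData;
using Microsoft.SemanticKernel.Embeddings;

// A chunk as stored in the vector index. Dimensions must match the
// embedding model (3072 for text-embedding-3-large).
public class DocumentChunk
{
    [VectorStoreRecordKey]
    public string Id { get; set; } = string.Empty;

    [VectorStoreRecordData]
    public string DocumentTitle { get; set; } = string.Empty;

    [VectorStoreRecordData]
    public string Text { get; set; } = string.Empty;

    [VectorStoreRecordVector(Dimensions: 3072)]
    public ReadOnlyMemory<float> Embedding { get; set; }
}

public class IngestionService(
    IVectorStore vectorStore,
    ITextEmbeddingGenerationService embeddingService)
{
    public async Task IngestAsync(string documentTitle, IEnumerable<string> chunks)
    {
        var collection = vectorStore
            .GetCollection<string, DocumentChunk>("knowledge-base");
        await collection.CreateCollectionIfNotExistsAsync();

        foreach (var (text, index) in chunks.Select((c, i) => (c, i)))
        {
            // Embed each chunk with the same model used for queries.
            var embedding = await embeddingService.GenerateEmbeddingAsync(text);
            await collection.UpsertAsync(new DocumentChunk
            {
                Id = $"{documentTitle}-{index}",
                DocumentTitle = documentTitle,
                Text = text,
                Embedding = embedding,
            });
        }
    }
}
```

The key point is symmetry: the same embedding model must produce both the stored chunk vectors and the query vector at retrieval time.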

Setting Up Semantic Kernel

using Microsoft.SemanticKernel;

var builder = WebApplication.CreateBuilder(args);

// Register the kernel plus Azure OpenAI chat and embedding services.
builder.Services.AddKernel()
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-4o",
        endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
        apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!)
    .AddAzureOpenAITextEmbeddingGeneration(
        deploymentName: "text-embedding-3-large",
        endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
        apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!);

// Azure AI Search serves as the vector store.
builder.Services.AddAzureAISearchVectorStore(
    new Uri(builder.Configuration["AzureSearch:Endpoint"]!),
    new Azure.AzureKeyCredential(
        builder.Configuration["AzureSearch:ApiKey"]!));

builder.Services.AddScoped<RagService>();
var app = builder.Build();

app.MapPost("/api/ask", async (AskRequest request, RagService ragService) =>
{
    var response = await ragService.AskAsync(request.Question);
    return Results.Ok(new { answer = response.Answer, sources = response.Sources });
});

app.Run();

public record AskRequest(string Question);

Use text-embedding-3-large (3072 dimensions) for the best retrieval quality, or text-embedding-3-small (1536 dimensions) to save on cost. Avoid the older text-embedding-ada-002 for new projects.

Retrieval and Response Generation

The core RAG service ties embedding, search, and generation together:

public class RagService
{
    private readonly Kernel _kernel;
    private readonly IVectorStore _vectorStore;
    private readonly ITextEmbeddingGenerationService _embeddingService;

    public RagService(
        Kernel kernel,
        IVectorStore vectorStore,
        ITextEmbeddingGenerationService embeddingService)
    {
        _kernel = kernel;
        _vectorStore = vectorStore;
        _embeddingService = embeddingService;
    }

    public async Task<RagResponse> AskAsync(
        string question, int topK = 5, float minRelevance = 0.75f)
    {
        // Embed the query with the same model used at ingestion time.
        var queryEmbedding = await _embeddingService
            .GenerateEmbeddingAsync(question);

        var collection = _vectorStore
            .GetCollection<string, DocumentChunk>("knowledge-base");
        var searchResults = await collection.VectorizedSearchAsync(
            queryEmbedding, new VectorSearchOptions { Top = topK });

        // Discard weak matches so the prompt only carries relevant context.
        var relevantChunks = new List<DocumentChunk>();
        await foreach (var result in searchResults.Results)
            if (result.Score >= minRelevance)
                relevantChunks.Add(result.Record);

        if (relevantChunks.Count == 0)
            return new RagResponse("Not enough information in the documentation.", []);

        var context = string.Join("\n\n---\n\n",
            relevantChunks.Select(c => $"[Source: {c.DocumentTitle}]\n{c.Text}"));

        var prompt = $"""
            Answer based ONLY on the provided context. Cite sources.
            ## Context
            {context}
            ## Question
            {question}
            """;

        var chatService = _kernel.GetRequiredService<IChatCompletionService>();
        var chatHistory = new ChatHistory();
        chatHistory.AddUserMessage(prompt);

        // Low temperature keeps the answer grounded in the retrieved context.
        var response = await chatService.GetChatMessageContentAsync(
            chatHistory,
            new AzureOpenAIPromptExecutionSettings
            {
                Temperature = 0.1f,
                MaxTokens = 1024,
            });

        return new RagResponse(
            response.Content ?? "Unable to generate a response.",
            relevantChunks.Select(c => c.DocumentTitle).Distinct().ToList());
    }
}

public record RagResponse(string Answer, List<string> Sources);

Improving Retrieval Quality

Pure vector search gets you 70% of the way. To reach 90%+, use hybrid search (vector + keyword) and query expansion — ask the LLM to rephrase the query into 2-3 alternative phrasings before searching. This single technique can improve answer accuracy by ~25%.
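Query expansion can be a small helper on top of the same chat service. A sketch, assuming the `RagService` wiring above; the prompt wording and the three-variant count are arbitrary choices:

```csharp
// Ask the model for alternative phrasings of the user's question,
// then run a vector search for each and merge the results.
public async Task<List<string>> ExpandQueryAsync(string question)
{
    var chatService = _kernel.GetRequiredService<IChatCompletionService>();
    var chatHistory = new ChatHistory();
    chatHistory.AddUserMessage($"""
        Rewrite the question below as 3 alternative search queries
        that use different wording. One per line, no numbering.

        Question: {question}
        """);

    var reply = await chatService.GetChatMessageContentAsync(chatHistory);

    // Always keep the original question as the first query.
    var queries = new List<string> { question };
    queries.AddRange((reply.Content ?? string.Empty)
        .Split('\n', StringSplitOptions.RemoveEmptyEntries
                   | StringSplitOptions.TrimEntries));
    return queries;
}
```

Embed and search each expanded query separately, then deduplicate the merged chunks by key, keeping the highest score per chunk, before building the prompt.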

Chunking matters most: Start with 512-token paragraph-aware chunks with 50-token overlap. For code-heavy content, split on function/class boundaries instead.
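A paragraph-aware chunker along those lines can be sketched as follows. This uses a rough four-characters-per-token heuristic rather than a real tokenizer; swap in something like Microsoft.ML.Tokenizers for accurate counts:

```csharp
using System.Text;

public static class Chunker
{
    // Splits text on blank lines and packs paragraphs into ~512-token
    // chunks, carrying ~50 tokens of trailing overlap into the next chunk.
    public static IEnumerable<string> ChunkByParagraph(
        string text, int maxTokens = 512, int overlapTokens = 50)
    {
        const int CharsPerToken = 4; // rough heuristic, not a tokenizer
        int maxChars = maxTokens * CharsPerToken;
        int overlapChars = overlapTokens * CharsPerToken;

        var paragraphs = text.Split("\n\n",
            StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);
        var current = new StringBuilder();

        foreach (var paragraph in paragraphs)
        {
            if (current.Length > 0 && current.Length + paragraph.Length > maxChars)
            {
                var chunk = current.ToString().TrimEnd();
                yield return chunk;

                // Seed the next chunk with the tail of this one for continuity.
                current.Clear();
                current.Append(chunk[Math.Max(0, chunk.Length - overlapChars)..]);
                current.Append("\n\n");
            }
            current.Append(paragraph).Append("\n\n");
        }

        if (current.Length > 0)
            yield return current.ToString().TrimEnd();
    }
}
```

Splitting on blank lines keeps paragraphs intact; the overlap ensures a sentence that straddles a chunk boundary is retrievable from either side.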

Production tips:

  • Keep context to 3-5 chunks (~2000-3000 tokens) — more context means more noise
  • Cache embeddings with a 24-hour TTL to reduce costs
  • Build an evaluation dataset early — you can't improve what you can't measure
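The embedding cache from the tips above can be a thin decorator over the embedding service. A sketch using IMemoryCache with the 24-hour TTL from the bullet; the cache-key scheme is an arbitrary choice:

```csharp
using Microsoft.Extensions.Caching.Memory;
using Microsoft.SemanticKernel.Embeddings;

// Caches query embeddings so repeated questions skip the embedding call.
public class CachedEmbeddingService(
    ITextEmbeddingGenerationService inner,
    IMemoryCache cache)
{
    public async Task<ReadOnlyMemory<float>> GetEmbeddingAsync(string text)
    {
        // Normalize the key so trivial casing/whitespace differences still hit.
        var key = $"embedding:{text.Trim().ToLowerInvariant()}";

        if (cache.TryGetValue(key, out ReadOnlyMemory<float> cached))
            return cached;

        var embedding = await inner.GenerateEmbeddingAsync(text);
        cache.Set(key, embedding, TimeSpan.FromHours(24));
        return embedding;
    }
}
```

For multi-instance deployments, the same decorator shape works over a distributed cache instead of in-process memory.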

Key Takeaways

  1. Chunking is the most important decision. Start with 512-token chunks and iterate based on retrieval quality.
  2. Use hybrid search from day one — pure vector search misses exact-match scenarios.
  3. Keep temperature low (0.1) for grounded, factual responses.
  4. Query expansion delivers outsized impact with minimal effort.
  5. Cache embeddings aggressively — the same queries hit your system repeatedly.
  6. Build evaluation infrastructure early — RAG quality is hard to judge manually.


Ajit Gangurde

Software Engineer II at Microsoft | 15+ years in .NET & Azure