Building RAG Applications with Semantic Kernel and Azure OpenAI
Retrieval-Augmented Generation (RAG) lets you build AI applications that work with private data by retrieving relevant context at query time instead of fine-tuning. Let's build a production RAG system using Semantic Kernel and Azure OpenAI — the stack where .NET teams find the most success.
The RAG Pattern
RAG has three phases: Ingestion (chunk documents → embed → store in vector DB), Retrieval (embed query → search for similar chunks), and Generation (inject chunks into prompt → LLM generates grounded response).
User Query → [Embed] → [Vector Search] → [Top-K Chunks]
                                              ↓
                                  [Prompt Template + LLM]
                                              ↓
                                     Grounded Response
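The generation side is covered in detail below; the ingestion phase can be sketched roughly like this, assuming the same `IVectorStore` and `ITextEmbeddingGenerationService` abstractions used later in the article (the `IngestionService` class, the `DocumentChunk` record, and a pre-chunked input are illustrative, not a fixed API):

```csharp
// Sketch of the ingestion phase: embed each chunk and upsert it into
// the vector store collection. Names are illustrative.
public class IngestionService(
    IVectorStore vectorStore,
    ITextEmbeddingGenerationService embeddingService)
{
    public async Task IngestAsync(string documentTitle, IEnumerable<string> chunks)
    {
        var collection = vectorStore
            .GetCollection<string, DocumentChunk>("knowledge-base");
        await collection.CreateCollectionIfNotExistsAsync();

        foreach (var text in chunks)
        {
            var chunk = new DocumentChunk
            {
                Id = Guid.NewGuid().ToString(),
                DocumentTitle = documentTitle,
                Text = text,
                // Embed with the same model that will embed queries later —
                // mixing embedding models breaks similarity search.
                Embedding = await embeddingService.GenerateEmbeddingAsync(text),
            };
            await collection.UpsertAsync(chunk);
        }
    }
}
```

For large corpora you would batch the embedding calls rather than making one request per chunk, but the shape of the loop stays the same.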
Setting Up Semantic Kernel
```csharp
var builder = WebApplication.CreateBuilder(args);

// Register Semantic Kernel with Azure OpenAI chat + embedding services
builder.Services.AddKernel()
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-4o",
        endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
        apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!)
    .AddAzureOpenAITextEmbeddingGeneration(
        deploymentName: "text-embedding-3-large",
        endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
        apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!);

// Azure AI Search as the vector store
builder.Services.AddAzureAISearchVectorStore(
    new Uri(builder.Configuration["AzureSearch:Endpoint"]!),
    new Azure.AzureKeyCredential(
        builder.Configuration["AzureSearch:ApiKey"]!));

builder.Services.AddScoped<RagService>();

var app = builder.Build();

app.MapPost("/api/ask", async (AskRequest request, RagService ragService) =>
{
    var response = await ragService.AskAsync(request.Question);
    return Results.Ok(new { answer = response.Answer, sources = response.Sources });
});

app.Run();
```
Use text-embedding-3-large (3,072 dimensions) for the best retrieval quality, or text-embedding-3-small (1,536 dimensions) to cut embedding costs. Avoid text-embedding-ada-002 for new projects.
Retrieval and Response Generation
The core RAG service ties embedding, search, and generation together:
```csharp
public class RagService
{
    private readonly Kernel _kernel;
    private readonly IVectorStore _vectorStore;
    private readonly ITextEmbeddingGenerationService _embeddingService;

    public RagService(
        Kernel kernel,
        IVectorStore vectorStore,
        ITextEmbeddingGenerationService embeddingService)
    {
        _kernel = kernel;
        _vectorStore = vectorStore;
        _embeddingService = embeddingService;
    }

    public async Task<RagResponse> AskAsync(
        string question, int topK = 5, float minRelevance = 0.75f)
    {
        // Embed the question with the same model used at ingestion time
        var queryEmbedding = await _embeddingService
            .GenerateEmbeddingAsync(question);

        var collection = _vectorStore
            .GetCollection<string, DocumentChunk>("knowledge-base");

        var searchResults = await collection.VectorizedSearchAsync(
            queryEmbedding, new VectorSearchOptions { Top = topK });

        // Keep only chunks above the relevance threshold
        var relevantChunks = new List<DocumentChunk>();
        await foreach (var result in searchResults.Results)
        {
            if (result.Score >= minRelevance)
            {
                relevantChunks.Add(result.Record);
            }
        }

        if (relevantChunks.Count == 0)
        {
            return new RagResponse("Not enough information in the documentation.", []);
        }

        var context = string.Join("\n\n---\n\n",
            relevantChunks.Select(c => $"[Source: {c.DocumentTitle}]\n{c.Text}"));

        var prompt = $"""
            Answer based ONLY on the provided context. Cite sources.

            ## Context
            {context}

            ## Question
            {question}
            """;

        var chatService = _kernel.GetRequiredService<IChatCompletionService>();
        var chatHistory = new ChatHistory();
        chatHistory.AddUserMessage(prompt);

        var response = await chatService.GetChatMessageContentAsync(
            chatHistory,
            new AzureOpenAIPromptExecutionSettings
            {
                Temperature = 0.1,
                MaxTokens = 1024,
            });

        return new RagResponse(
            response.Content ?? "Unable to generate a response.",
            relevantChunks.Select(c => c.DocumentTitle).Distinct().ToList());
    }
}
```
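The service relies on a few supporting types. A plausible shape — not verbatim from a real project, with attribute names taken from the Semantic Kernel vector store abstractions and the vector dimension matching text-embedding-3-large — looks like this:

```csharp
// Supporting types assumed by RagService — an illustrative sketch.
public class DocumentChunk
{
    [VectorStoreRecordKey]
    public string Id { get; set; } = string.Empty;

    [VectorStoreRecordData]
    public string DocumentTitle { get; set; } = string.Empty;

    [VectorStoreRecordData]
    public string Text { get; set; } = string.Empty;

    [VectorStoreRecordVector(3072)] // text-embedding-3-large dimensions
    public ReadOnlyMemory<float> Embedding { get; set; }
}

public record AskRequest(string Question);
public record RagResponse(string Answer, IReadOnlyList<string> Sources);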
Improving Retrieval Quality
Pure vector search gets you 70% of the way. To reach 90%+, use hybrid search (vector + keyword) and query expansion — ask the LLM to rephrase the query into 2-3 alternative phrasings before searching. This single technique can improve answer accuracy by ~25%.
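Query expansion is a small amount of code. A sketch of a method you might add to `RagService` — the prompt wording and method name are illustrative, and you would embed and search with each returned query, then de-duplicate the merged chunks by `Id`:

```csharp
// Sketch of query expansion: ask the chat model for alternative
// phrasings of the question before searching.
public async Task<List<string>> ExpandQueryAsync(string question)
{
    var chatService = _kernel.GetRequiredService<IChatCompletionService>();
    var history = new ChatHistory();
    history.AddUserMessage($"""
        Rephrase the following question into 3 alternative search queries,
        one per line, with no numbering or commentary.

        Question: {question}
        """);

    var response = await chatService.GetChatMessageContentAsync(history);

    // Always keep the original question alongside the rephrasings
    var queries = new List<string> { question };
    queries.AddRange((response.Content ?? string.Empty).Split(
        '\n', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries));
    return queries;
}
```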
Chunking matters most: Start with 512-token paragraph-aware chunks with 50-token overlap. For code-heavy content, split on function/class boundaries instead.
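A minimal paragraph-aware chunker with overlap can look like the sketch below. Token counts are crudely approximated from word counts — swap in a real tokenizer (e.g. the Microsoft.ML.Tokenizers package) before relying on the 512/50 budgets:

```csharp
// Paragraph-aware chunking with token overlap — an illustrative sketch.
public static class Chunker
{
    // Rough heuristic: ~4 tokens per 3 words. Replace with a real tokenizer.
    private static int ApproxTokens(string text) =>
        text.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length * 4 / 3;

    public static List<string> Chunk(
        string document, int maxTokens = 512, int overlapTokens = 50)
    {
        var paragraphs = document.Split("\n\n", StringSplitOptions.RemoveEmptyEntries);
        var chunks = new List<string>();
        var current = new List<string>();
        var currentTokens = 0;

        foreach (var para in paragraphs)
        {
            var paraTokens = ApproxTokens(para);
            if (currentTokens + paraTokens > maxTokens && current.Count > 0)
            {
                chunks.Add(string.Join("\n\n", current));

                // Carry trailing paragraphs forward to cover the overlap budget
                var overlap = new List<string>();
                var overlapSoFar = 0;
                for (var i = current.Count - 1; i >= 0 && overlapSoFar < overlapTokens; i--)
                {
                    overlap.Insert(0, current[i]);
                    overlapSoFar += ApproxTokens(current[i]);
                }
                current = overlap;
                currentTokens = overlapSoFar;
            }
            current.Add(para);
            currentTokens += paraTokens;
        }

        if (current.Count > 0) chunks.Add(string.Join("\n\n", current));
        return chunks;
    }
}
```

Because chunks never split mid-paragraph, a single very long paragraph can exceed the budget; production chunkers add a sentence-level fallback for that case.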
Production tips:
- Keep context to 3-5 chunks (~2000-3000 tokens) — more context means more noise
- Cache embeddings with a 24-hour TTL to reduce costs
- Build an evaluation dataset early — you can't improve what you can't measure
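The embedding cache from the tips above can be as simple as a keyed dictionary with timestamps. A sketch (class and member names are illustrative; `IMemoryCache` from Microsoft.Extensions.Caching.Memory works equally well):

```csharp
// Sketch of a 24-hour embedding cache wrapped around the embedding service.
public class CachedEmbeddingService(ITextEmbeddingGenerationService inner)
{
    private readonly ConcurrentDictionary<
        string, (ReadOnlyMemory<float> Embedding, DateTimeOffset CachedAt)> _cache = new();

    private static readonly TimeSpan Ttl = TimeSpan.FromHours(24);

    public async Task<ReadOnlyMemory<float>> GenerateEmbeddingAsync(string text)
    {
        if (_cache.TryGetValue(text, out var entry) &&
            DateTimeOffset.UtcNow - entry.CachedAt < Ttl)
        {
            return entry.Embedding; // cache hit: skip the API call entirely
        }

        var embedding = await inner.GenerateEmbeddingAsync(text);
        _cache[text] = (embedding, DateTimeOffset.UtcNow);
        return embedding;
    }
}
```

An unbounded dictionary never evicts expired entries, so for long-running services prefer a cache with size limits and background eviction.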
Key Takeaways
- Chunking is the most important decision. Start with 512-token chunks and iterate based on retrieval quality.
- Use hybrid search from day one — pure vector search misses exact-match scenarios.
- Keep temperature low (0.1) for grounded, factual responses.
- Query expansion delivers outsized impact with minimal effort.
- Cache embeddings aggressively — the same queries hit your system repeatedly.
- Build evaluation infrastructure early — RAG quality is hard to judge manually.