How We Built a Production-Grade RAG Engine for a Website AI Chatbot
Modern websites don’t suffer from a lack of information — on the contrary, they often suffer from information overload. Blogs, news, service pages, event announcements, case studies… everything is useful, everything is scaling, and everything is scattered across navigation paths that make little sense to a visitor who just wants one specific answer.
Key Takeaways
- Production-grade AI chatbots require a RAG architecture that retrieves the right content first instead of relying on oversized prompts or fine-tuned models.
- Clean content extraction, semantic chunking, and hybrid retrieval (dense + lexical reranking) are critical to answer quality and relevance.
- Local, modular RAG components provide stronger control over cost, privacy, and scalability than fully managed cloud ingestion pipelines.
- High-quality chatbot answers are an architectural outcome — driven by ingestion, retrieval, and ranking decisions, not by the LLM alone.
That’s exactly why we launched an internal RAG (Retrieval-Augmented Generation) project: to power a website chatbot that answers questions based on the real site content, reliably and privately, without pretending it “knows” things it has never seen.
Here is the practical story of how we built it, what didn’t work at first, and what finally made the answers noticeably sharper.
Client’s Business Request
The project started with a client request to design and implement a scalable AI platform.
Our ultimate objective was to create a system that would:
- let users register and access the platform;
- connect several different models;
- embed an AI pop-up chat on any website.
Within the MCP (Model Context Protocol) scope, the chatbot required two specific features:
- answer user questions with the help of website content (blogs, articles, news, event pages);
- send an email from the chat flow (so a visitor can ask questions, provide missing details, and trigger outreach without having to fill in a classic “Contact us” form).
Understanding the Real Problem
A website chatbot may look simple at first glance, at least until you try to make it accurate. The core constraints were straightforward:
- A general LLM doesn’t know the current state of the website by default.
- Large cloud-based language models raise concerns around cost, privacy, and dependency on external infrastructure, especially once the system moves beyond public web content to internal or sensitive knowledge.
- Many teams therefore choose smaller, local models. However, this shift exposes a new bottleneck: limited context windows. Feeding entire documents or pages into a single prompt no longer scales: the inputs get truncated, and irrelevant sections consume valuable context. So the “one huge prompt” approach becomes fragile and inefficient.
So instead of “teaching” the model everything at once, we chose the approach that works in real production systems: retrieve the right pieces of content first, then generate the answer from those pieces. That’s RAG.
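To make that retrieve-then-generate loop concrete, here is a minimal C# sketch. The interfaces and the Chunk record are illustrative placeholders rather than our actual service contracts, and the prompt template is only an example of grounding the answer in retrieved content.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Retrieve-then-generate sketch. All interfaces and the Chunk record are
// illustrative placeholders, not the project's actual service contracts.
public record Chunk(string Text, string SourceUrl);

public interface IEmbeddingModel { Task<float[]> EmbedAsync(string text); }
public interface IVectorStore { Task<IReadOnlyList<Chunk>> SearchAsync(float[] queryVector, int topK); }
public interface IChatModel { Task<string> CompleteAsync(string prompt); }

public class RagAnswerer
{
    private readonly IEmbeddingModel _embedder;
    private readonly IVectorStore _store;
    private readonly IChatModel _llm;

    public RagAnswerer(IEmbeddingModel embedder, IVectorStore store, IChatModel llm)
        => (_embedder, _store, _llm) = (embedder, store, llm);

    public async Task<string> AnswerAsync(string question)
    {
        // 1. Retrieve: embed the question and pull the most relevant chunks.
        var queryVector = await _embedder.EmbedAsync(question);
        var chunks = await _store.SearchAsync(queryVector, topK: 5);

        // 2. Generate: ask the model to answer strictly from the retrieved context.
        var context = string.Join("\n---\n", chunks.Select(c => c.Text));
        var prompt = "Answer the question using only the context below.\n" +
                     $"Context:\n{context}\n\nQuestion: {question}";
        return await _llm.CompleteAsync(prompt);
    }
}
```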
Exploring Possible Solutions
We looked at several possible directions before committing:
Option 1: “Just prompt it with the page”
Why it failed:
- Content pages carry noise unrelated to the user’s question (navigation, repeated sections, hidden elements, cookie notices);
- Context limitations mean the model discards content anyway;
- Answers may vary depending on what got cut off.
Option 2: Cloud ingestion + managed search (e.g., Azure)
Why we didn’t pick it for this stage:
- Cost grows fast once you index frequently and scale usage;
- Privacy and control become more complex in the long run, especially for internal extensions.
Option 3: Fine-tuning
Why it didn’t fit:
- Constant website updates mean constant re-training or drift;
- Fine-tuning doesn’t automatically explain where an answer came from;
- It is also computationally expensive and demands deep ML expertise to train, maintain, and debug models effectively.
Option 4: RAG with local components
Why we chose it:
- We keep total control over data and cost;
- We can prove the concept on public content first and then confidently extend to private knowledge later;
- We can continuously update the knowledge base through regular content re-indexing.
How We Ran the Process
To deliver an internal RAG engine that spans ingestion, retrieval, LLM orchestration, and website embedding, we assembled a cross-functional team:
- Project Manager (PM) — aligned stakeholders, defined iterations and milestones, and kept scope under control as requirements evolved;
- Architect — owned the target architecture, integration approach, and key technical decisions, such as security, scalability, and data flow;
- .NET Engineer — implemented the RAG services, retrieval pipeline, vector database integration, and MCP tools (search, actions);
- Frontend Engineer — built the chatbot UI and the user flows needed to embed it on the website and make it usable in real sessions;
- Two Java Engineers — supported Java-based backend services and integrations; developed the MCP platform in both .NET and Java, with a unified architecture and the Java implementation selected for production; built and maintained CI/CD, Kubernetes, and GitOps, and monitored platform stability.

This setup let us iterate quickly and improve the final answer quality, while MCP’s formalized communication protocol enabled a multi-language, heterogeneous team to collaborate without depending on a single skill set. We built the solution in iterations — the first one worked “technically,” but not “product-wise.”
Version 1: A quick Python prototype (worked, but messy)
We started with a script that:
- crawled pages;
- extracted content;
- generated JSON files with the extracted content;
- handed those files to a separate utility that created embeddings and stored them in the vector database.
What went wrong:
- We were embedding entire pages, raw HTML markup included, which produced a lot of irrelevant semantic noise;
- We also hit practical issues around stable text-to-embedding conversion and consistency.
Result: the chatbot could answer, but its responses often felt fuzzy, overly broad, or based on the wrong fragment.
Version 2: A microservice-based pipeline (cleaner, scalable)
Once our microservices team got involved, we redesigned the ingestion approach: instead of “walking” through cross-links, the service used site APIs where possible. The microservice:
- pulled content;
- cleaned it;
- split it into chunks;
- embedded those chunks;
- and pushed them into the vector database.
This alone improved relevance, because the model stopped “learning” navigation menus and repeated UI blocks.
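A condensed sketch of that ingestion flow is shown below. Every type name here is a placeholder standing in for the real microservice components, assuming a site API as the content source.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Condensed ingestion sketch; every type is a placeholder for the real service.
public record Page(string Url, string RawHtml);
public record ContentChunk(string Text, string SourceUrl);

public interface ISiteApiClient { Task<IReadOnlyList<Page>> GetPagesAsync(); }
public interface IContentCleaner { string Clean(string rawHtml); }
public interface IChunker { IReadOnlyList<string> Split(string cleanText); }
public interface IEmbedder { Task<float[]> EmbedAsync(string text); }
public interface IVectorWriter { Task UpsertAsync(ContentChunk chunk, float[] vector); }

public class IngestionService
{
    private readonly ISiteApiClient _site;
    private readonly IContentCleaner _cleaner;
    private readonly IChunker _chunker;
    private readonly IEmbedder _embedder;
    private readonly IVectorWriter _store;

    public IngestionService(ISiteApiClient site, IContentCleaner cleaner, IChunker chunker,
                            IEmbedder embedder, IVectorWriter store)
        => (_site, _cleaner, _chunker, _embedder, _store) = (site, cleaner, chunker, embedder, store);

    public async Task ReindexAsync()
    {
        // Pull content through the site APIs instead of crawling cross-links.
        foreach (var page in await _site.GetPagesAsync())
        {
            // Clean out navigation, repeated UI blocks, and other markup noise.
            var text = _cleaner.Clean(page.RawHtml);

            // Split into chunks, embed each one, and push it to the vector database
            // together with its source URL so every answer stays traceable.
            foreach (var chunkText in _chunker.Split(text))
            {
                var vector = await _embedder.EmbedAsync(chunkText);
                await _store.UpsertAsync(new ContentChunk(chunkText, page.Url), vector);
            }
        }
    }
}
```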
Designing the RAG Strategy
RAG success depends on two things:
- How you chunk content
- How you rank what you retrieved
Chunking experiments we tested
We explored multiple strategies, since manual, human-driven splitting is slow, hard to maintain with updates, and not a scalable approach.
What we explored included:
- Sentence-window chunking
- Semantic chunking (split/merge based on embedding similarity)
- Hybrid approaches that combined fixed windows and semantic merging
Libraries and components we used in this exploration:
- TextChunker (Microsoft.SemanticKernel.Text)
- drittich.SemanticSlicer
- SemanticChunker.NET
- Custom strategies like SemanticDoubleParseMergeStrategy and WindowChunkStrategy
WindowChunkStrategy (the idea):
- Take a 3-sentence window
- Shift by one sentence → next window
- Compare embeddings of neighboring windows
- Merge them if they’re semantically close
This helped keep meaning intact without creating giant blobs of text. In the end, we settled on SemanticSlicer.
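For illustration, here is a minimal sketch of the WindowChunkStrategy idea. The embedding delegate, the naive sentence splitter, and the 0.8 similarity threshold are assumptions for this example, not the production values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// WindowChunkStrategy sketch: embed 3-sentence sliding windows and start a new
// chunk wherever neighbouring windows stop being semantically close.
public class WindowChunkStrategy
{
    private readonly Func<string, Task<float[]>> _embed;   // placeholder embedding model
    private readonly double _mergeThreshold;               // assumed value for illustration

    public WindowChunkStrategy(Func<string, Task<float[]>> embed, double mergeThreshold = 0.8)
        => (_embed, _mergeThreshold) = (embed, mergeThreshold);

    public async Task<List<string>> ChunkAsync(string text)
    {
        // Naive sentence split; a production pipeline would use a proper tokenizer.
        var sentences = Regex.Split(text, @"(?<=[.!?])\s+").Where(s => s.Length > 0).ToList();
        if (sentences.Count <= 3) return new List<string> { string.Join(" ", sentences) };

        // Take a 3-sentence window, shift by one sentence, embed each window.
        var windowVectors = new List<float[]>();
        for (int i = 0; i + 3 <= sentences.Count; i++)
            windowVectors.Add(await _embed(string.Join(" ", sentences.Skip(i).Take(3))));

        // Similar neighbouring windows stay merged in one chunk; a drop in
        // similarity places a boundary before the newly entering sentence.
        var chunks = new List<string>();
        int chunkStart = 0;
        for (int i = 1; i < windowVectors.Count; i++)
        {
            if (CosineSimilarity(windowVectors[i - 1], windowVectors[i]) < _mergeThreshold)
            {
                chunks.Add(string.Join(" ", sentences.Skip(chunkStart).Take(i + 2 - chunkStart)));
                chunkStart = i + 2;
            }
        }
        if (chunkStart < sentences.Count)
            chunks.Add(string.Join(" ", sentences.Skip(chunkStart)));
        return chunks;
    }

    private static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB) + 1e-9);
    }
}
```

The threshold is the main tuning knob in this kind of strategy: a higher value produces many small chunks, a lower one merges more aggressively and risks giant blobs.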
Final retrieval flow (the part that changed everything)
Our first implementation did a basic vector search. It worked, but not as well as we needed. So we added two key steps: query enrichment and reranking. Here’s the simplified pipeline after the user asks a question in chat (example: “What expertise does your company have?”):
Before embedding the query, we send it to the LLM, which expands it semantically. Example of transformation: “the company’s expertise” → “the company’s expertise, successful projects, clients, competencies, industries.” We then run retrieval twice: a vector search for the original query and a vector search for the enriched query.
The system retrieves 15 results for the enriched query and 10 for the original one, then reranks the combined set and selects the top 5 chunks for response generation. The reranking favors chunks with strong keyword overlap and rare, distinctive terms. Because the platform is built as a set of modular microservices, all retrieval and ranking parameters are configurable on the fly to balance relevance, performance, and scale. The selected chunks become the grounded context for the LLM answer.
This is where responses noticeably improved: tighter answers, fewer irrelevant citations, and better alignment with how humans actually ask questions.
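A compact sketch of that two-pass flow is shown below. The interfaces, the enrichment prompt, and the deduplication step are illustrative; only the counts (10, 15, and the final top 5) mirror the description above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Two-pass retrieval sketch: enrich the query, run two vector searches,
// rerank the merged candidates, and keep the top 5 chunks.
public record RetrievedChunk(string Text, string SourceUrl);

public interface IChatModel { Task<string> CompleteAsync(string prompt); }
public interface IEmbeddingModel { Task<float[]> EmbedAsync(string text); }
public interface IVectorStore { Task<IReadOnlyList<RetrievedChunk>> SearchAsync(float[] vector, int topK); }
public interface IReranker { IReadOnlyList<RetrievedChunk> Rank(string query, IReadOnlyList<RetrievedChunk> candidates, int topK); }

public class TwoPassRetriever
{
    private readonly IChatModel _llm;
    private readonly IEmbeddingModel _embedder;
    private readonly IVectorStore _store;
    private readonly IReranker _reranker;

    public TwoPassRetriever(IChatModel llm, IEmbeddingModel embedder, IVectorStore store, IReranker reranker)
        => (_llm, _embedder, _store, _reranker) = (llm, embedder, store, reranker);

    public async Task<IReadOnlyList<RetrievedChunk>> RetrieveAsync(string userQuery)
    {
        // 1. Query enrichment: the LLM expands the question with related terms.
        var enrichedQuery = await _llm.CompleteAsync(
            $"Expand this search query with closely related terms, on one line: {userQuery}");

        // 2. Two vector searches: 10 results for the original query, 15 for the enriched one.
        var originalHits = await _store.SearchAsync(await _embedder.EmbedAsync(userQuery), topK: 10);
        var enrichedHits = await _store.SearchAsync(await _embedder.EmbedAsync(enrichedQuery), topK: 15);

        // 3. Merge and deduplicate the candidates, then rerank and keep the top 5
        //    chunks as the grounded context for answer generation.
        var candidates = originalHits.Concat(enrichedHits).DistinctBy(c => c.Text).ToList();
        return _reranker.Rank(userQuery, candidates, topK: 5);
    }
}
```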

Key Features Implemented
The following features form the core of the system, enabling reliable actions, accurate retrieval, and traceable knowledge grounding in real production use:
Within MCP, we implemented tools that the assistant can call programmatically: vector database search (semantic retrieval) and email sending (lead capture / follow-up).
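As an illustration of the tool-calling idea only (not the actual MCP SDK contracts), a hypothetical tool abstraction with the two implementations might look like this; the tool names and argument keys are assumptions for the sketch.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical tool abstraction; the real integration uses the MCP SDK's own contracts.
public interface IAssistantTool
{
    string Name { get; }
    Task<string> ExecuteAsync(IReadOnlyDictionary<string, string> arguments);
}

public class VectorSearchTool : IAssistantTool
{
    public string Name => "search_site_content";   // assumed tool name

    public Task<string> ExecuteAsync(IReadOnlyDictionary<string, string> arguments)
    {
        // In the real service this runs the two-pass retrieval described above
        // and returns the top chunks together with their source URLs.
        var query = arguments["query"];
        return Task.FromResult($"Top chunks for: {query}");
    }
}

public class SendEmailTool : IAssistantTool
{
    public string Name => "send_email";             // assumed tool name

    public Task<string> ExecuteAsync(IReadOnlyDictionary<string, string> arguments)
    {
        // In the real service this hands the message to the mail gateway
        // so a visitor can trigger outreach straight from the chat.
        return Task.FromResult($"Email queued for {arguments["to"]}");
    }
}
```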
Each chunk is traceable back to its source page. Re-indexing runs on a schedule (initially every few hours, up to twice per day, depending on the configuration).
We added two-pass retrieval (original + enriched query), BM25 reranking, and top-K selection to keep prompts small and relevant. In essence, this is how RAG works overall: its core task is to find the documents most relevant to a given query.
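As a sketch of that reranking step, here is a self-contained BM25 scorer with the textbook parameters k1 = 1.5 and b = 0.75; the tokenizer is deliberately naive, and the production reranker is configurable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal BM25 reranker sketch (k1 = 1.5, b = 0.75 are the common defaults).
public static class Bm25Reranker
{
    public static IReadOnlyList<string> Rank(string query, IReadOnlyList<string> chunks, int topK,
                                             double k1 = 1.5, double b = 0.75)
    {
        var docs = chunks.Select(Tokenize).ToList();
        double avgLength = docs.Average(d => d.Count);

        // Document frequency per term, used for the IDF component.
        var docFrequency = new Dictionary<string, int>();
        foreach (var doc in docs)
            foreach (var term in doc.Distinct())
                docFrequency[term] = docFrequency.TryGetValue(term, out var n) ? n + 1 : 1;

        var queryTerms = Tokenize(query).Distinct().ToList();

        double Score(List<string> doc)
        {
            double score = 0;
            foreach (var term in queryTerms)
            {
                int tf = doc.Count(t => t == term);
                if (tf == 0) continue;
                docFrequency.TryGetValue(term, out int df);
                // Rare, distinctive terms get a higher IDF weight.
                double idf = Math.Log(1 + (docs.Count - df + 0.5) / (df + 0.5));
                score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc.Count / avgLength));
            }
            return score;
        }

        return chunks.Zip(docs, (chunk, doc) => (chunk, score: Score(doc)))
                     .OrderByDescending(x => x.score)
                     .Take(topK)
                     .Select(x => x.chunk)
                     .ToList();
    }

    private static List<string> Tokenize(string text) =>
        text.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\n', '\r', '.', ',', ';', ':', '!', '?', '(', ')', '"' },
                   StringSplitOptions.RemoveEmptyEntries)
            .ToList();
}
```

Lexical scoring of this kind complements dense vectors: it rewards exact, rare keywords that embeddings sometimes blur together, which is exactly why the combined reranking tightened the answers.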
Results and Business Impact
Even at this stage, the impact is already clear:
What changed for users
- Visitors can ask questions in natural language instead of hunting through menus;
- The chatbot provides faster discovery across blogs, articles, and updates;
- The email tool enables a smoother lead flow: ask a question → provide missing details → send a message, without the friction of a traditional form.
What changed for the business
- Better content utilization: valuable pages are no longer “buried”;
- Scalable approach: we can extend the same architecture beyond public pages;
- Cost and privacy control: the system is built around retrieving only what’s needed, rather than pushing everything into external prompts.
What’s Next
This project is still evolving. The next steps are focused on reliability and scale:
- Automated evaluation for retrieval + grounded answers
- Improved multilingual handling (especially for names and short queries)
- Stronger hybrid retrieval (sparse + dense) to reduce “false friends” in vectors
- Finalizing infrastructure pieces (AI server and RAG collection setup)
- Revisiting multi-site coverage (the US site may run its own MCP server)
Summary
This project demonstrates how a production-grade RAG architecture can turn an AI chatbot from a surface-level interface into a reliable knowledge access layer for a growing website. By combining structured content ingestion, semantic chunking, hybrid retrieval, and controlled LLM orchestration, we built a system that delivers accurate, grounded answers based on real content — not assumptions or hallucinations.
The solution improves content discoverability for users and establishes a foundation that can be extended to internal knowledge bases, multi-language environments, and additional data sources. Most importantly, it proves that high-quality AI assistants are not created by prompts alone, but by deliberate architectural decisions across data, retrieval, and infrastructure.

Andrey Kopanev, Senior .NET Developer, AI Enthusiast





