How We Built a Production-Grade RAG Engine for a Website AI Chatbot
Modern websites don’t suffer from a lack of information — on the contrary, they often suffer from information overload. Blogs, news, service pages, event announcements, case studies… everything is useful, everything is scaling, and everything is scattered across navigation paths that make little sense to a visitor who just wants one specific answer.
Key Takeaways
- Production-grade AI chatbots require a RAG architecture that retrieves the right content first instead of relying on oversized prompts or fine-tuned models.
- Clean content extraction, semantic chunking, and hybrid retrieval (dense + lexical reranking) are critical to answer quality and relevance.
- Local, modular RAG components provide stronger control over cost, privacy, and scalability than fully managed cloud ingestion pipelines.
- High-quality chatbot answers are an architectural outcome — driven by ingestion, retrieval, and ranking decisions, not by the LLM alone.
That’s exactly why we launched an internal RAG (Retrieval-Augmented Generation) project: to power a website chatbot that answers questions based on the real site content, reliably and privately, without pretending it “knows” things it has never seen.
Here is the practical story of how we built it, what didn’t work at first, and what finally made the answers noticeably sharper.
Client’s Business Request
The project started with a client request to design and implement a scalable AI platform.
Our ultimate objective was to create a system that would:
- let users register and access the platform;
- connect several different models;
- embed an AI pop-up chat on any website.
Within the MCP (Model Context Protocol) scope, the chatbot required two specific features:
- answer user questions with the help of website content (blogs, articles, news, event pages);
- send an email from the chat flow (so a visitor can ask questions, provide missing details, and trigger outreach without having to fill in a classic “Contact us” form).
Understanding the Real Problem
A website chatbot may look simple at first glance, at least until you try to make it accurate. The core constraints were straightforward:
- A general LLM doesn’t know the current state of the website by default.
- Large cloud-based language models raise concerns around cost, privacy, and dependency on external infrastructure, especially once the system moves beyond public web content to internal or sensitive knowledge.
- Many teams therefore choose smaller, local models. However, this shift exposes a new bottleneck: limited context windows. Feeding entire documents or pages into a single prompt no longer scales: the inputs get truncated, and irrelevant sections consume valuable context. So the “one huge prompt” approach becomes fragile and inefficient.
So instead of “teaching” the model everything at once, we chose the approach that works in real production systems: retrieve the right pieces of content first, then generate the answer from those pieces. That’s RAG.
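To make that retrieve-then-generate loop concrete, here is a minimal C# sketch. The interfaces and the Chunk record are illustrative placeholders rather than our actual service contracts, and the prompt template is only an example of grounding the answer in retrieved content.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Retrieve-then-generate sketch. All interfaces and the Chunk record are
// illustrative placeholders, not the project's actual service contracts.
public record Chunk(string Text, string SourceUrl);

public interface IEmbeddingModel { Task<float[]> EmbedAsync(string text); }
public interface IVectorStore { Task<IReadOnlyList<Chunk>> SearchAsync(float[] queryVector, int topK); }
public interface IChatModel { Task<string> CompleteAsync(string prompt); }

public class RagAnswerer
{
    private readonly IEmbeddingModel _embedder;
    private readonly IVectorStore _store;
    private readonly IChatModel _llm;

    public RagAnswerer(IEmbeddingModel embedder, IVectorStore store, IChatModel llm)
        => (_embedder, _store, _llm) = (embedder, store, llm);

    public async Task<string> AnswerAsync(string question)
    {
        // 1. Retrieve: embed the question and pull the most relevant chunks.
        var queryVector = await _embedder.EmbedAsync(question);
        var chunks = await _store.SearchAsync(queryVector, topK: 5);

        // 2. Generate: ask the model to answer strictly from the retrieved context.
        var context = string.Join("\n---\n", chunks.Select(c => c.Text));
        var prompt = "Answer the question using only the context below.\n" +
                     $"Context:\n{context}\n\nQuestion: {question}";
        return await _llm.CompleteAsync(prompt);
    }
}
```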
Exploring Possible Solutions
We looked at several possible directions before committing:
Option 1: “Just prompt it with the page”
Why it failed:
- Content pages carry noise unrelated to the user’s question (navigation, repeated sections, hidden elements, cookie notices);
- Context limitations mean the model discards content anyway;
- Answers may vary depending on what got cut off.
Option 2: Cloud ingestion + managed search (e.g., Azure)
Why we didn’t pick it for this stage:
- Cost grows fast once you index frequently and scale usage;
- Privacy and control become more complex in the long run, especially for internal extensions.
Option 3: Fine-tuning
Why it didn’t fit:
- Constant website updates mean constant re-training or drift;
- Fine-tuning doesn’t automatically explain where an answer came from;
- It is also computationally expensive and demands deep ML expertise to train, maintain, and debug models effectively.
Option 4: RAG with local components
Why we chose it:
- We keep total control over data and cost;
- We can prove the concept on public content first and then confidently extend to private knowledge later;
- We can continuously update the knowledge base through regular content re-indexing.
How We Ran the Process
To deliver an internal RAG engine that spans ingestion, retrieval, LLM orchestration, and website embedding, we assembled a cross-functional team:
- Project Manager (PM) — aligned stakeholders, defined iterations and milestones, and kept scope under control as requirements evolved;
- Architect — owned the target architecture, integration approach, and key technical decisions, such as security, scalability, and data flow;
- .NET Engineer — implemented the RAG services, retrieval pipeline, vector database integration, and MCP tools (search, actions);
- Frontend Engineer — built the chatbot UI and the user flows needed to embed it on the website and make it usable in real sessions;
- Two Java Engineers — supported Java-based backend services and integrations; developed the MCP platform in both .NET and Java, with a unified architecture and the Java implementation selected for production; built and maintained CI/CD, Kubernetes, and GitOps, and monitored platform stability.

This setup let us iterate quickly and improve the final answer quality, while MCP’s formalized communication protocol enabled a multi-language, heterogeneous team to collaborate without depending on a single skill set. We built the solution in iterations — the first one worked “technically,” but not “product-wise.”
Version 1: A quick Python prototype (worked, but messy)
We started with a script that:
- crawled pages;
- extracted content;
- generated JSON files with the extracted content;
- handed those files to a separate utility that created embeddings and stored them in the vector database.
What went wrong:
- We were embedding entire pages, raw HTML markup included, which produced a lot of irrelevant semantic noise;
- We also hit practical issues around stable text-to-embedding conversion and consistency.
Result: the chatbot could answer, but its responses often felt fuzzy, overly broad, or based on the wrong fragment.
Version 2: A microservice-based pipeline (cleaner, scalable)
Once our microservices team got involved, we redesigned the ingestion approach: instead of “walking” through cross-links, the service used site APIs where possible. The microservice:
- pulled content;
- cleaned it;
- split it into chunks;
- embedded those chunks;
- and pushed them into the vector database.
This alone improved relevance, because the model stopped “learning” navigation menus and repeated UI blocks.
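A condensed sketch of that ingestion flow is shown below. Every type name here is a placeholder standing in for the real microservice components, assuming a site API as the content source.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Condensed ingestion sketch; every type is a placeholder for the real service.
public record Page(string Url, string RawHtml);
public record ContentChunk(string Text, string SourceUrl);

public interface ISiteApiClient { Task<IReadOnlyList<Page>> GetPagesAsync(); }
public interface IContentCleaner { string Clean(string rawHtml); }
public interface IChunker { IReadOnlyList<string> Split(string cleanText); }
public interface IEmbedder { Task<float[]> EmbedAsync(string text); }
public interface IVectorWriter { Task UpsertAsync(ContentChunk chunk, float[] vector); }

public class IngestionService
{
    private readonly ISiteApiClient _site;
    private readonly IContentCleaner _cleaner;
    private readonly IChunker _chunker;
    private readonly IEmbedder _embedder;
    private readonly IVectorWriter _store;

    public IngestionService(ISiteApiClient site, IContentCleaner cleaner, IChunker chunker,
                            IEmbedder embedder, IVectorWriter store)
        => (_site, _cleaner, _chunker, _embedder, _store) = (site, cleaner, chunker, embedder, store);

    public async Task ReindexAsync()
    {
        // Pull content through the site APIs instead of crawling cross-links.
        foreach (var page in await _site.GetPagesAsync())
        {
            // Clean out navigation, repeated UI blocks, and other markup noise.
            var text = _cleaner.Clean(page.RawHtml);

            // Split into chunks, embed each one, and push it to the vector database
            // together with its source URL so every answer stays traceable.
            foreach (var chunkText in _chunker.Split(text))
            {
                var vector = await _embedder.EmbedAsync(chunkText);
                await _store.UpsertAsync(new ContentChunk(chunkText, page.Url), vector);
            }
        }
    }
}
```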
Designing the RAG Strategy
RAG success depends on two things:
- How you chunk content
- How you rank what you retrieved
Chunking experiments we tested
We explored multiple strategies, since manual, human-driven splitting is slow, hard to maintain with updates, and not a scalable approach.
What we explored included:
- Sentence-window chunking
- Semantic chunking (split/merge based on embedding similarity)
- Hybrid approaches that combined fixed windows and semantic merging
Libraries and components we used in this exploration:
- TextChunker (Microsoft.SemanticKernel.Text)
- drittich.SemanticSlicer
- SemanticChunker.NET
- Custom strategies like SemanticDoubleParseMergeStrategy and WindowChunkStrategy
WindowChunkStrategy (the idea):
- Take a 3-sentence window
- Shift by one sentence → next window
- Compare embeddings of neighboring windows
- Merge them if they’re semantically close
This helped keep meaning intact without creating giant blobs of text. In the end, we settled on SemanticSlicer.
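For illustration, here is a minimal sketch of the WindowChunkStrategy idea. The embedding delegate, the naive sentence splitter, and the 0.8 similarity threshold are assumptions for this example, not the production values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// WindowChunkStrategy sketch: embed 3-sentence sliding windows and start a new
// chunk wherever neighbouring windows stop being semantically close.
public class WindowChunkStrategy
{
    private readonly Func<string, Task<float[]>> _embed;   // placeholder embedding model
    private readonly double _mergeThreshold;               // assumed value for illustration

    public WindowChunkStrategy(Func<string, Task<float[]>> embed, double mergeThreshold = 0.8)
        => (_embed, _mergeThreshold) = (embed, mergeThreshold);

    public async Task<List<string>> ChunkAsync(string text)
    {
        // Naive sentence split; a production pipeline would use a proper tokenizer.
        var sentences = Regex.Split(text, @"(?<=[.!?])\s+").Where(s => s.Length > 0).ToList();
        if (sentences.Count <= 3) return new List<string> { string.Join(" ", sentences) };

        // Take a 3-sentence window, shift by one sentence, embed each window.
        var windowVectors = new List<float[]>();
        for (int i = 0; i + 3 <= sentences.Count; i++)
            windowVectors.Add(await _embed(string.Join(" ", sentences.Skip(i).Take(3))));

        // Similar neighbouring windows stay merged in one chunk; a drop in
        // similarity places a boundary before the newly entering sentence.
        var chunks = new List<string>();
        int chunkStart = 0;
        for (int i = 1; i < windowVectors.Count; i++)
        {
            if (CosineSimilarity(windowVectors[i - 1], windowVectors[i]) < _mergeThreshold)
            {
                chunks.Add(string.Join(" ", sentences.Skip(chunkStart).Take(i + 2 - chunkStart)));
                chunkStart = i + 2;
            }
        }
        if (chunkStart < sentences.Count)
            chunks.Add(string.Join(" ", sentences.Skip(chunkStart)));
        return chunks;
    }

    private static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB) + 1e-9);
    }
}
```

The threshold is the main tuning knob in this kind of strategy: a higher value produces many small chunks, a lower one merges more aggressively and risks giant blobs.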
Final retrieval flow (the part that changed everything)
Our first implementation did a basic vector search. It worked, but not as well as we needed. So we added two key steps: query enrichment and reranking. Here’s the simplified pipeline after the user asks a question in chat (example: “What expertise does your company have?”):
Before embedding the query, we send it to the LLM, which expands it semantically. Example of transformation: “the company’s expertise” → “the company’s expertise, successful projects, clients, competencies, industries.” We then run retrieval twice: a vector search for the original query and a vector search for the enriched query.
The system retrieves 15 results for the enriched query and 10 for the original one, then reranks the combined set and selects the top 5 chunks for response generation. The reranking favors chunks with strong keyword overlap and rare, distinctive terms. Because the platform is built as a set of modular microservices, all retrieval and ranking parameters are configurable on the fly to balance relevance, performance, and scale. The selected chunks become the grounded context for the LLM answer.
This is where responses noticeably improved: tighter answers, fewer irrelevant citations, and better alignment with how humans actually ask questions.
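A compact sketch of that two-pass flow is shown below. The interfaces, the enrichment prompt, and the deduplication step are illustrative; only the counts (10, 15, and the final top 5) mirror the description above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Two-pass retrieval sketch: enrich the query, run two vector searches,
// rerank the merged candidates, and keep the top 5 chunks.
public record RetrievedChunk(string Text, string SourceUrl);

public interface IChatModel { Task<string> CompleteAsync(string prompt); }
public interface IEmbeddingModel { Task<float[]> EmbedAsync(string text); }
public interface IVectorStore { Task<IReadOnlyList<RetrievedChunk>> SearchAsync(float[] vector, int topK); }
public interface IReranker { IReadOnlyList<RetrievedChunk> Rank(string query, IReadOnlyList<RetrievedChunk> candidates, int topK); }

public class TwoPassRetriever
{
    private readonly IChatModel _llm;
    private readonly IEmbeddingModel _embedder;
    private readonly IVectorStore _store;
    private readonly IReranker _reranker;

    public TwoPassRetriever(IChatModel llm, IEmbeddingModel embedder, IVectorStore store, IReranker reranker)
        => (_llm, _embedder, _store, _reranker) = (llm, embedder, store, reranker);

    public async Task<IReadOnlyList<RetrievedChunk>> RetrieveAsync(string userQuery)
    {
        // 1. Query enrichment: the LLM expands the question with related terms.
        var enrichedQuery = await _llm.CompleteAsync(
            $"Expand this search query with closely related terms, on one line: {userQuery}");

        // 2. Two vector searches: 10 results for the original query, 15 for the enriched one.
        var originalHits = await _store.SearchAsync(await _embedder.EmbedAsync(userQuery), topK: 10);
        var enrichedHits = await _store.SearchAsync(await _embedder.EmbedAsync(enrichedQuery), topK: 15);

        // 3. Merge and deduplicate the candidates, then rerank and keep the top 5
        //    chunks as the grounded context for answer generation.
        var candidates = originalHits.Concat(enrichedHits).DistinctBy(c => c.Text).ToList();
        return _reranker.Rank(userQuery, candidates, topK: 5);
    }
}
```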

Key Features Implemented
The following features form the core of the system, enabling reliable actions, accurate retrieval, and traceable knowledge grounding in real production use:
Within MCP, we implemented tools that the assistant can call programmatically: vector database search (semantic retrieval) and email sending (lead capture / follow-up).
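As an illustration of the tool-calling idea only (not the actual MCP SDK contracts), a hypothetical tool abstraction with the two implementations might look like this; the tool names and argument keys are assumptions for the sketch.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical tool abstraction; the real integration uses the MCP SDK's own contracts.
public interface IAssistantTool
{
    string Name { get; }
    Task<string> ExecuteAsync(IReadOnlyDictionary<string, string> arguments);
}

public class VectorSearchTool : IAssistantTool
{
    public string Name => "search_site_content";   // assumed tool name

    public Task<string> ExecuteAsync(IReadOnlyDictionary<string, string> arguments)
    {
        // In the real service this runs the two-pass retrieval described above
        // and returns the top chunks together with their source URLs.
        var query = arguments["query"];
        return Task.FromResult($"Top chunks for: {query}");
    }
}

public class SendEmailTool : IAssistantTool
{
    public string Name => "send_email";             // assumed tool name

    public Task<string> ExecuteAsync(IReadOnlyDictionary<string, string> arguments)
    {
        // In the real service this hands the message to the mail gateway
        // so a visitor can trigger outreach straight from the chat.
        return Task.FromResult($"Email queued for {arguments["to"]}");
    }
}
```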
Each chunk is traceable back to its source page. Re-indexing runs on a schedule (initially every few hours, up to twice per day, depending on the configuration).
We added two-pass retrieval (original + enriched query), BM25 reranking, and top-K selection to keep prompts small and relevant. In essence, this is how RAG works overall: its core task is to find the documents most relevant to a given query.
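As a sketch of that reranking step, here is a self-contained BM25 scorer with the textbook parameters k1 = 1.5 and b = 0.75; the tokenizer is deliberately naive, and the production reranker is configurable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal BM25 reranker sketch (k1 = 1.5, b = 0.75 are the common defaults).
public static class Bm25Reranker
{
    public static IReadOnlyList<string> Rank(string query, IReadOnlyList<string> chunks, int topK,
                                             double k1 = 1.5, double b = 0.75)
    {
        var docs = chunks.Select(Tokenize).ToList();
        double avgLength = docs.Average(d => d.Count);

        // Document frequency per term, used for the IDF component.
        var docFrequency = new Dictionary<string, int>();
        foreach (var doc in docs)
            foreach (var term in doc.Distinct())
                docFrequency[term] = docFrequency.TryGetValue(term, out var n) ? n + 1 : 1;

        var queryTerms = Tokenize(query).Distinct().ToList();

        double Score(List<string> doc)
        {
            double score = 0;
            foreach (var term in queryTerms)
            {
                int tf = doc.Count(t => t == term);
                if (tf == 0) continue;
                docFrequency.TryGetValue(term, out int df);
                // Rare, distinctive terms get a higher IDF weight.
                double idf = Math.Log(1 + (docs.Count - df + 0.5) / (df + 0.5));
                score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc.Count / avgLength));
            }
            return score;
        }

        return chunks.Zip(docs, (chunk, doc) => (chunk, score: Score(doc)))
                     .OrderByDescending(x => x.score)
                     .Take(topK)
                     .Select(x => x.chunk)
                     .ToList();
    }

    private static List<string> Tokenize(string text) =>
        text.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\n', '\r', '.', ',', ';', ':', '!', '?', '(', ')', '"' },
                   StringSplitOptions.RemoveEmptyEntries)
            .ToList();
}
```

Lexical scoring of this kind complements dense vectors: it rewards exact, rare keywords that embeddings sometimes blur together, which is exactly why the combined reranking tightened the answers.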
Results and Business Impact
Even at this stage, the impact is already clear:
What changed for users
- Visitors can ask questions in natural language instead of hunting through menus;
- The chatbot provides faster discovery across blogs, articles, and updates;
- The email tool enables a smoother lead flow: ask a question → provide missing details → send a message, without the friction of a traditional form.
What changed for the business
- Better content utilization: valuable pages are no longer “buried”;
- Scalable approach: we can extend the same architecture beyond public pages;
- Cost and privacy control: the system is built around retrieving only what’s needed, rather than pushing everything into external prompts.
What’s Next
This project is still evolving. The next steps are focused on reliability and scale:
- Automated evaluation for retrieval + grounded answers
- Improved multilingual handling (especially for names and short queries)
- Stronger hybrid retrieval (sparse + dense) to reduce “false friends” in vectors
- Finalizing infrastructure pieces (AI server and RAG collection setup)
- Revisiting multi-site coverage (the US site may run its own MCP server)
Summary
This project demonstrates how a production-grade RAG architecture can turn an AI chatbot from a surface-level interface into a reliable knowledge access layer for a growing website. By combining structured content ingestion, semantic chunking, hybrid retrieval, and controlled LLM orchestration, we built a system that delivers accurate, grounded answers based on real content — not assumptions or hallucinations.
The solution improves content discoverability for users and establishes a foundation that can be extended to internal knowledge bases, multi-language environments, and additional data sources. Most importantly, it proves that high-quality AI assistants are not created by prompts alone, but by deliberate architectural decisions across data, retrieval, and infrastructure.

Andrey Kopanev, Senior .NET Developer, AI Enthusiast





