Testing an LLM Chatbot in an MCP System

Key Takeaways

  • Determinism no longer applies: LLM chatbot testing shifts from exact-match assertions to probabilistic, semantic validation, where multiple correct answers can exist for the same input.
  • Architecture defines test complexity: MCP orchestration, RAG pipelines, tool calls, and streaming responses create multiple failure points, making root-cause analysis inherently multi-layered.
  • Validation must be multi-dimensional: Combining must-have, must-not, and semantic similarity checks is essential to balance flexibility with control and reduce hallucination risks.
  • Test results are context- and configuration-dependent: Model version, prompt design, inference settings, and conversation history all influence outcomes, requiring continuous tuning and iterative test refinement.

Introduction: The Paradigm Shift in Quality Assurance for LLM-Based Systems

Testing an LLM chatbot inside an MCP-based system differs from testing classical software. Traditional systems are deterministic: the same input produces the same output. In a typical REST API, a request either returns the expected JSON payload or it does not. Assertions are straightforward.

A chatbot built around a large language model behaves differently. Testing a Large Language Model (LLM) output requires a fundamental paradigm shift. The assumptions that have governed software testing for decades — determinism, exact reproducibility, and binary state validation — break down when confronted with generative AI.

To understand the complexity of testing these applications, we must first look at the underlying architecture, explore why traditional assertions fail, and examine the unique, context-dependent pitfalls and specifics.

System Architecture Overview

Under the hood, a modern LLM-based chatbot is a complex, multi-layered distributed system in which each component introduces new variables into the testing equation.

When a user submits a prompt, it travels through several critical server-side components before a response is generated. Initially, the input is often processed by an orchestrator or reasoning engine. In enterprise environments like ours, this is typically where the Model Context Protocol (MCP) comes into play. MCP allows the LLM to securely interact with external data sources and internal tools without hardcoding integrations. 

Simultaneously, the system employs a Retrieval-Augmented Generation (RAG) pattern. Before the LLM generates a response, the user’s query is embedded and sent to a vector database to retrieve semantically relevant context. This retrieved context, along with system instructions and chat history, is dynamically injected into a hidden meta-prompt. Only then is the payload sent to the inference engine (the model serving layer). Finally, the LLM generates tokens sequentially, which are streamed back to the client via a persistent connection, such as WebSockets using SignalR.
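For illustration, the prompt-assembly step can be thought of as a function that merges system instructions, retrieved context, and conversation history into a single payload. The sketch below is a simplified, hypothetical version of that step; the actual orchestrator logic, field names, and message format will differ.

```python
# Hypothetical sketch of the meta-prompt assembly step (names and structure are illustrative).
def build_meta_prompt(system_instructions: str,
                      retrieved_chunks: list[str],
                      chat_history: list[dict],
                      user_query: str) -> list[dict]:
    """Merge system instructions, RAG context, and chat history into one chat payload."""
    context_block = "\n\n".join(retrieved_chunks)  # context returned by the vector database
    return [
        {"role": "system", "content": f"{system_instructions}\n\nContext:\n{context_block}"},
        *chat_history,                              # previous user/assistant turns
        {"role": "user", "content": user_query},    # the current question
    ]
```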

Challenge

These architectural decisions directly impact testability. Testing the “chatbot” means simultaneously testing the retrieval mechanisms, the orchestration layer, and the generative model itself. Therefore, failures in such an environment rarely come from a single place. If the chatbot answers incorrectly, the cause may be:

  • retrieval returned irrelevant documents
  • the prompt was not optimized for the use case
  • the correct document was retrieved but the model ignored it
  • the model invented information not present in the context
  • the chatbot did not call a tool to trigger a specific action or retrieve the required data
  • the tool returned an error that was not propagated to the model
  • the context window truncated relevant information

The chatbot also operates inside a conversation. A response may depend on previous turns, retrieved documents, system prompts, and tool outputs. Testing a single prompt in isolation does not always reproduce the behavior seen in real conversations.

The business context adds pressure. In this system, the chatbot appears on a company website and answers questions from potential clients about the company’s experience and projects. If the bot invents projects or misunderstands a request, the damage goes beyond incorrect information. It can actively simulate successful lead handling, confirming that a contact request or submission has been sent to a sales team when in reality no downstream process has been triggered. The result is a broken conversion flow: the user believes a handoff to a human agent has occurred, while no lead is recorded, no notification is sent, and no follow-up ever happens!

Because of this, testing required a combination of traditional QA techniques and evaluation methods designed for LLM systems.

Fundamental differences in testing LLM output vs. deterministic systems

Classical software testing is built on determinism: given state *A* and input *B*, you expect the system to return output *C*. If it returns *D*, you report a bug.

LLMs are inherently probabilistic. They calculate a probability distribution over the next possible token in a sequence. Consequently, identical inputs can produce different outputs. This non-deterministic nature obliterates traditional regression testing workflows. If you write an exact-match assertion expecting the bot to say, “The application is a web-based SaaS platform,” and the bot instead replies, “The software is an online platform delivered via SaaS,” a deterministic test fails. 

This introduces the semantic correctness problem. An LLM’s output can be grammatically distinct, utilize different vocabulary, and be structured entirely differently, yet remain 100% factually accurate and valid. 

Because of this, traditional bug classification and reproducibility workflows break down. A QA engineer cannot easily attach a “steps to reproduce” ticket for an LLM hallucination, because following those exact steps five minutes later may yield a perfect response. 

Configuration-dependent nature of system output

Even when employing advanced semantic testing, QA teams must navigate a minefield of configuration-dependent variables that make test suites uniquely fragile.

First, test validity is tightly coupled to specific model versions. Different models have their own specifics. A test suite becomes a snapshot of expected behavior for a specific model at a specific time.

Second, inference settings like `Temperature` (which controls randomness) and `Top-P` (which controls vocabulary diversity) act as hidden test variables. A suite that is somewhat stable at Temperature 0.2 may become less deterministic at Temperature 0.7.
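Because these settings silently shape every assertion, it helps to pin them explicitly for the test environment rather than relying on service defaults. A minimal sketch, assuming parameter names along the lines of common inference APIs (the model identifier below is a placeholder):

```python
# Hypothetical inference settings pinned for the test run; the real service
# and parameter names may differ.
TEST_INFERENCE_CONFIG = {
    "model": "example-chat-model-v1",  # tests are only valid for this specific model version
    "temperature": 0.2,                # lower randomness -> more repeatable runs
    "top_p": 0.9,                      # limits vocabulary diversity
    "max_tokens": 512,                 # guards against runaway generations
}
```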

Furthermore, these tests are hyper-sensitive to system configuration. Small adjustments to the system prompt, even seemingly innocuous wording changes, can drastically alter the downstream outputs. 

This leads to a persistent challenge: distinguishing system regressions from expected variance. When a test fails, the team must determine if the system actually broke (e.g., the RAG database went offline) or if the model merely generated a statistically improbable, but acceptable, variation of the answer that the semantic evaluator wasn’t tuned to handle.

Finally, multi-turn conversations introduce severe state pollution. Because the model relies on conversation history, an imperfect answer in turn one can corrupt the LLM’s context window for turn three. Testing multi-turn flows requires isolating the state, carefully managing the conversational context, and continuously re-validating the entire suite as the system evolves.

Thus, a test captures a constrained observation window: a single slice of behavior produced by a given model version, decoding configuration, system prompt, input prompt, and retrieval and conversation state. It represents one trajectory through a much larger probabilistic space of possible outputs.

***

The following sections detail how we built a tool to meet these challenges head-on.

Functional Testing

First, we created a list of use cases. Functional testing started with the main user scenarios expected on the website.

Typical questions included:

  • experience in specific industries
  • technologies used for back-end or front-end development
  • examples of previous projects
  • rough project estimates
  • how to contact the sales team

Visitors usually ask about the company’s experience, technologies, and previous projects. Some conversations also lead to contact requests.

Later, this list was expanded into test cases. Each test case is structured like an ordinary one but has specifics inherent to AI-powered systems: sections describing what must appear in the response, what is acceptable, and what must never appear under any circumstances.

A user question such as “Have you built any healthcare platforms before?” should produce an answer based on portfolio data stored in the knowledge base: the answer should mention real projects if they exist and avoid inventing clients.

Here is the story of how we built a custom Python-based test harness designed for end-to-end validation of streaming chatbot responses.

The challenge: WebSockets and non-deterministic outputs

The chatbot streams tokens sequentially via SignalR over WebSockets. We couldn’t just fire off an HTTP POST and read the JSON response. Therefore, we created a modular Python framework broken down into a SignalR client, an evaluation engine, and a streamlined test runner.

The framework was designed around the separation-of-concerns principle: test data (JSON-based test cases and validation rules) is decoupled from the transport layer (SignalR/WebSockets), the interpretation logic (NLP analysis), and the execution runner.

Building the SignalR Client

The first step was establishing communication. Since our chatbot works via SignalR, we opted for the lightweight `websocket-client` library in Python rather than pulling in heavy browser automation tools like Playwright or Selenium, as our goal was to test the API/back-end logic directly (Integration/E2E level without the UI overhead).

SignalR has its own quirks. It requires a specific JSON handshake (`{"protocol": "json", "version": 1}`) and appends a very specific terminating character (`\x1e`) to the end of every payload.

Our client script establishes the WebSocket connection, manages the handshake, and enters a `while True` listening loop. Because the LLM streams its response in small chunks, the client parses incoming `ReceiveMessage` events, concatenating the text chunks until it receives an `isComplete: True` flag from the server, at which point it gracefully closes the socket and passes the complete string to our evaluator.
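A condensed sketch of that client is shown below. The handshake and the `\x1e` record separator are as described above; the hub URL, the outgoing method name, and the exact payload fields are assumptions and will vary per deployment.

```python
# Simplified SignalR-over-WebSocket client sketch (hub URL, method name, and
# payload field names are illustrative).
import json
import websocket  # pip install websocket-client

RECORD_SEPARATOR = "\x1e"  # SignalR appends this character to every frame

def ask_chatbot(hub_url: str, question: str) -> str:
    ws = websocket.create_connection(hub_url)
    # Protocol negotiation handshake
    ws.send(json.dumps({"protocol": "json", "version": 1}) + RECORD_SEPARATOR)
    ws.recv()  # handshake acknowledgement
    # Invoke the hub method that submits the user's prompt (method name assumed)
    ws.send(json.dumps({"type": 1, "target": "SendMessage",
                        "arguments": [question]}) + RECORD_SEPARATOR)

    chunks = []
    while True:
        for frame in ws.recv().split(RECORD_SEPARATOR):
            if not frame:
                continue
            message = json.loads(frame)
            if message.get("target") == "ReceiveMessage":
                payload = message["arguments"][0]
                chunks.append(payload.get("text", ""))
                if payload.get("isComplete"):
                    ws.close()
                    return "".join(chunks)  # full response goes to the evaluator
```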

The three-layered validation strategy

Once we had the full text string from the chatbot, we needed to decide if it was “correct”. We implemented a three-tiered quality gate:

The “must-have” check (with synonyms)

While LLMs vary their phrasing, there are often hard business requirements regarding what must be mentioned. Using a JSON-driven test data approach, we define `must_have` arrays. To prevent flakiness, we built a synonym engine.

For example, if the test requires the bot to mention the application is “web-based”, our test data maps “web-based” to `[“saas”, “online platform”, “web application”, “AJAX-based”]`. If the bot uses any of those terms, the assertion passes.
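A minimal sketch of this check, assuming a simple synonym map keyed by the canonical term (helper names are illustrative):

```python
# Illustrative "must-have" check with a synonym engine.
SYNONYMS = {
    "web-based": ["web-based", "saas", "online platform", "web application", "ajax-based"],
}

def must_have_passed(response: str, must_have: list[str]) -> bool:
    """Every required concept must appear, either literally or as one of its synonyms."""
    text = response.lower()
    for term in must_have:
        accepted_variants = SYNONYMS.get(term, [term])
        if not any(variant in text for variant in accepted_variants):
            return False  # the concept is missing in every accepted form
    return True
```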

The “must-not” check (hallucination prevention)

Equally important to what the bot says is what it should not say. AI models are prone to hallucination. If a user asks about a legacy accounting web app, the bot shouldn’t invent features. We feed the framework a `must_not` array containing terms like “mobile app”, “blockchain”, or “AI analytics”. If these are detected, the test immediately fails.
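The corresponding check is the mirror image of the previous one; a sketch under the same assumptions:

```python
# Illustrative "must-not" check: any forbidden term fails the test immediately.
def must_not_passed(response: str, must_not: list[str]) -> bool:
    text = response.lower()
    return not any(term.lower() in text for term in must_not)
```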

This mechanism forms a baseline validation layer. In most cases it produces stable and predictable results because it operates on explicit lexical constraints.

However, this stability is still superficial. For example, the absence of a term does not imply correctness. We had to run the test suite multiple times to expose flaky outputs, iteratively expanding the `must_have` set with additional terms until the results reached a level of reliability suitable for interpretation.

The weakest component in this setup is the `must_not` block itself. It assumes that undesired behavior can be exhaustively enumerated. In practice this is impossible.

Semantic similarity (the AI testing the AI)

We still need to keep in mind that even if all the keywords are present, the sentence as a whole could convey a completely wrong meaning.

To solve this, we integrated `sentence-transformers` backed by `torch` and `scikit-learn`. We load the `all-MiniLM-L6-v2` model — a fast, lightweight NLP model perfect for calculating sentence embeddings.

When a test runs, we take the bot’s generated response and a pre-defined `expected_answer` from our JSON test cases (taken directly from the data source). We convert both strings into high-dimensional vector embeddings and calculate the cosine similarity. If the similarity score drops below `0.70` (an empirical threshold set after several iterations of test execution), the test fails. This allows our chatbot to use completely different sentence structures and vocabulary, yet still pass the test as long as the fundamental semantic meaning remains intact.
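A sketch of this layer using the libraries named above (the function signature and threshold handling are illustrative):

```python
# Illustrative semantic-similarity gate built on sentence-transformers and scikit-learn.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # lightweight embedding model

def semantic_similarity_passed(response: str, expected_answer: str,
                               threshold: float = 0.70) -> bool:
    """Compare the generated answer with the expected one by cosine similarity of embeddings."""
    embeddings = model.encode([response, expected_answer])
    score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return score >= threshold
```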

We consider a test passed only when it passes all three layers.

Decoupling logic from data: The JSON test case structure

One of the most critical architectural decisions we made early on was to strictly separate the test execution logic from the test data and validation rules. Rather than hardcoding test scenarios into Python scripts, we externalized everything into a structured JSON file.

This created a pristine separation of concerns: the Python runner handles the how (transport and interpretation), while the JSON file defines the what (the inputs and the quality gates).

Each test case is a self-contained JSON object that acts as a comprehensive contract for a specific chat interaction.
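The exact schema is internal to the project, but a hypothetical test case along the lines described above illustrates the contract (all field names and values here are assumptions):

```json
{
  "id": "TC-012",
  "question": "Have you built any healthcare platforms before?",
  "must_have": ["healthcare", "web-based"],
  "synonyms": {
    "web-based": ["saas", "online platform", "web application"]
  },
  "must_not": ["mobile app", "blockchain", "AI analytics"],
  "expected_answer": "Yes, we have delivered web-based healthcare platforms for several clients.",
  "similarity_threshold": 0.70
}
```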

Pros and scalability of this approach:

  • Zero-code onboarding: The primary advantage is accessibility. Business analysts, product managers, or junior QA engineers can write, modify, and review test cases without needing to understand WebSockets, Python, or Sentence Transformers. They just update the JSON.
  • Infinite horizontal scalability: Because the runner iterates through a standard JSON array, scaling the test suite from 10 cases to 10,000 cases requires zero architectural changes to the underlying Python code.
  • Version control friendly: JSON files diff beautifully in Git. We can track exactly when a synonym was added or when an `expected_answer` was updated to reflect a new product feature.

Test runner and reporting

We built a custom CLI runner that parses the `test_cases.json` file and executes the suite.

To aid in debugging, we utilized `colorama` and regular expressions to strip out HTML tags and dynamically highlight detected keywords and synonyms in bright green directly in the terminal output. This allows QA engineers to visually verify why a test passed or failed at a glance.

Finally, execution metrics (Test ID, Pass/Fail status, and response duration in seconds) are continuously appended to a results log file, allowing us to track performance latency and regression metrics over time.
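The logging step itself is straightforward; a simplified sketch of how results could be appended (the file name and column layout are assumptions):

```python
# Illustrative result logging: one line per executed test case.
import csv
import time

def log_result(test_id: str, passed: bool, duration_seconds: float,
               log_path: str = "results_log.csv") -> None:
    with open(log_path, "a", newline="") as log_file:
        csv.writer(log_file).writerow([
            time.strftime("%Y-%m-%d %H:%M:%S"),  # execution timestamp
            test_id,
            "PASS" if passed else "FAIL",
            f"{duration_seconds:.2f}",           # response duration in seconds
        ])
```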

Results

Testing AI-powered systems requires thinking beyond traditional binary assertions.

Deploying this custom framework fundamentally transformed how our team approaches AI quality assurance. We moved away from the tedious manual testing that plagues many early-stage AI projects and replaced it with a more deterministic, data-driven pipeline.

By combining strict keyword validation with semantic evaluation, we achieved a safety net that is both flexible and rigorous. This is the foundation for gathering hard metrics, such as latency, similarity scores, and hallucination catch-rates.

What’s next? 

While the current architecture handles single-turn queries beautifully, the next frontier is stateful, multi-turn conversations. We can evolve the framework to work in long contextual states, evaluating how well the bot remembers facts established three or four messages prior. Furthermore, we are looking into integrating dynamic LLM-as-a-Judge mechanisms, where a secondary model acts as the final arbiter for chatbot responses.

The system can also be extended to load and concurrency testing. By parallelizing the test suite across multiple independent chat sessions, we can simulate real-world usage patterns and evaluate system behavior under concurrent requests. This enables measurement of performance characteristics such as response latency, throughput, and stability.

Testing AI requires discarding the comfort of absolute determinism. By building frameworks that are as intelligent and adaptable as the systems they evaluate, our QA can stop playing catch-up and start leading the charge in building reliable AI products.

Technologies used: Python, WebSockets, SignalR, PyTorch, Sentence Transformers (NLP), Scikit-learn, JSON, Regex.

Siarhei Nestsiarenka, Chief QA
