In a recent discussion, Ryan from SmartBear spoke with Fitz Nowlan, VP of AI and Architecture, about the evolving landscape of software development. As large language models (LLMs) and AI-driven agents become central to coding, traditional testing assumptions are upended. The challenge? Code generated by LLMs often arrives as a black box—its internal logic unknown, and its behavior non-deterministic. This has sparked a shift toward relying on data locality and construction rather than source code analysis. Below, we explore key questions about testing such systems, the pitfalls of non-determinism, and what new practices are emerging.
Why does traditional testing fail with AI-generated code?
Traditional testing relies on knowing the expected behavior of each unit, function, or module. With code written by LLMs, you often receive a complete solution without insight into its internal structure. The model may produce different outputs for the same input due to probabilistic sampling, breaking the deterministic foundation of unit and integration tests. Moreover, the code may be syntactically correct but semantically misaligned with the problem. Static analysis tools also lose leverage: they can flag structural issues, but they cannot tell you whether regenerated, probabilistically produced code behaves correctly on real inputs. As Fitz Nowlan noted, we must move away from the assumption that source code is the primary artifact to test; instead, we focus on what the code does with real data.
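
To make that concrete, here is a minimal sketch of testing against data rather than structure. The module generated_code and the function normalize_phone are hypothetical stand-ins for LLM-generated code; the point is that the assertions never look inside the implementation.

```python
# Minimal sketch: treat the generated function as a black box and test only
# its observable behavior against representative data. The module name
# `generated_code` and the function `normalize_phone` are hypothetical.
import pytest

from generated_code import normalize_phone  # produced by an LLM; internals unknown

# Representative input/expected-output pairs act as the specification.
CASES = [
    ("(555) 123-4567", "+15551234567"),
    ("555.123.4567", "+15551234567"),
    ("+1 555 123 4567", "+15551234567"),
]

@pytest.mark.parametrize("raw,expected", CASES)
def test_behavior_not_structure(raw, expected):
    # We assert only on what the code does with real data,
    # never on how it is implemented internally.
    assert normalize_phone(raw) == expected
```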

How does non-determinism affect testing MCP servers?
MCP (Model Context Protocol) servers act as intermediaries between AI agents and tools or data. The workflows they serve involve multiple LLM calls, each of which may return different results. This non-determinism means a test that passes once may fail on the next run without any code change. Traditional mock-based testing assumes you can predict and stub responses, but with LLMs, the range of valid outputs is vast. Instead, observability and logging become critical: you must capture actual execution traces and compare them against expected behavior patterns rather than exact values. Data locality—running tests close to where the data is produced—helps isolate variability caused by network latency or model updates. The trick is to design tests that check invariants (e.g., “response contains a valid JSON structure”) instead of exact matches.
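
As a rough illustration of invariant-based checking, the sketch below validates the shape of a response without pinning down its exact contents. The field names and status vocabulary are assumptions made up for this example, not part of the MCP specification.

```python
# Sketch of an invariant check for a non-deterministic, MCP-style response.
# Field names and the status vocabulary are illustrative assumptions.
import json

def check_response_invariants(raw_response: str) -> list[str]:
    """Return a list of violated invariants (an empty list means pass)."""
    violations = []
    # Invariant 1: the response must be valid JSON.
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    # Invariant 2: required fields are present, regardless of their exact values.
    for field in ("id", "status", "content"):
        if field not in payload:
            violations.append(f"missing field: {field}")
    # Invariant 3: status comes from a known vocabulary.
    if payload.get("status") not in {"ok", "error", "partial"}:
        violations.append(f"unexpected status: {payload.get('status')!r}")
    return violations

# Usage: the same test passes for many different, equally valid outputs.
assert check_response_invariants('{"id": "42", "status": "ok", "content": "..."}') == []
```
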
What role does data locality play in testing unknown code?
When you don’t know the source code, the next best thing is to test using representative data at the point of execution. Data locality means running tests in the same environment where the code will operate—same database state, same file system, same external service endpoints. This increases reproducibility because it reduces environmental drift. For AI-generated code, locality also enables you to capture realistic request-response pairs. By constructing a rich set of test data (both typical and edge cases), you can validate that the unknown code handles a wide range of scenarios. Fitz Nowlan emphasized that data construction—crafting purposeful datasets—has become more valuable than reverse-engineering the algorithm. The data becomes the specification.
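
One possible shape for this is sketched below: construct the dataset explicitly (typical plus edge cases) and replay it against the code in a local, production-like environment. handle_request, the SQLite file, and the case fields are all hypothetical.

```python
# Sketch: build a purposeful dataset (typical plus edge cases) and replay it
# against the code in the environment where it will actually run.
# `handle_request` and the SQLite path are hypothetical stand-ins.
import sqlite3

from generated_service import handle_request  # LLM-generated; internals unknown

def build_dataset():
    # Typical cases plus deliberately constructed edge cases.
    return [
        {"customer_id": 1, "amount": 19.99},       # typical
        {"customer_id": 1, "amount": 0},           # boundary: zero amount
        {"customer_id": 999999, "amount": 5.00},   # unknown customer
        {"customer_id": 1, "amount": -3.00},       # invalid: negative amount
    ]

def run_locally(db_path="local_test.db"):
    # Same database state and file system the code will see at runtime.
    conn = sqlite3.connect(db_path)
    try:
        # Captured request-response pairs become the record of behavior.
        return [(case, handle_request(conn, case)) for case in build_dataset()]
    finally:
        conn.close()
```
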
How can data construction replace source code analysis?
In the past, testers would examine source code to identify branches, loops, and error paths. With LLM-generated code, that information may be hidden or change each time the model is queried. Data construction shifts focus to the inputs and outputs: you create a dataset that covers all intended behaviors, including boundary values, invalid inputs, and typical use cases. Then you run the code against this dataset and measure whether outputs meet expectations (e.g., accuracy, format, latency). This approach works regardless of the internal code structure. It aligns with black-box testing but goes further by designing data that forces the model to reveal its hidden logic. For example, if you need to test a sorting algorithm, you generate lists with duplicates, reverse order, and single elements rather than inspecting the sorting routine itself.
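
The sorting example might look something like this sketch, where llm_sort is a hypothetical LLM-generated routine and the assertions check properties of the output (ordered, same elements) rather than anything about the implementation.

```python
# Sketch of data construction for an unknown sorting routine: we never inspect
# the implementation, only construct inputs that exercise its intended behavior.
# `llm_sort` is a hypothetical LLM-generated function.
from collections import Counter

from generated_code import llm_sort

constructed_inputs = [
    [],                      # empty list
    [7],                     # single element
    [3, 1, 2],               # typical case
    [5, 4, 3, 2, 1],         # reverse order
    [2, 2, 1, 2],            # duplicates
    [-1, 0, -5, 3],          # negatives mixed with positives
]

for data in constructed_inputs:
    result = llm_sort(list(data))
    # Output must be ordered...
    assert all(result[i] <= result[i + 1] for i in range(len(result) - 1))
    # ...and must be a permutation of the input (nothing lost, nothing invented).
    assert Counter(result) == Counter(data)
```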

What new assumptions about software development are emerging?
The old assumption was that developers understand and control every line of code. Now, with AI agents writing significant portions, we assume the code is a generated artifact that may be ephemeral. Consequently, testing shifts from verifying code correctness to validating system behavior at runtime. Another fading assumption is that deterministic tests are the only reliable kind; in their place, we accept probabilistic testing, where you define acceptable success rates (e.g., at least 95% of responses must be valid). Fitz Nowlan highlighted that the development lifecycle is becoming more iterative: generate, test, regenerate. This requires robust scaffolding for data locality and logging. Finally, the boundary between “development” and “operations” blurs as testing focuses on live system performance rather than pre-deployment checks.
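
A probabilistic acceptance test can be as simple as the sketch below, which assumes hypothetical call_agent and is_valid_response helpers and asserts a success rate rather than a fixed output.

```python
# Sketch of a probabilistic acceptance test: instead of demanding that every
# run produce the same output, we demand that at least 95% of runs satisfy a
# validity check. `call_agent` and `is_valid_response` are hypothetical helpers.
from generated_system import call_agent, is_valid_response

def test_success_rate(trials: int = 100, threshold: float = 0.95):
    successes = sum(
        1 for _ in range(trials)
        if is_valid_response(call_agent("summarize this ticket"))
    )
    rate = successes / trials
    assert rate >= threshold, f"only {rate:.0%} of responses were valid (need {threshold:.0%})"
```
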
How do LLM-driven agents break traditional testing?
LLM-driven agents interact with multiple tools and APIs, making decisions based on context and prior turns. Traditional testing assumes a linear execution path with predictable state changes. Agents introduce non-linear flows: they may decide to call one API in one run and a different one in another, depending on the model’s internal state. This breaks mock-based tests because the agent’s choices are not deterministic. Furthermore, agents often generate code on the fly (e.g., writing SQL queries), which cannot be tested statically. The solution is to test at the outcome level: did the agent achieve its goal? Did it do so within safety constraints? This requires scenario-based testing with realistic data, where pass/fail criteria are defined holistically rather than per-step.
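
An outcome-level scenario test might look like the following sketch; run_agent and the outcome fields are hypothetical, and the assertions express the goal and the safety constraints rather than individual steps or tool calls.

```python
# Sketch of outcome-level (scenario) testing for an agent: we do not assert on
# which tools it called or in what order, only that the goal was reached within
# safety constraints. All names here are hypothetical.
from agent_under_test import run_agent

def test_refund_scenario():
    scenario = {
        "goal": "refund order 1234",
        "context": {"order_id": "1234", "amount": 49.99},
    }
    outcome = run_agent(scenario)

    # Holistic pass/fail criteria rather than per-step expectations:
    assert outcome.goal_achieved                  # did the refund go through?
    assert outcome.total_refunded <= 49.99        # safety constraint: never over-refund
    assert not outcome.touched_unrelated_orders   # safety constraint: limited blast radius
```
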
What strategies exist for testing code with unknown internals?
Several practical strategies have emerged. First, black-box behavior testing using a rich dataset as outlined earlier. Second, contract testing—define interfaces (e.g., “input must be JSON, output must contain an ‘id’ field”) and verify them automatically. Third, chaos engineering for LLM systems: inject failures or out-of-distribution inputs to see how the code reacts. Fourth, observability-driven development—instrument the system with logs, metrics, and traces to capture behavior during production runs, then replay those traces in test environments. Fifth, use data locality to increase reproducibility. Finally, adopt A/B evaluation where you compare outputs from two different model versions or prompt strategies. These techniques shift focus from “is the code correct?” to “does the system work as expected for its users?”
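
As one illustration, the contract-testing strategy can be automated with a schema check along these lines; the schema shown is an invented example rather than a real API contract, and the jsonschema library is one convenient way to enforce it.

```python
# Sketch of automated contract testing: the interface, not the implementation,
# is the thing under test. The schema is an illustrative assumption.
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

OUTPUT_CONTRACT = {
    "type": "object",
    "required": ["id"],
    "properties": {"id": {"type": "string"}},
}

def meets_contract(raw_output: str) -> bool:
    """Output must be valid JSON and must contain a string 'id' field."""
    try:
        validate(instance=json.loads(raw_output), schema=OUTPUT_CONTRACT)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

# The same contract doubles as a scoring function for A/B evaluation: run two
# model versions or prompt strategies over one dataset and compare pass rates.
assert meets_contract('{"id": "abc-123", "extra": 42}') is True
assert meets_contract('{"name": "no id here"}') is False
```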